[06:00:00] <wikibugs>	 10serviceops, 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10ArielGlenn) @Krinkle, any chance you can give a link to a few of those errors with their stacktrace in log...
[06:48:39] <elukey>	 good morning folks
[06:48:47] <elukey>	 if you guys are ok I'd proceed with https://phabricator.wikimedia.org/T225642
[06:50:05] <elukey>	 the idea is to:
[06:50:20] <elukey>	 1) merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/492948/ - this is a no-op, only to allow mcrouter to support the new config
[06:50:50] <elukey>	 2) merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520726/ to enable the new config on one (canary) appserver and one (canary) api-appserver
[06:51:24] <elukey>	 the goal is to  test the config and to eventually deploy it on all the canaries, and figure out what to do next 
[06:55:05] <moritzm>	 sounds good to me!
[07:23:22] <effie>	 +1
[08:04:05] <elukey>	 reverted the changed on the first two mw hosts, https://phabricator.wikimedia.org/T225642#5336089
[08:04:15] <elukey>	 the config is in my opinion not correct
[08:04:33] <elukey>	 but in any case I want to be extra careful and wait for Performance
[08:18:51] <jijiki>	 tx luca
[08:23:57] <elukey>	 I am still trying to wrap my head around the results since I am a bit confused about the replication that we have :D
[08:38:09] <elukey>	 ah no ok now I get it, my tests were not correct
[08:38:19] <jijiki>	 lol
[08:38:25] <jijiki>	 why?
[08:38:31] <elukey>	 the order of execution leads to some results that I expected in my mind, but that's not it
[08:39:42] <elukey>	 but the new config seems weird anyway
[08:46:59] <elukey>	 ahhhh okok got it, the config seems doing what it is supposed ot
[08:47:19] <elukey>	 I had to re-read all the conversation with Aaron in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/492948/
[08:47:37] <elukey>	 I was puzzled by the NOT_STORED reply
[08:48:08] <elukey>	 but the new AllAsyncRoute does exactly like that
[08:48:24] <elukey>	 "Immediately sends the same request to all child route handles. Does not wait for response. Returns the default reply for the request right away as returned by NullRoute."
[08:49:09] <elukey>	 but BagOfStuff will not consider that as an error
[08:50:27] <elukey>	 since it assumes that mcrouter already works in async mode (meanwhile it is not)
[08:50:41] <elukey>	 ok PEBCAK + TIL, I think that I can re-revert
[08:58:05] <elukey>	 jijiki: https://phabricator.wikimedia.org/T225642#5336204
[08:58:15] <elukey>	 going to re-test again
[09:02:25] <jijiki>	 I am a bit confused
[09:02:30] <jijiki>	 I will read again later
[09:09:51] <elukey>	 so the gist of it is that, if I got it correctly, we define a eqiad pool and a codfw pool with prefixes for mcrouter. If /eqiad/mw-wan/name-of-key is used, then we go to eqiad (default, see --route-prefix arg in mcrouter)
[09:10:28] <elukey>	 if /codfw/mw-wan is used, we got to codfw for set/deletes, the rest to eqiad (but probably we don't really use it for gets)
[09:10:46] <elukey>	 mediawiki is aware of this and handles the replication 
[09:12:14] <elukey>	 up to now when replicating to codfw, mcrouter waits for a positive answer from the codfw shard
[09:12:42] <elukey>	 that implies waiting for a round trip time from codfw
[09:13:11] <elukey>	 the patch that aaron introduced simply makes mcrouter to return "NOT_STORED", as the null route would do
[09:13:17] <elukey>	 immediately
[09:13:57] <elukey>	 so mediawiki is freed from waiting, but mcrouter behind the scenes sends anyway the set/delete to codfw
[09:14:00] <elukey>	 simply not caring about the result
[09:14:08] <elukey>	 this is the current consistency model
[09:14:16] <elukey>	 does it make more sense?
[09:15:39] <elukey>	 jijiki: --^
[09:17:25] <jijiki>	 I was under the impression that mcrouter handled replication
[09:17:36] <jijiki>	 which is clearly wrong
[09:18:28] <elukey>	 I had the same assumption, but from my tests it was wrong :)
[11:58:40] <wikibugs>	 10serviceops, 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review: Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10MoritzMuehlenhoff) 05Open→03Resolved Packages have been synched to thirdparty/ci for stretch-w...
[14:13:46] <elukey>	 while checking https://phabricator.wikimedia.org/T227142 I got a bit scared
[14:13:55] <elukey>	 we have 5 mc1xxx hosts in the same rack?
[14:16:09] <elukey>	 https://netbox.wikimedia.org/search/?q=mc10&obj_type=
[14:24:42] <wikibugs>	 10serviceops, 10Patch-For-Review: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fsero) 05Open→03Resolved uploaded a new image today (coredns) and rechecked like @fgiunchedi and it seems to be working \o/ so resolving this issue.
[14:26:08] <moritzm>	 elukey: that vaguely rings a bell, we might have an existing task to increase row redundancy for mc/eqiad even
[14:28:51] <elukey>	 moritzm: ah didn't find it :( in theory we should survive to a rack down event, but I have no idea what impact we could get to databases/etc.. 
[14:29:00] <elukey>	 since mcrouter will not failover the traffic elsewhere
[14:29:17] <elukey>	 but it will simply return server error
[14:33:01] <wikibugs>	 10serviceops, 10Patch-For-Review: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 (10fsero) p:05Triage→03Normal
[14:33:38] <wikibugs>	 10serviceops, 10Patch-For-Review: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 (10fsero)
[15:10:04] <tarrow>	 fsero: If you happen to have a moment would you mind adding your thoughts about LVS vs envoy to here? https://phabricator.wikimedia.org/T226814 :)
[15:10:56] <fsero>	 sure tarrow i also discussed it internally and we might have another solution but i need some time to see how can that work :)
[15:11:30] <tarrow>	 fsero: ooh! awesome! Let me know if I can be of any help
[15:13:08] <tarrow>	 Also, I was wondering about if operations/deployment-charts is the sort of place people generally self-merge? I was sort of guessing that's what Alex meant by this: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/523124#message-5e420323412b4af41ba1b103c9548481f0aecaa6
[15:20:23] <fsero>	 well i think Alex was stating that you don't need serviceops +2 or +1 as you have merge rights on that repo, which it does not relate with self-merging
[15:20:52] <fsero>	 this is a really small change so i guess this one is good to self merge but is hard for me to advocate on self merging :)
[15:21:16] <fsero>	 does that makes sense?
[15:22:25] <tarrow>	 :P. Makes sense. To me it's not that small. I'd be happy to stick with getting another one of the wikidata team to merge it; should I ask about getting some of them added to the list?
[15:23:58] <tarrow>	 ah, it's any deployers
[15:24:01] <tarrow>	 cool
[15:24:02] <fsero>	 you can always add more people as reviewers and ping them 
[15:24:08] <fsero>	 we dont need to whiletist anybody
[15:24:11] <fsero>	 yep
[15:25:41] * tarrow is always happy to share out any blame :)
[15:26:10] <wikibugs>	 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) Last log paste before clearing the log   Record:      4 Date/Time:   11/08/2018 00:18:01 Source:      system Severity:    Non-Critical Description: Correctable memory error rate...
[15:26:43] <wikibugs>	 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) I swapped all the DIMM from side A to side B cleared the log and powered back up. Please put the server back in service and let's see if the reseating worked.
[15:27:23] <wikibugs>	 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) 05Open→03Resolved I am resolving this ticket, please re-open and ping me if the problem returns.
[15:48:15] <fsero>	 tarrow: there is no jenkins job configured on this repo yet
[15:48:30] <fsero>	 maybe you can do it or ask #releng for help :)
[15:48:33] <tarrow>	 fsero: Thanks! I was being slow and wondering :P
[15:49:37] <tarrow>	 I'll see about it. I guess you guys already have plans for linting etc.. the contents?
[16:17:44] <tarrow>	 Is it expected that my helmfile apply would fail right now? Are we still supposed to be using scap-helm?
[16:18:54] <wikibugs>	 10serviceops, 10Operations, 10Release Pipeline, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Jdforrester-WMF)
[16:32:30] <tarrow>	 I'm trying to debug my failed staging deployment: Its failing on ImagePullback off. If I try and manually pull the image (from my laptop) I get it failing part way with "filesystem layer verification failed for digest sha......"
[16:38:54] <wikibugs>	 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway)
[18:05:37] <wikibugs>	 10serviceops: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero)
[18:05:44] <wikibugs>	 10serviceops: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) p:05Triage→03Unbreak!
[18:33:13] <fsero>	 tarrow: a new image has been built https://integration.wikimedia.org/ci/job/service-pipeline-test-and-publish/395/console
[18:33:16] <fsero>	 you can try that
[18:44:49] <wikibugs>	 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10MusikAnimal) I should mention the open_nsfw fails for some images, e.g. ` $ curl -d 'url=https://...
[18:52:39] <wikibugs>	 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) @MusikAnimal Since I'm also working on this at the moment, I grabbed a stack trace for...
[19:01:06] <urandom>	 fsero: re: https://wikitech.wikimedia.org/wiki/Deploying_a_service_in_kubernetes what does "Can i use kubectl?" mean in this context?
[19:01:19] <urandom>	 fsero: like, being the right sudoers group or something?
[19:01:23] <urandom>	 *in
[19:02:06] <wikibugs>	 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) @MusikAnimal I noticed that both of the images you linked were full-size originals, so...
[19:04:32] <fsero>	 that is a draft urandom 
[19:04:56] <fsero>	 it refers more to this section https://wikitech.wikimedia.org/w/index.php?title=Migrating_from_scap-helm#Advanced_use_cases:_using_kubeconfig
[19:05:05] <urandom>	 fsero: yeah, we've (CPT) been tasked w/ something similar, and was thinking about just improving yours
[19:05:12] <fsero>	 setting KUBECONFIG and expectations about kubectl
[19:05:37] <fsero>	 like anything regarding getting info get, describe port-forward is ok to be done using kubectl
[19:05:49] <fsero>	 anything that changes the deployment or modify something should be done in code through helmfile
[19:05:51] <wikibugs>	 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) As for what the underlying problem could be, one clue I notice is that both images are...
[19:06:39] <urandom>	 so in this context, "can I use" just means, "know what to do?"
[19:15:19] <fsero>	 IMO yep, is setting expectations about using kubectl i would not advise to edit a pod or deployment using kubectl even having auth power
[19:23:38] <urandom>	 fsero: OK
[19:23:56] <urandom>	 fsero: do you mind if we ask you questions and iterate on that list?
[19:24:08] <fsero>	 nope but better be quick
[19:24:15] <fsero>	 i have less than 30 days left
[19:24:36] <urandom>	 kk :(
[19:47:44] <wikibugs>	 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) Heh, there's an easy fix for the truncated image error:   ` from PIL import ImageFile...
[19:49:40] <wikibugs>	 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10MusikAnimal) >>! In T225664#5338710, Mholloway wrote: > Heh, there's an easy fix for the truncate...
[20:08:25] <wikibugs>	 10serviceops, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10jijiki)
[20:09:04] <wikibugs>	 10serviceops, 10Operations, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10jijiki)
[20:22:39] <wikibugs>	 10serviceops, 10Operations, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) lisf of affected images   ` coredns dev/mediawiki dev/mediawiki-xdebug dev/restbase dev/stretch dev/stretch-php72 dev/...
[20:23:34] <wikibugs>	 10serviceops, 10Operations, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) base images wikimedia-jessie and wikimedia-stretch and affected production images   ` Successfully published image doc...
[20:29:57] <wikibugs>	 10serviceops, 10Operations, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10Jdforrester-WMF) ` [contint1001.wikimedia.org] out: == Step 0: scanning /etc/zuul...
[20:30:04] <wikibugs>	 10serviceops, 10Operations, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10Jdforrester-WMF)
[20:35:42] <wikibugs>	 10serviceops, 10Operations: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) a:05Dzahn→03None please see T215332
[21:39:28] <wikibugs>	 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10bd808) https://integration.wikimedia.org/ci/job/labs-strike...