[06:00:00] 10serviceops, 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10ArielGlenn) @Krinkle, any chance you can give a link to a few of those errors with their stacktrace in log...
[06:48:39] good morning folks
[06:48:47] if you guys are ok I'd proceed with https://phabricator.wikimedia.org/T225642
[06:50:05] the idea is to:
[06:50:20] 1) merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/492948/ - this is a no-op, only to allow mcrouter to support the new config
[06:50:50] 2) merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520726/ to enable the new config on one (canary) appserver and one (canary) api-appserver
[06:51:24] the goal is to test the config and to eventually deploy it on all the canaries, and figure out what to do next
[06:55:05] sounds good to me!
[07:23:22] +1
[08:04:05] reverted the change on the first two mw hosts, https://phabricator.wikimedia.org/T225642#5336089
[08:04:15] the config is in my opinion not correct
[08:04:33] but in any case I want to be extra careful and wait for Performance
[08:18:51] tx luca
[08:23:57] I am still trying to wrap my head around the results since I am a bit confused about the replication that we have :D
[08:38:09] ah no ok now I get it, my tests were not correct
[08:38:19] lol
[08:38:25] why?
[08:38:31] the order of execution produced the results I expected in my mind, but that's not what is actually happening
[08:39:42] but the new config seems weird anyway
[08:46:59] ahhhh okok got it, the config seems to be doing what it is supposed to
[08:47:19] I had to re-read the whole conversation with Aaron in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/492948/
[08:47:37] I was puzzled by the NOT_STORED reply
[08:48:08] but the new AllAsyncRoute does exactly that
[08:48:24] "Immediately sends the same request to all child route handles. Does not wait for response.
Returns the default reply for the request right away as returned by NullRoute."
[08:49:09] but BagOStuff will not consider that an error
[08:50:27] since it assumes that mcrouter already works in async mode (while in fact it does not)
[08:50:41] ok PEBCAK + TIL, I think that I can re-revert
[08:58:05] jijiki: https://phabricator.wikimedia.org/T225642#5336204
[08:58:15] going to re-test again
[09:02:25] I am a bit confused
[09:02:30] I will read again later
[09:09:51] so the gist of it is that, if I got it correctly, we define an eqiad pool and a codfw pool with prefixes for mcrouter. If /eqiad/mw-wan/name-of-key is used, then we go to eqiad (default, see --route-prefix arg in mcrouter)
[09:10:28] if /codfw/mw-wan is used, we go to codfw for sets/deletes, the rest to eqiad (but probably we don't really use it for gets)
[09:10:46] mediawiki is aware of this and handles the replication
[09:12:14] up to now when replicating to codfw, mcrouter waits for a positive answer from the codfw shard
[09:12:42] that implies waiting for a round-trip time from codfw
[09:13:11] the patch that aaron introduced simply makes mcrouter return "NOT_STORED", as the null route would do
[09:13:17] immediately
[09:13:57] so mediawiki is freed from waiting, but behind the scenes mcrouter sends the set/delete to codfw anyway
[09:14:00] simply not caring about the result
[09:14:08] this is the current consistency model
[09:14:16] does it make more sense?
[09:15:39] jijiki: --^
[09:17:25] I was under the impression that mcrouter handled replication
[09:17:36] which is clearly wrong
[09:18:28] I had the same assumption, but from my tests it was wrong :)
[11:58:40] 10serviceops, 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review: Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10MoritzMuehlenhoff) 05Open→03Resolved Packages have been synched to thirdparty/ci for stretch-w...
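The routing scheme described in this discussion (default eqiad pool, with the /codfw/mw-wan prefix replicated asynchronously via AllAsyncRoute, which immediately returns NullRoute's default reply, hence the NOT_STORED seen for sets) could be sketched in mcrouter's JSON config roughly as below. This is an illustrative sketch, not the production config: pool membership, hostnames, and prefixes are hypothetical.

```json
{
  "pools": {
    "eqiad": { "servers": [ "mc1019.eqiad.wmnet:11211" ] },
    "codfw": { "servers": [ "mc2019.codfw.wmnet:11211" ] }
  },
  "routes": [
    {
      "aliases": [ "/eqiad/mw-wan/" ],
      "route": "PoolRoute|eqiad"
    },
    {
      "aliases": [ "/codfw/mw-wan/" ],
      "route": {
        "type": "OperationSelectorRoute",
        "default_policy": "PoolRoute|eqiad",
        "operation_policies": {
          "set":    { "type": "AllAsyncRoute", "children": [ "PoolRoute|codfw" ] },
          "delete": { "type": "AllAsyncRoute", "children": [ "PoolRoute|codfw" ] }
        }
      }
    }
  ]
}
```

With a config shaped like this, a set under /codfw/mw-wan/ is fired at codfw without waiting, and the caller immediately gets the NullRoute default reply (NOT_STORED for a set), which matches the behaviour puzzled over above.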
[14:13:46] while checking https://phabricator.wikimedia.org/T227142 I got a bit scared
[14:13:55] we have 5 mc1xxx hosts in the same rack?
[14:16:09] https://netbox.wikimedia.org/search/?q=mc10&obj_type=
[14:24:42] 10serviceops, 10Patch-For-Review: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fsero) 05Open→03Resolved uploaded a new image today (coredns) and rechecked like @fgiunchedi and it seems to be working \o/ so resolving this issue.
[14:26:08] elukey: that vaguely rings a bell, we might have an existing task to increase row redundancy for mc/eqiad even
[14:28:51] moritzm: ah didn't find it :( in theory we should survive a rack-down event, but I have no idea what impact it could have on databases/etc..
[14:29:00] since mcrouter will not fail over the traffic elsewhere
[14:29:17] but it will simply return a server error
[14:33:01] 10serviceops, 10Patch-For-Review: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 (10fsero) p:05Triage→03Normal
[14:33:38] 10serviceops, 10Patch-For-Review: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 (10fsero)
[15:10:04] fsero: If you happen to have a moment would you mind adding your thoughts about LVS vs envoy here? https://phabricator.wikimedia.org/T226814 :)
[15:10:56] sure tarrow i also discussed it internally and we might have another solution but i need some time to see how that can work :)
[15:11:30] fsero: ooh! awesome! Let me know if I can be of any help
[15:13:08] Also, I was wondering if operations/deployment-charts is the sort of place people generally self-merge?
I was sort of guessing that's what Alex meant by this: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/523124#message-5e420323412b4af41ba1b103c9548481f0aecaa6
[15:20:23] well i think Alex was stating that you don't need serviceops +2 or +1 as you have merge rights on that repo, which does not relate to self-merging
[15:20:52] this is a really small change so i guess this one is good to self-merge, but it is hard for me to advocate self-merging :)
[15:21:16] does that make sense?
[15:22:25] :P. Makes sense. To me it's not that small. I'd be happy to stick with getting another one of the wikidata team to merge it; should I ask about getting some of them added to the list?
[15:23:58] ah, it's any deployers
[15:24:01] cool
[15:24:02] you can always add more people as reviewers and ping them
[15:24:08] we don't need to whitelist anybody
[15:24:11] yep
[15:25:41] * tarrow is always happy to share out any blame :)
[15:26:10] 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) Last log paste before clearing the log Record: 4 Date/Time: 11/08/2018 00:18:01 Source: system Severity: Non-Critical Description: Correctable memory error rate...
[15:26:43] 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) I swapped all the DIMMs from side A to side B, cleared the log, and powered back up. Please put the server back in service and let's see if the reseating worked.
[15:27:23] 10serviceops, 10DC-Ops, 10Operations, 10ops-eqiad: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) 05Open→03Resolved I am resolving this ticket, please re-open and ping me if the problem returns.
[15:48:15] tarrow: there is no jenkins job configured on this repo yet
[15:48:30] maybe you can do it or ask #releng for help :)
[15:48:33] fsero: Thanks!
I was being slow and wondering :P
[15:49:37] I'll see about it. I guess you guys already have plans for linting etc. of the contents?
[16:17:44] Is it expected that my helmfile apply would fail right now? Are we still supposed to be using scap-helm?
[16:18:54] 10serviceops, 10Operations, 10Release Pipeline, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Jdforrester-WMF)
[16:32:30] I'm trying to debug my failed staging deployment: It's failing on ImagePullBackOff. If I try to manually pull the image (from my laptop) it fails partway with "filesystem layer verification failed for digest sha......"
[16:38:54] 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway)
[18:05:37] 10serviceops: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero)
[18:05:44] 10serviceops: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) p:05Triage→03Unbreak!
[18:33:13] tarrow: a new image has been built https://integration.wikimedia.org/ci/job/service-pipeline-test-and-publish/395/console
[18:33:16] you can try that
[18:44:49] 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10MusikAnimal) I should mention the open_nsfw fails for some images, e.g. ` $ curl -d 'url=https://...
[18:52:39] 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) @MusikAnimal Since I'm also working on this at the moment, I grabbed a stack trace for...
[19:01:06] fsero: re: https://wikitech.wikimedia.org/wiki/Deploying_a_service_in_kubernetes what does "Can i use kubectl?" mean in this context?
[19:01:19] fsero: like, being in the right sudoers group or something?
[19:02:06] 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) @MusikAnimal I noticed that both of the images you linked were full-size originals, so...
[19:04:32] that is a draft urandom
[19:04:56] it refers more to this section https://wikitech.wikimedia.org/w/index.php?title=Migrating_from_scap-helm#Advanced_use_cases:_using_kubeconfig
[19:05:05] fsero: yeah, we've (CPT) been tasked w/ something similar, and was thinking about just improving yours
[19:05:12] setting KUBECONFIG and expectations about kubectl
[19:05:37] like anything regarding getting info (get, describe, port-forward) is ok to be done using kubectl
[19:05:49] anything that changes the deployment or modifies something should be done in code through helmfile
[19:05:51] 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) As for what the underlying problem could be, one clue I notice is that both images are...
[19:06:39] so in this context, "can I use" just means, "know what to do?"
[19:15:19] IMO yep, it is setting expectations about using kubectl; I would not advise editing a pod or deployment using kubectl even if you have the auth power
[19:23:38] fsero: OK
[19:23:56] fsero: do you mind if we ask you questions and iterate on that list?
[19:24:08] nope but better be quick
[19:24:15] i have less than 30 days left
[19:24:36] kk :(
[19:47:44] 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) Heh, there's an easy fix for the truncated image error: ` from PIL import ImageFile...
[19:49:40] 10serviceops, 10Machine vision, 10Operations, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10MusikAnimal) >>! In T225664#5338710, Mholloway wrote: > Heh, there's an easy fix for the truncate...
[20:08:25] 10serviceops, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10jijiki)
[20:09:04] 10serviceops, 10Operations, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10jijiki)
[20:22:39] 10serviceops, 10Operations, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) list of affected images ` coredns dev/mediawiki dev/mediawiki-xdebug dev/restbase dev/stretch dev/stretch-php72 dev/...
[20:23:34] 10serviceops, 10Operations, 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10fsero) base images wikimedia-jessie and wikimedia-stretch and affected production images ` Successfully published image doc...
[20:29:57] 10serviceops, 10Operations, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10Jdforrester-WMF) ` [contint1001.wikimedia.org] out: == Step 0: scanning /etc/zuul...
[20:30:04] 10serviceops, 10Operations, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10Jdforrester-WMF)
[20:35:42] 10serviceops, 10Operations: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) a:05Dzahn→03None please see T215332
[21:39:28] 10serviceops, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (10bd808) https://integration.wikimedia.org/ci/job/labs-strike...