[02:17:09] serviceops, Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (ssastry) I think something is broken with mysql on testreduce1001 at this point .. so, maybe just reboot the server and see if that fixes anything. If not, we can just wipe the db and recreate it. We'll ju...
[09:01:47] jayme: So, let's start pooling services? We can start with some of the low traffic ones. cxserver, citoid, recommendation-api
[09:03:09] akosiaris: yup! I'll do those three
[09:04:33] * jayme done
[09:05:48] <_joe_> I would also switch back restbase-async
[09:05:58] <_joe_> that will jumpstart work for changeprop
[09:09:15] okay. let's do that next when we see first traffic flowing in
[09:36:28] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO: Apache on doc1001 does not see updated PHP files for hours/days after deployment - https://phabricator.wikimedia.org/T275468 (hashar) Open→Resolved We are now restarting php-fpm on the doc hosts. That clears out...
[09:52:02] <_joe_> I have to convert our mcrouter templates from puppet to helm
[09:52:05] <_joe_> so much fun
[09:52:19] <_joe_> anyone wants to share the pleasure?
[11:08:52] akosiaris: so, while having (as expected) not that much of an impact traffic-wise, it looks okay. I would switch back restbase-async then as j.oe suggested
[11:27:28] <_joe_> stylistic question: the mediawiki chart is complex enough that I want to separate manifest pieces per-container, so have a templates/mcrouter directory
[11:27:34] <_joe_> containing all things mcrouter
[11:27:36] <_joe_> etc etc
[11:32:18] _joe_: I did the restbase-async switch but I don't see that reflected in the changeprop dashboard. Do you have an idea why?
[11:33:21] <_joe_> jayme: sorry, I clearly had a brainfart earlier
[11:33:33] <_joe_> it's changeprop that calls restbase-async, not vice-versa
[11:34:03] <_joe_> but restbase-async will generate kafka messages, so you should see activity in eventgate (the main one) once you repool it
[11:34:23] <_joe_> and /that/ will cause changeprop to bootstrap again
[11:34:51] _joe_: ah, okay. But it's fine to have restbase-async switched for now, right?
[11:35:10] so that I can go grab a bite and do eventgate after lunch
[11:36:19] <_joe_> yes ofc
[11:40:33] serviceops, Add-Link, Growth-Team (Current Sprint), Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (kostajh) >>! In T277297#6908552, @akosiaris wrote: > Bypassing the api-gateway (and the se...
[12:15:01] _joe_: what was the question part of your stylistic question? :)
[12:17:57] serviceops, Analytics-Radar, Cassandra, ContentTranslation, and 10 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (MSantos)
[12:27:54] <_joe_> jayme: it was implied - do you like that approach?
[12:28:44] <_joe_> basically instead of having a 1k+ lines deployment.yaml, etc, we'll have mcrouter/{configmap,deployment,...}.yaml.tpl
[12:28:56] _joe_: yeah. I think it's fine to not have dozens of files with prefixes in that one directory!
[12:29:19] oh, yeah. Way better than all in one deployment.yaml ofc!
[12:30:07] templates/mcrouter_{configmap,deployment,...}.yaml.tpl was the other option I thought of
[12:30:49] <_joe_> I like directories more :)
[12:31:02] serviceops, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (kostajh) a:kostajh
[12:31:16] yeah, totally fine :)
[14:00:44] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[14:13:00] could someone reboot testreduce1001 .. ref T277580.
[14:15:13] akosiaris: all services repooled in codfw
[14:15:21] subbu: I can do so in a sec
[14:15:50] ty
[14:24:28] subbu: still shutting down because of MariaDB hanging it seems
[14:25:35] yes.
[14:26:13] yes, it seems pretty broken ... it has been attempting innodb recovery for hours on end and never completes.
[14:26:36] unfortunately I have to run for an errand soon. The stop job will wait another ~7min or so
[14:26:43] k
[14:30:13] <_joe_> can we get rid of mcrouter?
[14:32:53] we can get rid of me, for an hour or so - ttyl o/
[14:37:32] <_joe_> ttyl!
[14:38:02] <_joe_> effie / rzl I don't remember if we use onhost memcached only for "local" routes or also for mw-wan ones
[14:39:48] strictly speaking neither, it's a new route
[14:40:01] but I don't remember offhand which use cases in MW call into it, effie probably does
[14:40:34] <_joe_> context is: I'm trying to bake the mcrouter stuff for mediawiki on k8s
[14:40:49] <_joe_> and I'm trying not to "just copy" what we have now
[14:46:31] nod - I'm catching up on some discussion that's happened since I looked at this last
[14:47:25] but it looks like the high-level answer is, no, nothing under mw-wan has an onhost tier
[14:59:35] _joe_ I thought you wanted to get rid of nutcracker
[14:59:41] It's now mcrouter too?
[15:00:34] <_joe_> that one was for functional reasons
[15:00:40] jayme: +1. Some cleanup left and we are good
[15:01:55] <_joe_> akosiaris: here I wanted to get rid of go text/template, but as an alternative I'm happy with ditching mcrouter
[15:11:04] * jayme back
[15:14:14] subbu|away: in case you haven't already noticed: testreduce1001 is back up
[15:14:44] akosiaris: what cleanup do you mean? Something not in the action items?
[15:17:43] jayme, yes. :)
[15:18:19] looks like it fixed mariadb as well.
[15:20:48] jayme: niah, stuff like killing the old master nodes, merging some cleanup patches etc
[15:21:02] but I think we got it all tracked (and if we don't have something, we'll figure it out)
[15:32:31] _joe_: I thought we would try the approach of keeping this out of k8s
[15:33:19] <_joe_> effie: no? we had the discussion the other time here, and we said that we don't want such a big node-wide thing that can fail and make all pods unusable
[15:34:38] I think we should discuss it a bit more, I wasn't aware we made a decision
[15:52:18] I will start a task
[16:10:55] so I'm trying to debug my shellbox chart right now: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/667047 and right now I'm stuck at actually hitting the shellbox service directly with e.g. curl
[16:11:08] I'm following https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial#values.example.yaml
[16:11:41] but step #4 just ends up with "Connection refused"
[16:13:23] <_joe_> legoktm: what does kubectl get services tell you?
[16:13:47] NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
[16:13:47] kubernetes         ClusterIP   10.96.0.1       <none>        443/TCP          27m
[16:13:47] shellbox-default   NodePort    10.106.141.15   <none>        8080:30524/TCP   13m
[16:14:17] just to make sure, does the service actually have to be ready before k8s will make it accessible?
I didn't disable the failing readiness check yet
[16:14:40] <_joe_> legoktm: yeah that means that your pod is not ready
[16:14:48] <_joe_> so the service won't send it traffic I think
[16:14:53] * legoktm comments out more
[16:16:05] aaand that was it
[16:16:11] <_joe_> yeah
[16:17:27] <_joe_> so re: liveness/readiness probe, I'd ask you to take a look at https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/672767
[16:17:35] <_joe_> it includes a way to do readiness/liveness checks
[16:18:12] <_joe_> also the recent versions of the php-fpm image have the FCGI_URL and FCGI_ALLOW env variables
[16:19:39] that makes it so we can do checks against fcgi directly instead of going through httpd?
[16:23:50] serviceops, Product-Infrastructure-Team-Backlog: Allow `push-notifications` service to accept production environment flag for APNS requests - https://phabricator.wikimedia.org/T274456 (jijiki) >>! In T274456#6901782, @Dmantena wrote: > @jijiki Sure thing! > > **We're looking to privately trigger push no...
[16:29:44] <_joe_> yes, also we should have separate readiness probes for both
[16:45:08] serviceops, MW-on-K8s, SRE, Patch-For-Review: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (JMeybohm) >>! In T276908#6916385, @Joe wrote: > After some more work, this is my ideas for liveness and readiness probes: > > # httpd: > -- liveness: tc...
[16:49:04] serviceops, MW-on-K8s, SRE, Patch-For-Review: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (Joe) >>! In T276908#6922060, @JMeybohm wrote: >>>! In T276908#6916385, @Joe wrote: >> After some more work, this is my ideas for liveness and readiness p...
[16:54:51] serviceops, Prod-Kubernetes, SRE, SRE-tools, User-jijiki: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm)
[16:55:08] serviceops, Prod-Kubernetes, SRE, SRE-tools: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (JMeybohm)
[16:59:03] serviceops, Prod-Kubernetes, SRE, SRE-tools: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm) p:Triage→Medium
[17:00:29] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[17:15:50] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson) a:Cmjohnson→RobH @robh mc1037 NIC cfg done (enabled pxe on 10G disabled on the 1GE), MAC BC:97:E1:E4:4B:30 mc1038 same an...
[17:16:16] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson)
[17:16:55] woohooo mc1037 and mc1038 are racked
[17:17:30] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson) @RobH after you set the 2 up can you assign this task back to @Jclark-ctr to finish the remainder please
[17:52:33] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2230.codfw.wmnet` - mw2230.codfw.wmnet (**PASS**) - Downti...
[18:09:16] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2231.codfw.wmnet` - mw2231.codfw.wmnet (**PASS**) - Downti...
[18:33:52] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2232.codfw.wmnet` - mw2232.codfw.wmnet (**PASS**) - Downti...
[19:01:07] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2233.codfw.wmnet` - mw2233.codfw.wmnet (**PASS**) - Downti...
[19:04:21] jayme: when you rebooted testreduce1001 was it from inside the VM or from ganeti level? I think the second option is required for it to detect new virtual hardware, so doing that. assuming subbu's mysql import was done
[19:16:46] duh.. that same thing happened like last time, VM comes back but is offline.. because the NIC has been renumbered
[19:17:28] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2234.codfw.wmnet` - mw2234.codfw.wmnet (**PASS**) - Downti...
[19:26:12] serviceops, Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (Dzahn) >>! In T277580#6921570, @Stashbot wrote: > rebooting restreduce1001 for T277580 The reboot needs to happen on Ganeti level, not from inside the VM. But once I did that the VM was not reachab...
[19:34:23] serviceops, Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (Dzahn) We now have a new `/dev/vdb` available. so.. fdisk /dev/vdb to make a new partition table, then a primary partition. Now we have `/dev/vdb1` and can `mkfs.ext4 /dev/vdb1` to put a filesystem on i...
[19:37:12] serviceops, Parsoid: Bump disk space on testreduce1001 - https://phabricator.wikimedia.org/T277580 (Dzahn) Open→Resolved a:Dzahn Rebooted a final time to confirm it stays mounted: ` [testreduce1001:~] $ df -h Filesystem Size Used Avail Use% Mounted on ... /dev/vda1 36G 32G 1...
[19:38:08] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2235.codfw.wmnet` - mw2235.codfw.wmnet (**PASS**) - Downti...
[19:52:03] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2236.codfw.wmnet` - mw2236.codfw.wmnet (**PASS**) - Downti...
[20:04:49] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn)
[20:05:19] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2237.codfw.wmnet` - mw2237.codfw.wmnet (**PASS**) - Downti...
[20:19:18] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2238.codfw.wmnet` - mw2238.codfw.wmnet (**PASS**) - Downti...
[20:23:17] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn)
[20:24:58] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn) @Papaul rack A3 is ready. All old hosts in it have been shut down. The remaining mw servers in A3 (mw2291 and up) are modern hardware from 2019 that...
[21:22:46] serviceops, MediaWiki-General, SRE, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Legoktm) Was this ever finished? This has come up again on https://en.wi...
[21:59:38] serviceops, SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (jijiki) p:Triage→Medium
[22:10:50] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['mc1037.eqiad.wmnet', 'mc1038.eqiad.w...
[22:21:17] serviceops, SRE: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Krinkle)
[22:34:58] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1037.eqiad.wmnet', 'mc1038.eqiad.wmnet'] ` and were **ALL** successful.
[23:03:54] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (RobH)
[23:05:27] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (RobH) a:RobH→Jclark-ctr #serviceops please be aware mc1037 and mc1038 are ready for your team to place into service. The rest are no...
[23:20:40] wheeee
[23:59:20] serviceops: mc1024 broke - replace it or remove it from configs - https://phabricator.wikimedia.org/T272078 (Jclark-ctr)