[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:13:22] legoktm: great [00:13:31] time now to start preparing for bullseye :D [00:13:50] Haha we still have Jessie hosts to get rid of! [00:14:35] ouch [00:15:07] can't you just say they were lost in an OVH datacenter and get rid of them? (: [00:17:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:18:55] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) Looks like we need a decom ticket for old hardware in A3. So we are removing everything that was procured in T134272 ? [00:20:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:21:30] oof [00:21:53] though most of the jessie nodes left are the super critical ones like mwlog, irc.wm.o, etc. :p [00:22:59] legoktm: i wonder how mwlog1001 migration will work like. especially if "no data lost" is a requirement [00:23:41] switch and then rsync the archives/ over? [00:23:56] and what about today's data? [00:26:36] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) [00:29:29] Urbanecm, is that the UDP log stuff? could maybe be mirrored somehow wither by app config or by some network magic? 
[00:29:33] either* [00:30:11] (03PS1) 10Mstyles: bump rdf-streaming-udpater chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/670625 [00:30:23] Krenair: good point, that can work, sth like a WRITE_BOTH constant [00:30:24] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) Wondering how long this would take from taking servers out of rotation until new servers are back in pool. It seems at least several days. Are we really ok with not be... [00:32:24] (03CR) 10Jeena Huneidi: [C: 03+2] bump rdf-streaming-udpater chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/670625 (owner: 10Mstyles) [00:33:03] (03Merged) 10jenkins-bot: bump rdf-streaming-udpater chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/670625 (owner: 10Mstyles) [00:40:56] legoktm: Urbanecm: Yeah, if they both listen to the same source (or get pushed the same data, I guess it's not on kafka yet?) 
then you just wait 24 hours for "today" to be in the past, then turn off the old pipeline and backfill the archives, ideally such that the copying process will complete before the next archiving is done on the destination [00:48:19] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2215.codfw.wmnet [00:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2216.codfw.wmnet [00:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:40] (03PS1) 10Bstorm: wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) [00:49:01] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2217.codfw.wmnet [00:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:08] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2218.codfw.wmnet [00:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:03] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2219.codfw.wmnet [00:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:18] (03CR) 10Bstorm: [C: 04-1] "I'm setting -1 as a reminder not to merge this. Please review and critique my first pass regardless. 
I have noticed that the original code" [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [00:54:21] (03PS2) 10Bstorm: wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) [00:54:35] (03CR) 10Bstorm: [C: 04-1] wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [00:57:47] 10SRE, 10Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (10Legoktm) You did skip over the easy option - declare a downtime for X hours, migrate everything over, and then bring it all back up on mailman3. Implementation wise it's the easiest but... [01:00:05] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T0100). [01:10:55] 10SRE, 10Domains, 10Okapi, 10Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (10RBrounley_WMF) @Reedy, no it's not a duplicate and that ticket should be deleted. @Aklapper, what's the best way to delete T269686? [01:11:44] 10SRE, 10Domains, 10Okapi, 10Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (10Reedy) >>! In T276585#6903159, @RBrounley_WMF wrote: > @Reedy, no it's not a duplicate and that ticket should be deleted. @Aklapper, what's the best way to delete T269686? You can't delete ticket... [01:12:25] 10SRE, 10DNS, 10Okapi, 10Traffic: Create three Okapi sub-domains (okapi*.wikimedia.org) - https://phabricator.wikimedia.org/T269686 (10RBrounley_WMF) 05Open→03Invalid [01:12:52] 10SRE, 10Domains, 10Okapi, 10Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (10RBrounley_WMF) Done. 
Thanks @Reedy [01:18:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:22:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:27:57] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [01:33:41] 10SRE, 10Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (10Ladsgroup) You don't need a list of mailing lists in puppet, the exim4 routing checks for existence of directories that exist under mailman3 for that specific mailing list, otherwise it... 
[01:54:39] (03CR) 10Ahmon Dancy: [C: 03+1] pipeline: Initial multiversion pipeline configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [02:22:22] (03PS1) 10Dzahn: site: replace mw2215,mw2216 with mw2377,mw2378 as codfw API canaries [puppet] - 10https://gerrit.wikimedia.org/r/670631 (https://phabricator.wikimedia.org/T274171) [04:17:37] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [04:18:11] PROBLEM - SSH on mw2236.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:12:07] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [05:12:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:13:13] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [05:19:25] RECOVERY - SSH on mw2236.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:19:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:23:49] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21657 bytes in 0.543 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [05:27:25] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 905 bytes in 5.905 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [05:29:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit-metrics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:30:44] * thcipriani looks at gerrit [05:31:05] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [05:31:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:32:13] thcipriani: is it affecting users? [05:32:28] I'm having trouble connecting via the webui [05:33:17] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21657 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [05:33:48] jvm metrics look fine on grafana: https://grafana.wikimedia.org/d/Bw2mQ3iWz/javamelody?orgId=1 [05:34:06] I'm in over ssh [05:34:18] !log restarted apache2 on gerrit1001 [05:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:36] ^ did that right before the recovery at 5:33 [05:34:53] ack [05:35:12] gerrit itself looks fine. I think we had this problem once previously. 
[05:35:40] where the jvm looked fine and kicking apache "fixed" it [05:36:35] do you think it merits a follow-up task? [05:37:06] yeah, digging through grafana for apache metrics for the host now [05:39:02] thanks :) I'm going afk but am pingable if needed [05:39:15] aaannd there's the problem https://grafana.wikimedia.org/d/L0-l1o0Mz/apache?viewPanel=7&orgId=1&var-host=gerrit1001&var-port=9117&from=1615433431969&to=1615441111969 [05:39:19] cool, thanks :) [05:40:46] someone took up all the workers? [05:42:40] dunno yet: it's possible [06:18:53] I had a patch uploaded right when that was starting, and the side-effect was that all of the jenkins tasks failed. Otherwise, no serious issues. [06:36:17] !log Drop testreduce from m5 - T276787 [06:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:26] T276787: Drop testreduce and testreduce_vd from m5 master - https://phabricator.wikimedia.org/T276787 [06:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1136', diff saved to https://phabricator.wikimedia.org/P14749 and previous config saved to /var/cache/conftool/dbconfig/20210311-063814-marostegui.json [06:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:25] (03PS1) 10Marostegui: db1154,db1155: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670636 [06:41:20] (03CR) 10Marostegui: [C: 03+2] db1154,db1155: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670636 (owner: 10Marostegui) [06:45:17] (03PS4) 10Waihorace: Enable RelatedArticles Extension in zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) [06:47:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:48:22] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2108 T275633', diff saved to https://phabricator.wikimedia.org/P14750 and previous config saved to /var/cache/conftool/dbconfig/20210311-064821-marostegui.json [06:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:29] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:48:52] !log Stop mysql on db2108 to clone db2148 T275633 [06:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:52:04] (03PS1) 10Marostegui: mariadb: Productionize db2148 [puppet] - 10https://gerrit.wikimedia.org/r/670637 (https://phabricator.wikimedia.org/T275633) [06:52:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 10%: Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P14752 and previous config saved to /var/cache/conftool/dbconfig/20210311-065230-root.json [06:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:53] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2148 [puppet] - 10https://gerrit.wikimedia.org/r/670637 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [07:06:11] thcipriani: it just spiked again [07:06:43] but didn't max out and recovered [07:07:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 30%: Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P14753 and previous config saved to /var/cache/conftool/dbconfig/20210311-070734-root.json [07:07:38] legoktm: timings on httpd's access log seem to be higher when it happens, and since it is basically an http proxy I bet that it is gerrit slowing 
down, leaving connections from httpd to it blocking workers [07:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:43] (also hi!) [07:08:56] morning :) yeah, that makes sense [07:22:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 60%: Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P14754 and previous config saved to /var/cache/conftool/dbconfig/20210311-072237-root.json [07:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P14755 and previous config saved to /var/cache/conftool/dbconfig/20210311-073741-root.json [07:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079', diff saved to https://phabricator.wikimedia.org/P14756 and previous config saved to /var/cache/conftool/dbconfig/20210311-074352-marostegui.json [07:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:51:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:04:23] (03PS1) 10Marostegui: instances.yaml: Add db2148 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/670765 (https://phabricator.wikimedia.org/T275633) [08:04:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:05:34] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2148 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/670765 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [08:06:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:08:50] (03CR) 10Filippo Giunchedi: [C: 03+1] httpd: enable httpd to emit ECS-compliant logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668231 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [08:09:39] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: add domainrw param and lookup [puppet] - 10https://gerrit.wikimedia.org/r/670567 (owner: 10Herron) [08:09:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2148 to s2 T275633', diff saved to https://phabricator.wikimedia.org/P14757 and previous config saved to /var/cache/conftool/dbconfig/20210311-080944-marostegui.json [08:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:52] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:10:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2108 T275633', diff saved to https://phabricator.wikimedia.org/P14758 and previous config saved to 
/var/cache/conftool/dbconfig/20210311-081010-marostegui.json [08:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:11:19] (03PS1) 10Marostegui: db2148: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670766 (https://phabricator.wikimedia.org/T275633) [08:13:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:13:24] (03CR) 10Marostegui: [C: 03+2] db2148: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670766 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [08:13:51] (03PS1) 10Hashar: ci: switch Jenkins to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/670767 (https://phabricator.wikimedia.org/T269354) [08:16:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1028.eqiad.wmnet [08:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:02] (03PS1) 10Hashar: releases: switch Jenkins to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/670768 (https://phabricator.wikimedia.org/T269354) [08:18:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 10%: Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P14759 and previous config saved to /var/cache/conftool/dbconfig/20210311-081801-root.json [08:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:39] (03CR) 10Muehlenhoff: releases: switch Jenkins to Java 11 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670768 
(https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [08:22:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1028.eqiad.wmnet [08:22:32] (03PS2) 10Hashar: releases: switch Jenkins to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/670768 (https://phabricator.wikimedia.org/T269354) [08:22:34] (03PS2) 10Hashar: ci: switch Jenkins to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/670767 (https://phabricator.wikimedia.org/T269354) [08:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:36] (03PS1) 10Marostegui: install_server: Do not reimage db2148 [puppet] - 10https://gerrit.wikimedia.org/r/670769 (https://phabricator.wikimedia.org/T275633) [08:23:25] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2148 [puppet] - 10https://gerrit.wikimedia.org/r/670769 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [08:24:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1029.eqiad.wmnet [08:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2074', diff saved to https://phabricator.wikimedia.org/P14760 and previous config saved to /var/cache/conftool/dbconfig/20210311-082445-marostegui.json [08:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2074', diff saved to https://phabricator.wikimedia.org/P14761 and previous config saved to /var/cache/conftool/dbconfig/20210311-082528-marostegui.json [08:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2109', diff saved to https://phabricator.wikimedia.org/P14762 and previous config saved to /var/cache/conftool/dbconfig/20210311-082546-marostegui.json [08:25:54] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:50] (03PS1) 10Marostegui: mariadb: Productionize db2149 [puppet] - 10https://gerrit.wikimedia.org/r/670771 (https://phabricator.wikimedia.org/T275633) [08:30:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1029.eqiad.wmnet [08:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2149 [puppet] - 10https://gerrit.wikimedia.org/r/670771 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [08:33:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 30%: Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P14764 and previous config saved to /var/cache/conftool/dbconfig/20210311-083305-root.json [08:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1030.eqiad.wmnet [08:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1030.eqiad.wmnet [08:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:38] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (10Gilles) 05Open→03Resolved a:03Gilles Thanks! 
[08:43:48] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1031.eqiad.wmnet [08:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:38] (03CR) 10Muehlenhoff: [C: 03+2] releases: switch Jenkins to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/670768 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [08:48:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 60%: Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P14765 and previous config saved to /var/cache/conftool/dbconfig/20210311-084809-root.json [08:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1031.eqiad.wmnet [08:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:23] (03PS5) 10Filippo Giunchedi: Run tests for alerts [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) [08:50:30] (03CR) 10Filippo Giunchedi: "Thanks for the reviews!" 
(036 comments) [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [08:51:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1032.eqiad.wmnet [08:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:26] (03PS1) 10Muehlenhoff: Fix path for Java path [puppet] - 10https://gerrit.wikimedia.org/r/670772 [08:56:56] (03CR) 10Hashar: [C: 03+1] Fix path for Java path [puppet] - 10https://gerrit.wikimedia.org/r/670772 (owner: 10Muehlenhoff) [08:57:20] (03CR) 10Muehlenhoff: [C: 03+2] Fix path for Java path [puppet] - 10https://gerrit.wikimedia.org/r/670772 (owner: 10Muehlenhoff) [08:58:02] (03PS3) 10Hashar: ci: switch Jenkins to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/670767 (https://phabricator.wikimedia.org/T269354) [08:58:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1032.eqiad.wmnet [08:58:28] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1033.eqiad.wmnet [08:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:34] (03CR) 10Hashar: "Amended to drop the /jre suffix." 
[puppet] - 10https://gerrit.wikimedia.org/r/670767 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [08:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/670504 (https://phabricator.wikimedia.org/T274262) (owner: 10Alexandros Kosiaris) [09:03:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1033.eqiad.wmnet [09:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 100%: Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P14766 and previous config saved to /var/cache/conftool/dbconfig/20210311-090312-root.json [09:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:22] (03Merged) 10jenkins-bot: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/670504 (https://phabricator.wikimedia.org/T274262) (owner: 10Alexandros Kosiaris) [09:03:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P14767 and previous config saved to /var/cache/conftool/dbconfig/20210311-090342-marostegui.json [09:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1035.eqiad.wmnet [09:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:28] (03PS1) 10Giuseppe Lavagetto: Fix salt for php monitoring [labs/private] - 10https://gerrit.wikimedia.org/r/670773 [09:07:49] (03PS1) 10Effie Mouzeli: hieradata: upgrade mwdebug1001 onhost memcached to 1.6x [puppet] - 10https://gerrit.wikimedia.org/r/670774 [09:08:22] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 
'recommendation-api' for release 'production' . [09:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:31] PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [09:11:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1035.eqiad.wmnet [09:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:25] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1036.eqiad.wmnet [09:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:19:03] !log upgrade memcached on mc1020, mc2020 [09:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1036.eqiad.wmnet [09:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:43] (03CR) 10Muehlenhoff: [C: 03+2] ci: switch Jenkins to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/670767 (https://phabricator.wikimedia.org/T269354) (owner: 10Hashar) [09:24:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1037.eqiad.wmnet [09:24:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P14768 and previous config saved to /var/cache/conftool/dbconfig/20210311-092457-root.json [09:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:29] We are going to upgrade Java on the Jenkins CI instance. It will be disabled for a few minutes; jobs will be queued in Zuul anyway [09:28:24] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Fix salt for php monitoring [labs/private] - 10https://gerrit.wikimedia.org/r/670773 (owner: 10Giuseppe Lavagetto) [09:29:43] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [09:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:05] !log Restarting CI Jenkins [09:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1037.eqiad.wmnet [09:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:13] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[09:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:45] (03PS1) 10Muehlenhoff: releases/ci: Remove now obsolete Java 8 packages [puppet] - 10https://gerrit.wikimedia.org/r/670776 (https://phabricator.wikimedia.org/T269354) [09:40:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 30%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P14769 and previous config saved to /var/cache/conftool/dbconfig/20210311-094000-root.json [09:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:35] (03PS2) 10Effie Mouzeli: hieradata: upgrade mwdebug1001 onhost memcached to 1.6x [puppet] - 10https://gerrit.wikimedia.org/r/670774 [09:43:15] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1038.eqiad.wmnet [09:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:30] !log Deploy schema change on s5 codfw master, lag will appear - T276150 T276156 [09:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:37] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 [09:45:38] T276156: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 [09:45:48] (03PS1) 10JMeybohm: Remove olykalinichenko SSH key, reused in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/670779 (https://phabricator.wikimedia.org/T275677) [09:48:21] (03CR) 10Muehlenhoff: [C: 03+1] Remove olykalinichenko SSH key, reused in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/670779 (https://phabricator.wikimedia.org/T275677) (owner: 10JMeybohm) [09:48:39] (03CR) 10JMeybohm: [C: 03+2] Remove olykalinichenko SSH key, reused in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/670779 (https://phabricator.wikimedia.org/T275677) (owner: 10JMeybohm) [09:51:02] !log jmm@cumin2001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ms-be1038.eqiad.wmnet [09:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10JMeybohm) a:05JMeybohm→03None >>! In T275677#6891447, @JMeybohm wrote: > @OlyKalinichenkoSpeedAndFunction I did remove y... [09:55:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 60%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P14770 and previous config saved to /var/cache/conftool/dbconfig/20210311-095504-root.json [09:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:08] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [10:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:23] 10SRE, 10Analytics, 10observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (10fgiunchedi) >>! In T276972, @Ottomata wrote: > Our multi DC kafka setup works like this: > - Producers prefix topics with their datacenter name, e.g. eqiad.mediawiki.... 
[10:02:35] (03CR) 10Volans: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [10:07:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2109', diff saved to https://phabricator.wikimedia.org/P14771 and previous config saved to /var/cache/conftool/dbconfig/20210311-100705-marostegui.json [10:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:22] (03PS1) 10Marostegui: instances.yaml: Add db2149 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/670781 (https://phabricator.wikimedia.org/T275633) [10:09:16] (03PS1) 10MSantos: Revert "Revert "mobileapps: Enable egress network policy"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670540 [10:09:30] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2149 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/670781 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [10:09:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1039.eqiad.wmnet [10:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P14772 and previous config saved to /var/cache/conftool/dbconfig/20210311-101008-root.json [10:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:29] (03CR) 10Kormat: "Hey. It would be good to see a PCC run." 
[puppet] - https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: Dave Pifke) [10:16:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2149 to dbctl, depooled, T275633', diff saved to https://phabricator.wikimedia.org/P14773 and previous config saved to /var/cache/conftool/dbconfig/20210311-101604-marostegui.json [10:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:14] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [10:16:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1039.eqiad.wmnet [10:16:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet [10:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P14774 and previous config saved to /var/cache/conftool/dbconfig/20210311-101714-marostegui.json [10:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:53] (CR) Filippo Giunchedi: "> Patch Set 5: Code-Review+1" [alerts] - https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi) [10:22:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1040.eqiad.wmnet [10:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1041.eqiad.wmnet [10:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:09] (CR) Majavah: "for the record, this is https://phabricator.wikimedia.org/T277069" [labs/private] -
https://gerrit.wikimedia.org/r/670773 (owner: Giuseppe Lavagetto) [10:28:01] (CR) Marostegui: [C: +1] mariadb: Use section params: remaining profiles. [puppet] - https://gerrit.wikimedia.org/r/669845 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [10:28:26] (CR) Kormat: [C: +2] mariadb: Use section params: remaining profiles. [puppet] - https://gerrit.wikimedia.org/r/669845 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [10:29:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1041.eqiad.wmnet [10:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1042.eqiad.wmnet [10:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1042.eqiad.wmnet [10:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1043.eqiad.wmnet [10:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:00] (CR) Gehel: [C: +1] "not tested either, but LGTM" [debs/prometheus-jmx-exporter] - https://gerrit.wikimedia.org/r/669894 (https://phabricator.wikimedia.org/T276595) (owner: Cwhite) [10:37:32] PROBLEM - MariaDB read only pc1 #page on pc1010 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.18-MariaDB-log, Uptime 873607s, event_scheduler: True, 1098.38 QPS, connection latency: 0.003498s, query latency: 0.000608s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:37:50] * volans|away here [10:37:53] expected?
[10:37:55] looking [10:38:02] * apergos peeks in [10:38:04] here too if needed [10:38:04] * akosiaris around [10:38:06] around [10:38:07] manueeeeeellll [10:38:10] oh. yes, it is. i'm absentminded [10:38:18] phew [10:38:20] heh [10:38:22] whew indeed [10:38:26] sorry folks [10:38:31] kormat, is it a passive pc? [10:38:33] it is not user impacting [10:38:36] ack [10:38:37] ah, ok [10:38:42] ACKNOWLEDGEMENT - MariaDB read only pc1 #page on pc1010 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.18-MariaDB-log, Uptime 873607s, event_scheduler: True, 1098.38 QPS, connection latency: 0.003498s, query latency: 0.000608s Kormat Forgot to fix. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:39:12] if "Forgot to fix" then the alert worked exactly as expected :-), alerting before user impact [10:39:13] it's my fault, but i'm blaming marostegui. just to be clear. [10:39:25] which is a good thing [10:39:49] kormat: I always knew you had a disaster recovery plan [10:39:53] sobanski: :D [10:39:58] RECOVERY - MariaDB read only pc1 #page on pc1010 is OK: Version 10.4.18-MariaDB-log, Uptime 873752s, read_only: True, event_scheduler: True, 1350.36 QPS, connection latency: 0.005898s, query latency: 0.000710s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:40:04] 😅 [10:40:17] kormat: https://usercontent.irccloud-cdn.com/file/g1PvPjck/incident.png [10:40:26] hahah [10:42:05] WTB magical feature where if a puppet change causes an alert to fire, icinga first politely pokes the user on irc giving them a chance to fix their mistake before everyone notices [10:42:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1043.eqiad.wmnet [10:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:31] godog: think we can get that on the annual plan for observability? 
;) [10:42:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14775 and previous config saved to /var/cache/conftool/dbconfig/20210311-104236-root.json [10:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:10] We can call it "see no evil, hear no evil" [10:44:27] (03Restored) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond) [10:44:36] (03PS11) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) [10:44:47] (03Restored) 10Jbond: service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/559537 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond) [10:45:06] (03CR) 10jerkins-bot: [V: 04-1] service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond) [10:45:15] kormat: haha! 
*writes note* [10:46:04] godog: 💜 [10:46:05] (Abandoned) Jbond: service_definitions: add defined ports [puppet] - https://gerrit.wikimedia.org/r/559537 (https://phabricator.wikimedia.org/T241160) (owner: Jbond) [10:48:44] (PS1) Hnowlan: scap: use refresh-config when doing a deploy-local [puppet] - https://gerrit.wikimedia.org/r/670784 [10:49:51] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet [10:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:28] (CR) Ayounsi: [C: +1] netbox: add NetboxServer class (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: Volans) [10:51:51] (CR) Gehel: [C: +1] "LGTM, minor comment inline that you should feel free to ignore" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: Hnowlan) [10:54:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet [10:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:16] (CR) Gehel: "minor comment inline, otherwise LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) (owner: Hnowlan) [10:54:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet [10:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:14] (PS1) Jbond: service_definitions: add defined ports [puppet] - https://gerrit.wikimedia.org/r/670788 (https://phabricator.wikimedia.org/T277146) [10:57:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 30%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14776 and previous config saved to
/var/cache/conftool/dbconfig/20210311-105740-root.json [10:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:22] (03CR) 10Volans: [C: 04-1] service_definitions: add defined ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670788 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [10:58:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet [10:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet [10:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1100). [11:00:15] 10SRE, 10netops, 10Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (10jbond) this came up before as a minor ask and i used it to create a PoC which used [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/559536 | custom types ]]. Ultimately i abandoned that as i fe... 
[11:02:50] (03CR) 10Vgutierrez: "DNS entries for this service are missing" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:04:12] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet [11:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:42] (03PS2) 10Jbond: service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/670788 (https://phabricator.wikimedia.org/T277146) [11:05:16] (03PS1) 10JMeybohm: linkrecommendation fix labels for pod spec in job spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/670791 [11:05:53] (03CR) 10Klausman: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:10:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Sigh, thanks. kubeyaml should have caught that though. Was it valid kubernetes yaml? It should be failing some schema." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/670791 (owner: 10JMeybohm) [11:11:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet [11:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:04] (03PS4) 10Klausman: service catalog: Add entry for ML Team k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) [11:12:06] (03CR) 10Klausman: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:12:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 60%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14777 and previous config saved to /var/cache/conftool/dbconfig/20210311-111243-root.json [11:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:30] (03PS5) 10Klausman: service catalog: Add entry for ML Team k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) [11:15:16] (03CR) 10JMeybohm: [C: 03+2] "> Patch Set 1: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670791 (owner: 10JMeybohm) [11:15:46] (03CR) 10Elukey: "We shouldn't add discovery bits, we don't need it for the kub" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:16:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet [11:16:26] (03Merged) 10jenkins-bot: linkrecommendation fix labels for pod spec in job spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/670791 (owner: 10JMeybohm) [11:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:39] (03PS6) 10Klausman: service catalog: Add entry for ML Team k8s control plane 
[puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) [11:17:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:18:02] (03CR) 10Klausman: service catalog: Add entry for ML Team k8s control plane (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:19:04] (03CR) 10Elukey: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:19:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:20:36] (03CR) 10Elukey: "Ah Tobias we missed the PTR records in https://gerrit.wikimedia.org/r/c/operations/dns/+/670440/1/templates/wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:21:04] (03PS1) 10Muehlenhoff: Remove obsolete varnish::setup_filesystem [puppet] - 10https://gerrit.wikimedia.org/r/670795 [11:21:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet [11:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:24] (03PS1) 10Klausman: Add DNS RR for ML team k8s control plane [dns] - 10https://gerrit.wikimedia.org/r/670797 [11:26:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:26:27] !log jmm@cumin2001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet [11:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:43] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet [11:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:56] (03CR) 10Klausman: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:27:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:27:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14778 and previous config saved to /var/cache/conftool/dbconfig/20210311-112747-root.json [11:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:15] (03PS2) 10Klausman: Add DNS RR for ML team k8s control plane [dns] - 10https://gerrit.wikimedia.org/r/670797 (https://phabricator.wikimedia.org/T272918) [11:30:23] (03CR) 10Elukey: [C: 03+1] Add DNS RR for ML team k8s control plane [dns] - 10https://gerrit.wikimedia.org/r/670797 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:30:57] (03CR) 10Klausman: [C: 03+2] Add DNS RR for ML team k8s control plane [dns] - 10https://gerrit.wikimedia.org/r/670797 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:30:59] (03CR) 10Whym: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670180 (https://phabricator.wikimedia.org/T253802) (owner: 10Whym) [11:31:02] (03CR) 10Klausman: [V: 03+2 C: 03+2] Add DNS RR for ML team k8s control plane [dns] - 10https://gerrit.wikimedia.org/r/670797 
(https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:31:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet [11:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:40] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [11:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:52] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet [11:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:05] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:13] (03CR) 10Lars Wirzenius: [C: 03+2] Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/670258 (owner: 10Ahmon Dancy) [11:35:29] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [11:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:37:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:37:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet [11:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:41] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet [11:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet [11:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet [11:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:28] (03CR) 10Elukey: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:49:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet [11:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet [11:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet [11:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:41] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet [11:53:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:47] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 3001975 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [11:55:40] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10Jdrewniak) @Techwizzie I've removed this task from the GSoC portals project (I probably shouldn't have added it there in the first place). The apache configuration is... [11:57:42] jynus: ^ is that something known? [11:57:52] I am fixing it [11:57:56] <3 [11:59:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet [12:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1200). [12:00:05] Lucas_WMDE and whym: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet [12:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:46] o/ [12:01:50] (03PS1) 10Majavah: betacluster: promote db07 as database master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670803 (https://phabricator.wikimedia.org/T277070) [12:02:03] whym: if you’re around, we can start with your config change [12:02:03] (03PS1) 10Majavah: betacluster: read only for db master switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670804 (https://phabricator.wikimedia.org/T277070) [12:02:11] RECOVERY - Disk space on backup2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [12:02:38] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/670566 is a comment and beta only, can I get that in as well? forgot to add it to the list [12:03:34] Lucas_WMDE: yes, please [12:03:43] Majavah: sure (please add it to the list anyways :) ) [12:03:51] ok looking at whym’s change firs [12:03:53] *first [12:03:58] Lucas_WMDE: sure, {{doing}} [12:04:23] ok, that one’s also only a comment change, nice ^^ [12:04:27] (03PS3) 10Lucas Werkmeister (WMDE): Fix obsolete comments on wgCheckUserLogLogins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670180 (https://phabricator.wikimedia.org/T253802) (owner: 10Whym) [12:04:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix obsolete comments on wgCheckUserLogLogins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670180 (https://phabricator.wikimedia.org/T253802) (owner: 10Whym) [12:05:26] (03CR) 10Elukey: [C: 03+1] "LGTM! Valentin, ok to merge from your side?" 
[puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [12:05:44] Lucas_WMDE: done, added to calendar [12:05:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1144:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P14781 and previous config saved to /var/cache/conftool/dbconfig/20210311-120554-marostegui.json [12:05:56] (03Merged) 10jenkins-bot: Fix obsolete comments on wgCheckUserLogLogins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670180 (https://phabricator.wikimedia.org/T253802) (owner: 10Whym) [12:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:19] (03PS2) 10Lucas Werkmeister (WMDE): Update comment for irc.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670566 (https://phabricator.wikimedia.org/T277081) (owner: 10Majavah) [12:07:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet [12:07:51] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:670180|Fix obsolete comments on wgCheckUserLogLogins (T253802)]] (duration: 01m 08s) [12:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Update comment for irc.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670566 (https://phabricator.wikimedia.org/T277081) (owner: 10Majavah) [12:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:03] T253802: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 [12:08:18] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet [12:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:57] (03Merged) 10jenkins-bot: Update comment for 
irc.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670566 (https://phabricator.wikimedia.org/T277081) (owner: 10Majavah) [12:10:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/LabsServices.php: Config: [[gerrit:670566|Update comment for irc.beta.wmflabs.org (T277081)]] (comment-only beta-only change) (duration: 01m 13s) [12:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:04] T277081: Replace deployment-ircd with a Buster host - https://phabricator.wikimedia.org/T277081 [12:11:09] alright, I think that’s all the patches done [12:11:20] moving on to my maintenance script now [12:12:41] !log start of lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds 581768,739279,774383,852302 [12:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:51] !log finished in 1.124s real time [12:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:05] dammit, forgot to attach the task number [12:13:44] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds 581768,739279,774383,852302 # T270249, finished in 1.124s [12:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:50] T270249: Run maintenance script to remove deleted items from term store on production - https://phabricator.wikimedia.org/T270249 [12:14:16] 10SRE, 10Packaging, 10User-Majavah: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10hashar) [12:14:45] (03CR) 10Vgutierrez: "yes, looking good. 
I'm wondering if IdleConnection is the best we can do in terms of monitoring the service from pybal" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [12:15:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 10%: Repool db1144:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14782 and previous config saved to /var/cache/conftool/dbconfig/20210311-121552-root.json [12:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:45] !log EU backport&config window done [12:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet [12:17:40] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: add gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/670424 (https://phabricator.wikimedia.org/T276673) (owner: 10Jbond) [12:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet [12:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:38] (03CR) 10Elukey: [C: 03+1] "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [12:19:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:19:51] (03PS1) 10JMeybohm: Fix Rakefile to run lint and validate_template again for all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/670806 [12:21:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:22:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet [12:22:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet [12:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:44] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet [12:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet [12:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 30%: Repool db1144:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14783 and previous config saved to /var/cache/conftool/dbconfig/20210311-123056-root.json [12:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet [12:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:14] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet [12:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:41] jouncebot: next [12:38:41] In 0 hour(s) and 51 minute(s): beta: Master DB switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1330) [12:39:20] !log imported cassandra_2.2.6-wmf1 to buster-wikimedia [12:39:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:06] godog: XioNoX: [12:40:10] sorry [12:41:37] (03PS1) 10Jbond: P:base: manage /etc/services file [puppet] - 10https://gerrit.wikimedia.org/r/670810 (https://phabricator.wikimedia.org/T277146) [12:42:22] lolz, np jbond42 [12:46:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 60%: Repool db1144:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14785 and previous config saved to /var/cache/conftool/dbconfig/20210311-124559-root.json [12:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:10] (03PS2) 10Jbond: P:base: manage /etc/services file [puppet] - 10https://gerrit.wikimedia.org/r/670810 (https://phabricator.wikimedia.org/T277146) [12:46:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet [12:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:10] (03CR) 10Jbond: P:base: manage /etc/services file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670810 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [12:49:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet [12:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:49] (03PS1) 10Hnowlan: aptrepo: add cassandra311 component [puppet] - 10https://gerrit.wikimedia.org/r/670813 (https://phabricator.wikimedia.org/T274119) [12:52:02] (03PS1) 10Marostegui: db2149: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670814 (https://phabricator.wikimedia.org/T275633) [12:52:44] (03CR) 10Marostegui: [C: 03+2] db2149: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670814 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [12:53:35] (03PS7) 10Klausman: service catalog: Add entry for ML Team k8s control plane [puppet] - 
10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) [12:54:10] (03CR) 10Klausman: [C: 03+2] service catalog: Add entry for ML Team k8s control plane [puppet] - 10https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [12:56:55] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: puppet: Custom type providers - https://phabricator.wikimedia.org/T241160 (10jbond) If we go down this route we should use the puppet [[ https://puppet.com/docs/puppet/7.4/about_the_resource_api.html | resource API ]] which is much cleaner then standard ty... [13:00:03] !log imported cassandra_2.2.6-wmf5 to buster-wikimedia [13:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:19] (03PS1) 10Klausman: hiera: Add LVS realserver config for ML k8s [puppet] - 10https://gerrit.wikimedia.org/r/670816 (https://phabricator.wikimedia.org/T272918) [13:01:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 100%: Repool db1144:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14786 and previous config saved to /var/cache/conftool/dbconfig/20210311-130103-root.json [13:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130 for schema change', diff saved to https://phabricator.wikimedia.org/P14787 and previous config saved to /var/cache/conftool/dbconfig/20210311-130208-marostegui.json [13:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:59] (03CR) 10Jbond: [C: 04-1] "if we implement custom type/providers we should imo prefer the resource api" [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond) [13:03:51] !log copy python-mwclient 0.8.4-1 from stretch-wikimedia to buster-wikimedia for T275865 [13:03:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:57] T275865: Toolforge: migrate bastions to Debian Buster - https://phabricator.wikimedia.org/T275865 [13:04:00] !log installing openssl1.0 security updates on stretch [13:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:39] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28498/console" [puppet] - 10https://gerrit.wikimedia.org/r/670816 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [13:08:14] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28500/console" [puppet] - 10https://gerrit.wikimedia.org/r/670816 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [13:09:55] (03PS2) 10Klausman: hiera: Add LVS realserver config for ML k8s [puppet] - 10https://gerrit.wikimedia.org/r/670816 (https://phabricator.wikimedia.org/T272918) [13:10:53] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28501/console" [puppet] - 10https://gerrit.wikimedia.org/r/670816 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [13:12:27] (03PS4) 10Arturo Borrero Gonzalez: toolforge: initial support for Debian Buster on bastions [puppet] - 10https://gerrit.wikimedia.org/r/667144 (https://phabricator.wikimedia.org/T275865) [13:12:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Fix Rakefile to run lint and validate_template again for all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/670806 (owner: 10JMeybohm) [13:13:40] (03PS5) 10Arturo Borrero Gonzalez: toolforge: initial support for Debian Buster on bastions [puppet] - 10https://gerrit.wikimedia.org/r/667144 (https://phabricator.wikimedia.org/T275865) [13:15:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: initial support for Debian Buster on bastions 
[puppet] - 10https://gerrit.wikimedia.org/r/667144 (https://phabricator.wikimedia.org/T275865) (owner: 10Arturo Borrero Gonzalez) [13:17:59] (03PS1) 10MSantos: maps imposm3: add log file for imposm3 sync [puppet] - 10https://gerrit.wikimedia.org/r/670817 [13:18:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P14788 and previous config saved to /var/cache/conftool/dbconfig/20210311-131818-root.json [13:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:21:44] (03CR) 10Elukey: [C: 03+1] hiera: Add LVS realserver config for ML k8s [puppet] - 10https://gerrit.wikimedia.org/r/670816 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [13:22:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:23:29] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: Add LVS realserver config for ML k8s [puppet] - 10https://gerrit.wikimedia.org/r/670816 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [13:27:28] (03CR) 10JMeybohm: [C: 03+2] Fix Rakefile to run lint and validate_template again for all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/670806 (owner: 10JMeybohm) [13:28:02] PROBLEM - Host ms-be1061 is DOWN: PING CRITICAL - Packet loss = 100% [13:28:47] (03Merged) 10jenkins-bot: Fix Rakefile to run lint and validate_template again for all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/670806 (owner: 10JMeybohm) [13:29:06] RECOVERY - Host ms-be1061 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [13:29:18] moritzm: part of reboots? --^ [13:30:04] Majavah and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy beta: Master DB switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1330). [13:30:16] here [13:30:19] (03PS1) 10Klausman: hiera: Switch ml-ctrl service to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/670818 (https://phabricator.wikimedia.org/T272918) [13:30:23] move to #-releng or stay here? [13:31:23] Majavah: hello! [13:31:31] o/ [13:31:32] -releng is probably better [13:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 30%: Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P14789 and previous config saved to /var/cache/conftool/dbconfig/20210311-133321-root.json [13:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:29] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[13:33:29] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [13:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:11] elukey: yeah, that is expected [13:36:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet [13:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet [13:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:39] 10ops-eqiad, 10decommission-hardware: decommission frqueue1001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T277171 (10Jgreen) [13:39:50] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [13:39:50] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[13:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:46] (03PS2) 10Majavah: betacluster: promote db07 as db master, decom db06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670803 (https://phabricator.wikimedia.org/T277070) [13:42:11] (03CR) 10Elukey: [C: 03+1] "LGTM, needs the traffic's team sign off and then we can proceed with the pybal's restart :)" [puppet] - 10https://gerrit.wikimedia.org/r/670818 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [13:43:00] (03CR) 10Urbanecm: [C: 03+2] betacluster: promote db07 as db master, decom db06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670803 (https://phabricator.wikimedia.org/T277070) (owner: 10Majavah) [13:43:56] (03Merged) 10jenkins-bot: betacluster: promote db07 as db master, decom db06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670803 (https://phabricator.wikimedia.org/T277070) (owner: 10Majavah) [13:47:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet [13:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 60%: Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P14790 and previous config saved to /var/cache/conftool/dbconfig/20210311-134825-root.json [13:48:25] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet [13:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:53] (03Abandoned) 10Majavah: betacluster: read only for db master switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670804 (https://phabricator.wikimedia.org/T277070) (owner: 10Majavah) [13:55:09] 
10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) [13:56:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet [13:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] httpd: enable httpd to emit ECS-compliant logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668231 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [13:59:13] (03PS1) 10Kormat: mariadb: Expect all pc hosts to be read-write. [puppet] - 10https://gerrit.wikimedia.org/r/670822 [14:00:05] brennen and liw: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American+European Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1400). [14:00:30] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28502/console" [puppet] - 10https://gerrit.wikimedia.org/r/670822 (owner: 10Kormat) [14:01:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2149 into s3', diff saved to https://phabricator.wikimedia.org/P14791 and previous config saved to /var/cache/conftool/dbconfig/20210311-140119-marostegui.json [14:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:09] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (10Pcoombe) Thank you! 
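Note on the dbctl entries above: after each schema change the replica is repooled in stages (10%, 30%, 60%, 100%), with roughly 15 minutes between steps judging by the timestamps (db1144:3315 at 12:15:53, 12:30:56, 12:46:00 and 13:01:03). The sketch below only reproduces that timetable. The helper name `repool_schedule` and the fixed 15-minute interval are assumptions made for illustration; this is not how dbctl or the repool tooling is implemented, and it does not call either.

```python
from datetime import datetime, timedelta

# Percentages are the ones seen in the dbctl log entries.
# The 15-minute spacing is an approximation read off the timestamps,
# not a documented setting.
REPOOL_STEPS = (10, 30, 60, 100)
STEP_INTERVAL = timedelta(minutes=15)


def repool_schedule(start, steps=REPOOL_STEPS, interval=STEP_INTERVAL):
    """Return (time, percentage) pairs for a staged repool (hypothetical helper)."""
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    # Start time taken from the first db1144:3315 repool entry in the log.
    for when, pct in repool_schedule(datetime(2021, 3, 11, 12, 15, 53)):
        print(f"{when:%H:%M:%S} repool @ {pct}%")
```

The predicted times land a few seconds before the logged ones, which fits a fixed wait between steps plus the time each commit takes.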
[14:03:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P14792 and previous config saved to /var/cache/conftool/dbconfig/20210311-140328-root.json [14:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P14793 and previous config saved to /var/cache/conftool/dbconfig/20210311-140526-marostegui.json [14:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:28] !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836 [14:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:35] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [14:12:15] (03CR) 10Marostegui: [C: 03+1] mariadb: Expect all pc hosts to be read-write. [puppet] - 10https://gerrit.wikimedia.org/r/670822 (owner: 10Kormat) [14:17:37] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Expect all pc hosts to be read-write. [puppet] - 10https://gerrit.wikimedia.org/r/670822 (owner: 10Kormat) [14:21:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 10%: Repool db1113:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14794 and previous config saved to /var/cache/conftool/dbconfig/20210311-142157-root.json [14:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:13] (03CR) 10Muehlenhoff: "I like having a fleet-wide identical /etc/services. 
One other option (which is even more KISS) would be to upload rebuilds of netbase 6.2 " [puppet] - 10https://gerrit.wikimedia.org/r/670810 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [14:23:16] 10SRE: Improve alerting for hosts with Puppet disabled for longer periods - https://phabricator.wikimedia.org/T277083 (10Volans) One option for the longer term could also be to actually generate the list of mgmt hosts to monitor in Icinga from Netbox instead that from PuppetDB... any host with a mgmt IP should b... [14:28:09] PROBLEM - puppet last run on kubestagemaster1001 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: blah-alex - akosiaris, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:29:55] (03PS1) 10Kormat: mariadb: Fix duplicate resource declaration. [puppet] - 10https://gerrit.wikimedia.org/r/670828 [14:30:21] (03PS1) 10Muehlenhoff: Assign mw_rc_irc role to irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/670829 (https://phabricator.wikimedia.org/T224579) [14:30:31] (03PS2) 10Muehlenhoff: Assign mw_rc_irc role to irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/670829 (https://phabricator.wikimedia.org/T224579) [14:32:10] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28503/console" [puppet] - 10https://gerrit.wikimedia.org/r/670828 (owner: 10Kormat) [14:35:12] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Fix duplicate resource declaration. 
[puppet] - 10https://gerrit.wikimedia.org/r/670828 (owner: 10Kormat) [14:36:01] (03CR) 10Klausman: [C: 03+2] hiera: Switch ml-ctrl service to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/670818 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [14:37:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 30%: Repool db1113:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14795 and previous config saved to /var/cache/conftool/dbconfig/20210311-143700-root.json [14:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:03] (03CR) 10Marostegui: [C: 04-1] "The grants presents there for the older deployment hosts are just probably leftovers from years ago and are on clouddb file cause I just d" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [14:43:05] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.39:6443]) https://wikitech.wikimedia.org/wiki/PyBal [14:43:29] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 77 connections established with conf2001.codfw.wmnet:2379 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [14:43:48] this is due to a new service --^ [14:43:55] Cc: klausman: --^ [14:44:18] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.39:6443]) Klausman LVS setup of ML team k8s (T272918) https://wikitech.wikimedia.org/wiki/PyBal [14:44:19] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 57 connections established with conf2001.codfw.wmnet:2379 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [14:44:25] yep, acking alerts as they enter HARD state [14:44:31] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.39:6443]) 
https://wikitech.wikimedia.org/wiki/PyBal [14:44:51] klausman: we can start with 2010 if you are ready [14:45:03] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.39:6443]) Klausman LVS setup of ML team k8s (T272918) https://wikitech.wikimedia.org/wiki/PyBal [14:45:38] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 57 connections established with conf2001.codfw.wmnet:2379 (min=58) Klausman LVS setup of ML team k8s (T272918) https://wikitech.wikimedia.org/wiki/PyBal [14:45:58] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 77 connections established with conf2001.codfw.wmnet:2379 (min=78) Klausman LVS setup of ML team k8s (T272918) https://wikitech.wikimedia.org/wiki/PyBal [14:46:44] !log installing openssl (1.1) security updates for stretch [14:46:51] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.39:6443]) https://wikitech.wikimedia.org/wiki/PyBal [14:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:53] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 113 connections established with conf1004.eqiad.wmnet:4001 (min=114) https://wikitech.wikimedia.org/wiki/PyBal [14:47:03] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 65 connections established with conf1004.eqiad.wmnet:4001 (min=66) https://wikitech.wikimedia.org/wiki/PyBal [14:47:25] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 65 connections established with conf1004.eqiad.wmnet:4001 (min=66) Klausman LVS setup of ML team k8s (T272918) https://wikitech.wikimedia.org/wiki/PyBal [14:47:25] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.39:6443]) https://wikitech.wikimedia.org/wiki/PyBal [14:47:43] ACKNOWLEDGEMENT - PyBal 
IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.39:6443]) Klausman LVS setup of ML team k8s (T272918) https://wikitech.wikimedia.org/wiki/PyBal [14:47:53] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 113 connections established with conf1004.eqiad.wmnet:4001 (min=114) Klausman LVS setup of ML team k8s (T272918) https://wikitech.wikimedia.org/wiki/PyBal [14:48:08] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.39:6443]) Klausman LVS setup of ML team k8s (T272918) https://wikitech.wikimedia.org/wiki/PyBal [14:49:29] !log restarting pybal on lvs2010 T272918 [14:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:35] T272918: Create ml-serve k8s cluster - https://phabricator.wikimedia.org/T272918 [14:50:15] !log restarting pybal on lvs1016 T272918 [14:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:44] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: add cassandra311 component [puppet] - 10https://gerrit.wikimedia.org/r/670813 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [14:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 60%: Repool db1113:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14796 and previous config saved to /var/cache/conftool/dbconfig/20210311-145204-root.json [14:52:08] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 114 connections established with conf1004.eqiad.wmnet:4001 (min=114) https://wikitech.wikimedia.org/wiki/PyBal [14:52:08] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:26] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no 
difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:53:58] jynus or marostegui: we'd like to go forward with the DB table creation for GrowthExperiments (for testwiki only - we were supposed to deploy to beta this week, but it's so broken doing it on testwiki makes more sense). It has been discussed before and AIUI table creation is self-serve, just making sure you don't have any concerns / we did not forget anything. [14:54:03] The summary is in T266913. [14:54:03] T266913: Add a link engineering: create tables in Wikimedia production - https://phabricator.wikimedia.org/T266913 [14:54:40] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 78 connections established with conf2001.codfw.wmnet:2379 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [14:55:30] tgr_: the filtering is in place right? Was that a private table? [14:55:38] !log restarting pybal on lvs2009 T272918 [14:55:39] yes [14:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:46] T272918: Create ml-serve k8s cluster - https://phabricator.wikimedia.org/T272918 [14:56:08] tgr_: What's the table name? So I can quickly check for the filtering? [14:56:22] https://phabricator.wikimedia.org/rOPUPb4fbbb751787c4ba4e39cc09ef70d716b279c95a was the patch [14:57:03] tgr_: let me quickly check [14:57:55] tgr_: looks good, go ahead! [14:58:02] thanks! 
[14:58:51] (will do it in the backport window, in a couple hours) [14:59:30] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:00:46] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 58 connections established with conf2001.codfw.wmnet:2379 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [15:01:14] (03PS2) 10Muehlenhoff: docker: Simply use the default package version on Buster and later [puppet] - 10https://gerrit.wikimedia.org/r/667628 [15:01:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/667628 (owner: 10Muehlenhoff) [15:02:47] !log restarting pybal on lvs1015 T272918 [15:02:53] klausman: Failed to log message to wiki. Somebody should check the error logs. [15:02:54] T272918: Create ml-serve k8s cluster - https://phabricator.wikimedia.org/T272918 [15:03:13] why did that fail? ^ [15:03:22] No clue [15:04:26] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:04:36] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 66 connections established with conf1004.eqiad.wmnet:4001 (min=66) https://wikitech.wikimedia.org/wiki/PyBal [15:07:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 100%: Repool db1113:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P14797 and previous config saved to /var/cache/conftool/dbconfig/20210311-150707-root.json [15:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:01] klausman: try to log again for documentation purposes :) [15:08:15] it may have been a transient issue [15:12:28] (03CR) 10Muehlenhoff: [C: 03+2] docker: Simply use the default package version on Buster and later [puppet] - 10https://gerrit.wikimedia.org/r/667628 (owner: 10Muehlenhoff) [15:14:13] (03PS1) 10Klausman: ssl: update ml-ctrl 
certs (fixed altname) [puppet] - 10https://gerrit.wikimedia.org/r/670835 (https://phabricator.wikimedia.org/T272918) [15:14:32] (03CR) 10Elukey: [C: 03+1] ssl: update ml-ctrl certs (fixed altname) [puppet] - 10https://gerrit.wikimedia.org/r/670835 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 for schema change', diff saved to https://phabricator.wikimedia.org/P14798 and previous config saved to /var/cache/conftool/dbconfig/20210311-151435-marostegui.json [15:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:53] (03CR) 10Klausman: [C: 03+2] ssl: update ml-ctrl certs (fixed altname) [puppet] - 10https://gerrit.wikimedia.org/r/670835 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:19:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:19:50] (03PS1) 10JMeybohm: ratelimit: Switch to nobody, update build and base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670836 (https://phabricator.wikimedia.org/T274852) [15:23:21] (03PS1) 10JMeybohm: fluent-bit: Switch to nobody and use seed_image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670838 (https://phabricator.wikimedia.org/T274852) [15:23:49] (03CR) 10Herron: [C: 03+2] grafana: add domainrw param and lookup [puppet] - 10https://gerrit.wikimedia.org/r/670567 (owner: 10Herron) [15:24:47] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10OlyKalinichenkoSpeedAndFunction) [15:26:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: Repool db1110 after schema change', diff saved to 
https://phabricator.wikimedia.org/P14799 and previous config saved to /var/cache/conftool/dbconfig/20210311-152627-root.json [15:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:59] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/670839 [15:27:07] 10SRE, 10Performance-Team, 10Platform Engineering, 10serviceops: Get rid of nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (10Joe) [15:27:18] 10SRE, 10Performance-Team, 10Platform Engineering, 10serviceops: Get rid of nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (10Joe) p:05Triage→03Medium [15:28:47] (03PS1) 10Muehlenhoff: Stop specifying specific docker releases [puppet] - 10https://gerrit.wikimedia.org/r/670840 [15:29:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/670840 (owner: 10Muehlenhoff) [15:30:30] (03CR) 10Hnowlan: [C: 03+2] aptrepo: add cassandra311 component [puppet] - 10https://gerrit.wikimedia.org/r/670813 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [15:30:47] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:19] (03CR) 10Hnowlan: [C: 03+1] ratelimit: Switch to nobody, update build and base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670836 (https://phabricator.wikimedia.org/T274852) (owner: 10JMeybohm) [15:32:03] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275831 (10Cmjohnson) [15:32:31] (03PS1) 10Marostegui: db2105: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670841 [15:32:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - 
https://phabricator.wikimedia.org/T245757 (10Cmjohnson) [15:32:53] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission deploy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275831 (10Cmjohnson) 05Open→03Resolved a:05Jclark-ctr→03Cmjohnson Removed from rack, ran script [15:33:57] (03CR) 10Marostegui: [C: 03+2] db2105: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670841 (owner: 10Marostegui) [15:35:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:05] 10SRE, 10ops-eqiad, 10Analytics, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10Cmjohnson) @wiki_willy This server is out of warranty by 1 year (purchased 2017) I can probably find a used one in our decom servers. Let me know if this is how you want t... [15:36:12] 10SRE, 10ops-eqiad, 10Analytics, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10Cmjohnson) a:03wiki_willy [15:36:41] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 15.63 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [15:37:10] (03PS3) 10Muehlenhoff: Update address for rmurthy, converted to staff [puppet] - 10https://gerrit.wikimedia.org/r/665077 [15:37:35] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Cmjohnson) @Marostegui can we schedule this for Monday next week? 1500/1600UTC timeframe please? 
Thanks [15:38:23] (03CR) 10Muehlenhoff: [C: 03+2] Update address for rmurthy, converted to staff [puppet] - 10https://gerrit.wikimedia.org/r/665077 (owner: 10Muehlenhoff) [15:38:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Marostegui) Sounds good @Cmjohnson - I will leave the host off beforehand so you can proceed as you wish. Once you are done, just power it back on and I will take it from there. Thank you [15:41:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 30%: Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P14800 and previous config saved to /var/cache/conftool/dbconfig/20210311-154131-root.json [15:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:34] (03PS2) 10Muehlenhoff: Stop specifying specific docker releases [puppet] - 10https://gerrit.wikimedia.org/r/670840 [15:43:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/670840 (owner: 10Muehlenhoff) [15:44:57] (03PS1) 10Giuseppe Lavagetto: redis: also configure the new rdb servers [puppet] - 10https://gerrit.wikimedia.org/r/670846 [15:45:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:48:08] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (10Cmjohnson) a:03elukey I replaced the disk it's in an unconfigured state: Can you add it back to the raid? Firmware state: Unconfigured(good), Spun Up [15:48:34] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) Dell will be here tomorrow morning to replace the backplane. 
[15:49:53] RECOVERY - MegaRAID on analytics1059 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:53:13] !log updating firmware wdqs1009 T274751 [15:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:20] T274751: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 [15:54:24] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] ssh-client-config: use wmcloud bastion [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/668746 (owner: 10Elukey) [15:55:46] (03PS2) 10Muehlenhoff: Update puppetised java.security file from 11.9 [puppet] - 10https://gerrit.wikimedia.org/r/637464 (https://phabricator.wikimedia.org/T266782) [15:56:01] PROBLEM - Host wdqs1009 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 60%: Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P14801 and previous config saved to /var/cache/conftool/dbconfig/20210311-155635-root.json [15:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:59:08] (03PS2) 10Giuseppe Lavagetto: redis: also configure the new rdb servers [puppet] - 10https://gerrit.wikimedia.org/r/670846 [15:59:10] (03PS1) 10Giuseppe Lavagetto: rdb: use buster on newer servers [puppet] - 10https://gerrit.wikimedia.org/r/670850 [16:00:03] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10RobH) 05Open→03Resolved [16:02:39] RECOVERY - Host wdqs1009 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:02:45] PROBLEM - Logstash Elasticsearch indexing 
errors #o11y on alert1001 is CRITICAL: 41.62 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:04:29] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:04:29] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:05:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:06:47] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:47] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:08:41] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:55] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:05] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) [16:11:08] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) [16:11:26] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10Krinkle) [16:11:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P14802 and previous config saved to /var/cache/conftool/dbconfig/20210311-161138-root.json [16:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:05] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 13.6 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:14:35] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:33] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:58] (03PS1) 10Gergő Tisza: Configure GrowthExperiments Add Link settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670857 (https://phabricator.wikimedia.org/T277173) [16:21:00] (03PS1) 10Gergő Tisza: Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) [16:21:24] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10Cmjohnson) Verified cable/port info, labeled server. updated idrac, verified connected to correct 10G port fixed NICs to PXE 10G and turned off PXE for GB port. [16:22:18] (03CR) 10jerkins-bot: [V: 04-1] Configure GrowthExperiments Add Link settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670857 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [16:22:22] (03CR) 10jerkins-bot: [V: 04-1] Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [16:24:51] PROBLEM - Host wdqs1009 is DOWN: PING CRITICAL - Packet loss = 100% [16:25:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={jmx_wdqs_streaming_updater,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:26:23] !log upgrade memcached on mc1021, mc2021 [16:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:15] RECOVERY - Host wdqs1009 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [16:31:59] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:13] (03CR) 10Volans: "addressed comments" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [16:32:16] (03PS3) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [16:34:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:34:16] (03PS2) 10Gergő Tisza: Configure GrowthExperiments Add Link settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670857 (https://phabricator.wikimedia.org/T277173) [16:34:18] (03PS2) 10Gergő Tisza: Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) [16:36:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 3 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (10Cmjohnson) 05Open→03Resolved updated BIOS, IDRAC and NIC firmware. Resolving the task, if an issue persists please open a new task with the error. [16:37:26] (03CR) 10Andrew Bogott: "@jbond, this patch uses a new custom fact that was largely untested previously. Can you point me to where I need to mock up fact data to m" [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [16:39:17] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10Cmjohnson) [16:39:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10Cmjohnson) a:05Cmjohnson→03RobH Rob, assigning this to you to complete! 
Thanks [16:40:05] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:53] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:41:57] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:41:59] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:42:11] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:42:19] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (10Cmjohnson) @wiki_willy I am pretty confident that the power spike is from the new ms-be1060. 
[16:42:27] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:43:47] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:43:51] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:43:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "When merging this we need to actually import the docker image into our internal registry." [puppet] - 10https://gerrit.wikimedia.org/r/658416 (owner: 10Bstorm) [16:43:55] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:44:09] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:44:25] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:44:33] (03PS12) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [16:45:04] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [16:45:05] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 44.74 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors 
https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:45:18] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/658416 (owner: 10Bstorm) [16:46:48] (03PS13) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [16:47:55] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [16:48:10] (03CR) 10Bstorm: "Doing a quick check on this in toolsbeta, then I'll merge it." [puppet] - 10https://gerrit.wikimedia.org/r/658416 (owner: 10Bstorm) [16:51:28] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (10Cmjohnson) 05Open→03Resolved I moved ms-be1060 to a different phase. I think we could add the server to A7 but it cannot be on the same phase as ms-be1060. 
[16:51:32] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) [16:51:56] (03PS14) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [16:53:00] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [16:55:52] (03PS15) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [16:56:22] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [16:58:55] (03PS16) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:00:00] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [17:00:05] jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1700). 
[17:03:12] (03PS17) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:04:19] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [17:05:15] PROBLEM - puppet last run on idp-test1001 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: testing trasncoders - T273867 - jbond, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:09:32] (03PS3) 10Gergő Tisza: Configure GrowthExperiments Add Link settings, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670857 (https://phabricator.wikimedia.org/T277173) [17:09:34] (03PS3) 10Gergő Tisza: Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) [17:09:36] (03PS1) 10Gergő Tisza: Configure GrowthExperiments Add Link settings, step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670887 (https://phabricator.wikimedia.org/T277173) [17:10:33] (03PS18) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:11:49] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [17:12:15] RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.001 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [17:13:56] (03PS3) 10Effie Mouzeli: hieradata: upgrade mwdebug1001 onhost memcached to 1.6x [puppet] - 
10https://gerrit.wikimedia.org/r/670774 [17:16:33] (03PS19) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:17:11] (03CR) 10Bstorm: [C: 03+2] "Works well. 16 min and no crashlooping in toolsbeta" [puppet] - 10https://gerrit.wikimedia.org/r/658416 (owner: 10Bstorm) [17:17:53] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [17:21:05] (03PS20) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:21:56] (03Abandoned) 10Effie Mouzeli: hieradata: upgrade mwdebug1001 onhost memcached to 1.6x [puppet] - 10https://gerrit.wikimedia.org/r/670774 (owner: 10Effie Mouzeli) [17:22:21] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [17:28:44] jouncebot: next [17:28:45] In 0 hour(s) and 31 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1800) [17:29:00] (03CR) 10CDanis: "In general I'm in favor of explicitly managing /etc/services *somehow*." 
[puppet] - 10https://gerrit.wikimedia.org/r/670810 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [17:30:46] (03PS21) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:31:00] !log install mecached 1.6.6-1 on mwdebug1001 [17:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:51] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [17:32:50] (03PS1) 10Cwhite: logstash: add dead letter queue ingest support [puppet] - 10https://gerrit.wikimedia.org/r/670893 (https://phabricator.wikimedia.org/T277080) [17:33:36] (03PS22) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:33:40] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (10wiki_willy) Ah cool, thanks @Cmjohnson >>! In T276743#6905568, @Cmjohnson wrote: > I moved ms-be1060 to a different phase. I think we could add the server to A7 but it cannot be on the sam... [17:34:23] (03PS4) 10Cwhite: logstash: add dead letter queue support [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) [17:35:36] (03CR) 10Dzahn: "Thank you, makes sense to me. 
I have heard no complaints about anything specific regarding clouddb not working since we switched deploymen" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [17:35:46] (03Abandoned) 10Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [17:35:53] (03Restored) 10Dzahn: mariadb: update grants for deployment servers to clouddb and prod-m5 [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [17:36:20] (03CR) 10Dzahn: "oh wait, the prod-m5 part still needs to be merged. will amend" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [17:36:33] PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [17:37:55] (03PS2) 10Majavah: redis::multidc: Make discovery optional [puppet] - 10https://gerrit.wikimedia.org/r/669447 [17:38:22] (03CR) 10Majavah: redis::multidc: Make discovery optional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/669447 (owner: 10Majavah) [17:38:49] 10SRE, 10ops-eqiad, 10Analytics, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10wiki_willy) a:05wiki_willy→03Cmjohnson Hi @cmjohnson - it sounds like they need it in production. @elukey or @Ottomata - let us know if there's a particular decom'd... [17:43:04] 10SRE, 10ops-eqiad, 10Analytics, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10elukey) @wiki_willy any decommed host is fine! 
[17:44:20] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10jijiki) @Cyberpower678 Please disable the bot until this issue is fixed [17:45:05] (03PS5) 10Cwhite: logstash: add dead letter queue support [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) [17:48:50] (03PS5) 10Cwhite: httpd: enable httpd to emit ECS-compliant logs [puppet] - 10https://gerrit.wikimedia.org/r/668231 (https://phabricator.wikimedia.org/T234565) [17:49:12] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914 (10jijiki) [17:50:26] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (10elukey) a:05elukey→03razzi [17:55:32] (03PS23) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:57:24] 10SRE, 10Analytics, 10CirrusSearch, 10Wikidata, and 4 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (10Ottomata) Hello! Does Analytics have to upgrade too? :) [17:57:39] 10SRE, 10Analytics-Clusters, 10CirrusSearch, 10Wikidata, and 4 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (10Ottomata) [17:58:36] 10SRE, 10Analytics-Radar, 10observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (10Ottomata) [17:58:52] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10Ottomata) [18:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1800). [18:00:57] 10SRE, 10ops-eqiad, 10Analytics-Clusters: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (10Ottomata) [18:02:09] (03PS24) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [18:03:51] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) [18:10:33] (03PS1) 10Hnowlan: aqs: test cassandra 3.11 on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) [18:11:49] (03PS1) 10Mstyles: update helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/670906 [18:16:13] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 53.23 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [18:18:07] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:19:16] (03PS2) 10Cwhite: pontoon: initial hiera config for pontoon env [puppet] - 10https://gerrit.wikimedia.org/r/669968 [18:19:58] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28511/console" [puppet] - 10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [18:20:32] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:31:01] (03PS3) 10Cwhite: pontoon: initial hiera config for pontoon env [puppet] - 10https://gerrit.wikimedia.org/r/669968 [18:37:37] I'll use mwdebug1001 for a bit [18:41:17] (03CR) 10Cwhite: pontoon: initial hiera config for pontoon env (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/669968 (owner: 10Cwhite) [18:41:32] (03CR) 10Dzahn: [C: 03+1] "Thank you very much for this fix!" [puppet] - 10https://gerrit.wikimedia.org/r/670784 (owner: 10Hnowlan) [18:42:45] (03CR) 10Elukey: aqs: test cassandra 3.11 on aqs1010 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [18:44:09] (03CR) 10Bstorm: profile::ci::slave::labs::common: move to cinder-based storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [18:46:43] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/28512/" [puppet] - 10https://gerrit.wikimedia.org/r/670424 (https://phabricator.wikimedia.org/T276673) (owner: 10Jbond) [18:46:58] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) [18:47:51] !log running mwscript extensions/GrowthExperiments/maintenance/importOresTopics.php testwiki --count 1000 --verbose --wikiId enwiki --apiUrl 'https://en.wikipedia.org/w/api.php' [18:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:19] (03CR) 10Dzahn: "deployed, acmechief1001 puppet run looked fine" [puppet] - 10https://gerrit.wikimedia.org/r/670424 (https://phabricator.wikimedia.org/T276673) (owner: 10Jbond) [18:50:10] (03PS1) 10RobH: backup1003 updates [puppet] - 10https://gerrit.wikimedia.org/r/670911 (https://phabricator.wikimedia.org/T274184) [18:50:17] (03PS2) 10Hnowlan: aqs: test cassandra 3.11 on aqs1010 [puppet] - 
10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) [18:50:55] (03CR) 10Dzahn: [C: 03+1] "Deployed the acme-chief part, would also deploy this, just that /usr/bin/gitlab-ctl does not exist yet. So it might fail if that refresh c" [puppet] - 10https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: 10Jbond) [18:51:25] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28513/console" [puppet] - 10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [18:51:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) [18:53:20] (03CR) 10RobH: [C: 03+2] backup1003 updates [puppet] - 10https://gerrit.wikimedia.org/r/670911 (https://phabricator.wikimedia.org/T274184) (owner: 10RobH) [18:53:27] 10SRE, 10Analytics-Clusters, 10CirrusSearch, 10Wikidata, and 4 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (10colewhite) >>! In T276595#6905896, @Ottomata wrote: > Hello! Does Analytics have to upgrade too? :) The updated jar will be deployed to our apt repo whi... 
[18:53:58] (03CR) 10Cwhite: [C: 03+2] Upgrade to upstream version 0.15.0 [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/669894 (https://phabricator.wikimedia.org/T276595) (owner: 10Cwhite) [18:55:50] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) [18:56:38] !log hnowlan@deploy1002 Started deploy [restbase/deploy@6f0fe23]: Remove internal ratelimits that were causing service proxy issues [18:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:49] 10SRE, 10ops-eqiad, 10Analytics-Clusters: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (10razzi) 05Open→03Resolved Disk is added to raid. Thanks @Cmjohnson for doing the replacement. [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T1900). [19:00:04] tgr: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:19] (03PS1) 10Legoktm: Support having multiple IRC feed servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) [19:00:20] o/ [19:00:22] (03PS1) 10Legoktm: Define IRC feed servers as an array in {Production,Labs}Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670914 [19:00:24] (03PS1) 10Legoktm: Remove back-compat from when IRC feed servers was a string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670915 [19:00:28] I can do the backport [19:00:49] fingers crossed tgr_ [19:01:05] moritzm: ^^ [19:01:41] (03CR) 10Dzahn: "amended so that for prod we change to new IP but for clouddb we just drop the old IP. 
I was about to abandon that part of the change but r" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [19:01:52] hasharDinner: do you want the wikitech config change deployed? [19:02:10] yeah [19:02:20] (03PS2) 10Gergő Tisza: wikitech: enable BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668196 (https://phabricator.wikimedia.org/T125941) (owner: 10Hashar) [19:02:23] if someone is handling the backport-confg sure :) [19:02:42] andrewbogott said that it should be fine for wikitech [19:02:50] well at least he did not deny it [19:03:37] I have no idea how wikitech works these days TBH [19:03:48] I don't either [19:03:50] but merging that patch won't do any harm at least [19:04:07] (03CR) 10Gergő Tisza: [C: 03+2] wikitech: enable BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668196 (https://phabricator.wikimedia.org/T125941) (owner: 10Hashar) [19:04:29] I have a browser set to debug with mwdebug1001 [19:06:03] (03PS5) 10Dzahn: mariadb: update grants for deployment servers to prod-m5, drop from clouddb [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) [19:06:12] (03Merged) 10jenkins-bot: wikitech: enable BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668196 (https://phabricator.wikimedia.org/T125941) (owner: 10Hashar) [19:06:52] historically that didn't work with wikitech, but let's see if it does now [19:07:14] the commit that disabled in 2014 did not give much detail [19:07:20] it bulk disabled several extensions [19:07:34] it's on mwdebug1001 [19:07:47] (03CR) 10Dzahn: "actually it's not just "not used by anything", it's kind of worse. That IP has already been assigned to something else." 
[puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [19:08:00] using X-Wikimedia-Debug didn't work, I mean [19:08:18] yeah [19:08:20] :( [19:09:06] (03CR) 10Dzahn: [C: 03+1] "so yea, since the prod-m5 part of this should already match production, please just drop the cloudddb grant for 10.64.32.16 and this can b" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [19:10:51] who knows really [19:11:05] on mwdebug1001 with mwscript maintenance/shell.php --wiki=labswiki [19:11:07] (03PS5) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) [19:11:13] I get $wmgUseBetaFeatures = false ; [19:11:46] tgr_: oh it is not pulled on the server! [19:12:06] running scap pull [19:12:14] same [19:12:45] oh duh, sorry [19:12:55] (03PS3) 10Hnowlan: aqs: test cassandra 3.11 on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) [19:12:58] it lacks a pull on deploy server apparently ;-] [19:12:59] I must have done the fetch before it finished merging [19:13:03] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@6f0fe23]: Remove internal ratelimits that were causing service proxy issues (duration: 16m 25s) [19:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:09] possibly yeah [19:13:34] my dance is usually: git fetch , git log HEAD..HEAD@{u} [19:13:38] okay it's on mwdebug1001 now for reals [19:13:41] then if the log looks ok I rebase [19:14:00] \o [19:14:57] and I dont see it enabled on Special:Version :\ [19:16:09] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28515/console" [puppet] - 10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) (owner: 
10Hnowlan) [19:16:11] shall we sync it and see if that helps? [19:16:18] possibly yeah [19:16:39] I have zero idea what can be broken [19:16:59] (03PS1) 10Jbond: prometheus::node_puppet_agent: update requires [puppet] - 10https://gerrit.wikimedia.org/r/670916 [19:17:01] (03PS1) 10Jbond: service_definitions: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [19:17:03] (03PS1) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [19:18:14] (03PS4) 10Hnowlan: aqs: test cassandra 3.11 on aqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) [19:18:14] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:668196|wikitech: enable BetaFeatures (T125941)]] (duration: 01m 08s) [19:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:21] T125941: Enable BetaFeatures extension on Wikitech - https://phabricator.wikimedia.org/T125941 [19:18:27] (03CR) 10jerkins-bot: [V: 04-1] service_definitions: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (owner: 10Jbond) [19:18:43] tgr_: or wikitech runs on a different machine and never ends up on mwdebug hosts [19:19:12] hashar: wikitech doesn't work on mwdebug hosts [19:19:19] in the past, that was definitely so [19:19:28] \o/ [19:19:30] it requires special packages that are only on the special wikitech hosts [19:19:32] thx Urbanecm :] [19:19:46] mw* servers even don't have access to the wikitech's database [19:19:50] but in the past it wasn't part of the normal app cluster and now it seems it is, so who knows [19:19:59] tgr_: I guess you can do the other change. 
Wikitech seems to be somewhat working at least [19:20:21] but maybe the varnish rules for handling the X-WM-D header were not updated [19:20:29] or the extension [19:20:42] https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures :] [19:21:03] yay [19:21:04] the thing i don't know is whether one of the features ends up being broken on wikitech. For example if one requires a database update [19:21:30] (03PS2) 10Gergő Tisza: Configure GrowthExperiments Add Link settings, step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670887 (https://phabricator.wikimedia.org/T277173) [19:21:37] (03PS4) 10Gergő Tisza: Configure GrowthExperiments Add Link settings, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670857 (https://phabricator.wikimedia.org/T277173) [19:21:49] (03PS4) 10Gergő Tisza: Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) [19:21:54] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28517/console" [puppet] - 10https://gerrit.wikimedia.org/r/670905 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [19:22:35] I have tested two of the three betafeatures and they work [19:22:37] success [19:22:39] thank you tgr_ ! [19:23:05] tgr_: wikitech isn't part of the normal app cluster [19:23:31] apparently it is now [19:23:57] (03CR) 10Gergő Tisza: [C: 03+2] Configure GrowthExperiments Add Link settings, step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670887 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [19:24:00] why do you think so? [19:24:58] James said so on the patch [19:25:54] I guess it is now part of the scap deployment [19:26:02] when maybe previously it was an ad hoc deployment? 
[19:26:03] plus, I don't think scap would have worked otherwise [19:26:10] I don't know really, wikitech has always been a mystery to me [19:26:20] T237773 is still open tgr_ [19:26:21] T237773: Move Wikitech onto the production MW cluster - https://phabricator.wikimedia.org/T237773 [19:26:24] (03Merged) 10jenkins-bot: Configure GrowthExperiments Add Link settings, step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670887 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [19:26:27] wikitech is not on mw* servers [19:26:45] it is a part of scap, but it runs totally detached from an ordinary cluster wiki [19:26:59] on labweb1001.wikimedia.org / labweb1002.wikimedia.org, to be precise [19:27:03] I think it's on labweb* hosts due to, uh, reasons [19:27:39] cause they reach the WMCS LDAP [19:28:03] and we probably dont want the various mw* servers to be able to reach it [19:28:11] production ldap, but https://phabricator.wikimedia.org/T237889 [19:28:16] anyway BetaFeatures is working! [19:30:13] (03PS2) 10Jbond: prometheus::node_puppet_agent: update requires [puppet] - 10https://gerrit.wikimedia.org/r/670916 [19:30:15] (03PS2) 10Jbond: service_definitions: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [19:30:30] (03PS2) 10Ryan Kemper: wdqs: impl. 
envoy for wdqs-test [puppet] - 10https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470) [19:30:48] !log tgr@deploy1002 Synchronized wmf-config/: Config: [[gerrit:670887|Configure GrowthExperiments Add Link settings, step 1 (T277173)]] (duration: 01m 08s) [19:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:56] T277173: Deploy Add Link on testwiki - https://phabricator.wikimedia.org/T277173 [19:31:24] (03CR) 10jerkins-bot: [V: 04-1] service_definitions: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (owner: 10Jbond) [19:31:37] (03CR) 10Gergő Tisza: [C: 03+2] Configure GrowthExperiments Add Link settings, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670857 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [19:33:39] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) a:05RobH→03Volans I've setup the hw raid and such to identically match backup2003 (the mirror of this in codfw which is already imaged successfully), and updated puppet r... 
[19:33:52] (03PS6) 10Dzahn: phabricator: replace task dump cron with timer and switch to weekly [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) [19:35:11] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/28514/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [19:35:16] (03CR) 10Jeena Huneidi: [C: 03+2] update helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/670906 (owner: 10Mstyles) [19:35:23] (03Merged) 10jenkins-bot: Configure GrowthExperiments Add Link settings, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670857 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [19:37:28] (03CR) 10Gehel: [C: 04-1] wdqs: impl. envoy for wdqs-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470) (owner: 10Ryan Kemper) [19:37:49] (03Merged) 10jenkins-bot: update helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/670906 (owner: 10Mstyles) [19:38:42] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) a:05Volans→03RobH [19:42:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) Arzhel caught this in IRC, it was setup for asw2-b1-eqiad:xe-2/0/33 when it should have been asw2-b2-eqiad:xe-2/0/33. I've removed the interface, reran the script, now rerun... [19:42:52] (03CR) 10Jbond: "> I agree that it should *not* differ between machines. I'm not sure if any of my past attempts did that, but it was not my intent if so." 
[puppet] - 10https://gerrit.wikimedia.org/r/670810 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [19:43:25] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:39] (03PS1) 10Dzahn: phabricator: remove absented cron job code [puppet] - 10https://gerrit.wikimedia.org/r/670921 (https://phabricator.wikimedia.org/T273673) [19:46:58] (03PS3) 10Jbond: prometheus::node_puppet_agent: update requires [puppet] - 10https://gerrit.wikimedia.org/r/670916 [19:47:00] (03PS3) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [19:47:42] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (owner: 10Jbond) [19:48:07] (03CR) 10Kosta Harlan: [C: 03+1] Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [19:48:17] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) Homer now fails with an error, pinged Arzhel about it as I'm not sure best way to proceed. 
[19:48:54] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/670810 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [19:52:31] (03PS4) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [19:52:56] (03PS1) 10Dzahn: phabricator: fix description for cleanup tmp job [puppet] - 10https://gerrit.wikimedia.org/r/670926 [19:53:07] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (owner: 10Jbond) [19:53:09] (03PS2) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [19:54:22] (03PS1) 10Jgreen: remove parallel notification of jgreen/dwisehaupt for fr-tech-ops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/670927 (https://phabricator.wikimedia.org/T273065) [19:54:31] !log tgr@deploy1002 Synchronized wmf-config/: Config: [[gerrit:670857|Configure GrowthExperiments Add Link settings, step 2 (T277173)]] (duration: 01m 08s) [19:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:38] T277173: Deploy Add Link on testwiki - https://phabricator.wikimedia.org/T277173 [19:55:04] !log T277173 running mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=testwiki GrowthExperiments [19:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:39] (03CR) 10Dzahn: [C: 03+2] phabricator: fix description for cleanup tmp job [puppet] - 10https://gerrit.wikimedia.org/r/670926 (owner: 10Dzahn) [19:56:33] (03CR) 10Jgreen: "Victorops works!" 
[puppet] - 10https://gerrit.wikimedia.org/r/670927 (https://phabricator.wikimedia.org/T273065) (owner: 10Jgreen) [19:59:01] !log phab1001 - sudo systemctl start phabricator_clean_tmp_files (manually run after conversion from cron to timer, and it fails with permission issues) [19:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] brennen and liw: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American+European Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T2000). [20:01:18] (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670931 [20:01:20] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670931 (owner: 10Brennen Bearnes) [20:03:19] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670931 (owner: 10Brennen Bearnes) [20:03:39] (03PS3) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [20:04:21] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:44] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.34 [20:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:36] brennen: I ran out of time with a backport window deploy; can you ping me when the train is done so I can finish it? it's testwiki only so it won't interfere with logspam [20:06:04] ACKNOWLEDGEMENT - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/666979 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:46] (03PS5) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [20:07:01] tgr_: yeah, give me a few more minutes to watch logs here [20:07:41] (03CR) 10Razzi: [C: 03+2] Enable maintenance mode for matomo reboot [puppet] - 10https://gerrit.wikimedia.org/r/670559 (https://phabricator.wikimedia.org/T273278) (owner: 10Razzi) [20:07:52] thx! I'm not in any hurry, just want to make sure I don't accidentally interrupt something [20:08:59] (03PS6) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [20:09:15] (03PS4) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [20:10:16] (03PS4) 10Jbond: prometheus::node_puppet_agent: update requires [puppet] - 10https://gerrit.wikimedia.org/r/670916 [20:10:33] (03PS7) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [20:10:41] (03PS5) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [20:11:56] (03CR) 10CDanis: [C: 03+1] prometheus::node_puppet_agent: update requires [puppet] - 10https://gerrit.wikimedia.org/r/670916 (owner: 10Jbond) [20:12:45] (03CR) 10Herron: [C: 03+1] remove parallel notification of jgreen/dwisehaupt for fr-tech-ops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/670927 (https://phabricator.wikimedia.org/T273065) (owner: 10Jgreen) [20:13:01] tgr_: you're clear to go ahead. 
[20:13:27] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host matomo1002.eqiad.wmnet [20:13:31] (03CR) 10Jgreen: [C: 03+2] remove parallel notification of jgreen/dwisehaupt for fr-tech-ops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/670927 (https://phabricator.wikimedia.org/T273065) (owner: 10Jgreen) [20:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:56] thanks! I thought that would take a way longer, I will be busy for another half an hour actually. Will get back to it after that. [20:14:13] (03PS1) 10Dzahn: phabricator: run cleanup_tmp_files job as root [puppet] - 10https://gerrit.wikimedia.org/r/670934 [20:14:29] cool [20:14:38] (03CR) 10Jbond: P:base: add ability to manage services file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670918 (owner: 10Jbond) [20:16:58] (03PS1) 10Razzi: matomo: Disable maintenance mode for matomo reboot [puppet] - 10https://gerrit.wikimedia.org/r/670935 (https://phabricator.wikimedia.org/T273278) [20:17:39] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1002.eqiad.wmnet [20:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:56] (03CR) 10Dzahn: [C: 03+2] phabricator: run cleanup_tmp_files job as root [puppet] - 10https://gerrit.wikimedia.org/r/670934 (owner: 10Dzahn) [20:18:27] (03CR) 10Razzi: [C: 03+2] matomo: Disable maintenance mode for matomo reboot [puppet] - 10https://gerrit.wikimedia.org/r/670935 (https://phabricator.wikimedia.org/T273278) (owner: 10Razzi) [20:19:06] (03PS1) 10Ottomata: Bump eventstreams and -internal to 2021-03-11-200606-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670936 (https://phabricator.wikimedia.org/T276305) [20:19:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,routinator} site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:11] !log phab1001 - systemctl start phabricator_clean_tmp_files - now Succeeded [20:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:18] (03CR) 10Ottomata: [C: 03+2] Bump eventstreams and -internal to 2021-03-11-200606-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670936 (https://phabricator.wikimedia.org/T276305) (owner: 10Ottomata) [20:22:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` backup1003.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2... 
[20:22:57] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:04] (03CR) 10Dzahn: "20:22 <+icinga-wm> RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/670934 (owner: 10Dzahn) [20:24:31] (03PS2) 10Dzahn: phabricator: remove absented cron job code [puppet] - 10https://gerrit.wikimedia.org/r/670921 (https://phabricator.wikimedia.org/T273673) [20:25:39] (03CR) 10Dzahn: [C: 03+2] phabricator: remove absented cron job code [puppet] - 10https://gerrit.wikimedia.org/r/670921 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:28:20] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [20:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:17] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [20:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:01] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . 
[20:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['backup1003.eqiad.wmnet'] ` [20:36:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` backup1003.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2... [20:37:30] I'm back [20:37:47] running createExtensionTables.php for real, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/670930 got in the way [20:38:15] (03CR) 10Cwhite: [C: 03+2] logstash: extract index label from logEvent indexing errors [puppet] - 10https://gerrit.wikimedia.org/r/670525 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:40:26] (03PS2) 10Dzahn: site: decom mw2215,mw2216, codfw API canaries [puppet] - 10https://gerrit.wikimedia.org/r/670631 (https://phabricator.wikimedia.org/T277119) [20:44:48] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@cc478d4]: T273847 export queries to relforge dag deployment [20:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:54] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847 [20:46:39] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [20:46:57] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@cc478d4]: T273847 export queries to relforge dag deployment (duration: 02m 09s) [20:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:03] (03PS1) 10Ottomata: Release anaconda-2020.02-wmf3 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/670939 [20:49:45] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['backup1003.eqiad.wmnet'] ` [20:50:39] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [20:50:41] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:31] (03PS5) 10Gergő Tisza: Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) [20:51:54] (03PS2) 10Ottomata: Release anaconda-2020.02-wmf3 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/670939 [20:53:31] (03CR) 10Gergő Tisza: [C: 03+2] Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [20:53:40] (03PS3) 10Ottomata: Release anaconda-2020.02-wmf3 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/670939 [20:54:34] (03Merged) 10jenkins-bot: Enable GrowthExperiments link recommendations on testwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/670858 (https://phabricator.wikimedia.org/T277173) (owner: 10Gergő Tisza) [20:54:58] (03PS4) 10Ottomata: Release anaconda-2020.02-wmf3 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/670939 [20:55:02] (03PS3) 10JJMC89: Disable magic links on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667316 (https://phabricator.wikimedia.org/T275951) [20:57:32] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2215.codfw.wmnet [20:57:33] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.763e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [20:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:39] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2216.codfw.wmnet [20:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:10] !log deactivating codfw API canaries on old hardware (T277119) [20:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:16] T277119: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 [20:58:31] (03PS5) 10Ottomata: Release anaconda-2020.02-wmf3 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/670939 [21:00:09] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [21:00:09] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[21:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw2215.codfw.wmnet with reason: decom [21:00:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw2215.codfw.wmnet with reason: decom [21:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw2216.codfw.wmnet with reason: decom [21:00:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw2216.codfw.wmnet with reason: decom [21:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:58] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` backup1003.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2... 
[21:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:16] (03CR) 10Dzahn: [C: 03+2] site: decom mw2215,mw2216, codfw API canaries [puppet] - 10https://gerrit.wikimedia.org/r/670631 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [21:01:21] (03PS3) 10Dzahn: site: decom mw2215,mw2216, codfw API canaries [puppet] - 10https://gerrit.wikimedia.org/r/670631 (https://phabricator.wikimedia.org/T277119) [21:03:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2215.codfw.wmnet [21:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:17] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [21:03:18] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [21:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:29] jouncebot: now [21:03:29] For the next 0 hour(s) and 56 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T2000) [21:03:43] (03CR) 10Cwhite: [C: 03+2] "Tested this change on Pontoon and it appears to work, also when enabled in a site with the CustomLog directive." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668231 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:04:24] aww, crap, host I wanted to remove is a scap proxy [21:04:32] abort [21:04:45] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts mw2215.codfw.wmnet [21:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2216.codfw.wmnet [21:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:15] (03PS1) 10Ebernhardson: zpapierski home: Start tmux on connect [puppet] - 10https://gerrit.wikimedia.org/r/670943 [21:10:17] (03PS1) 10Dzahn: scap: switch proxy for codfw A3 from mw2215 to mw2300 [puppet] - 10https://gerrit.wikimedia.org/r/670944 (https://phabricator.wikimedia.org/T277119) [21:12:04] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@3810277]: T273847 export queries to relforge dag deployment - correct start date [21:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:12] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847 [21:13:48] (03CR) 10Dzahn: "looks good, but should not deploy during deploy https://puppet-compiler.wmflabs.org/compiler1001/28518/" [puppet] - 10https://gerrit.wikimedia.org/r/670944 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [21:13:57] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@3810277]: T273847 export queries to relforge dag deployment - correct start date (duration: 01m 53s) [21:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:11] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:670858|Enable GrowthExperiments link recommendations on testwiki (T277173)] 
(duration: 00m 59s) [21:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:17] T277173: Deploy Add Link on testwiki - https://phabricator.wikimedia.org/T277173 [21:15:12] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1003.eqiad.wmnet with reason: REIMAGE [21:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:05] jouncebot now [21:17:06] For the next 0 hour(s) and 42 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210311T2000) [21:17:18] tgr_: clear if i roll the train back? [21:17:20] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1003.eqiad.wmnet with reason: REIMAGE [21:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:42] brennen: just finished! [21:17:49] cool, thank you [21:20:25] !log train status: 1.36.0-wmf.34 (T274938): rolling back to group1 and marking T277229 a train blocker [21:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:33] T274938: 1.36.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T274938 [21:20:33] T277229: UserOptionsManager: Argument 1 passed to MediaWiki\User\UserOptionsManager::setOption() must implement interface MediaWiki\User\UserIdentity, null given - https://phabricator.wikimedia.org/T277229 [21:21:05] (03PS1) 10Dzahn: mcrouter: replace proxy for codfw A3, mw2235->mw2300 [puppet] - 10https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) [21:21:57] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: Revert "all wikis to 1.36.0-wmf.34" [21:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:51] (03PS1) 10Cwhite: phabricator: use ecs-compatible apache log format [puppet] - 10https://gerrit.wikimedia.org/r/670950 
(https://phabricator.wikimedia.org/T234565) [21:22:53] (03PS1) 10Cwhite: gerrit: use ecs-compatible apache access log format [puppet] - 10https://gerrit.wikimedia.org/r/670951 (https://phabricator.wikimedia.org/T234565) [21:23:06] (03PS1) 10Dzahn: DHCP: remove mw2215 through mw2242 [puppet] - 10https://gerrit.wikimedia.org/r/670953 (https://phabricator.wikimedia.org/T277119) [21:23:24] (03PS2) 10Cwhite: phabricator: use ecs-compatible apache access log format [puppet] - 10https://gerrit.wikimedia.org/r/670950 (https://phabricator.wikimedia.org/T234565) [21:23:41] (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.36.0-wmf.34" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670954 [21:23:43] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.36.0-wmf.34" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670954 (owner: 10Brennen Bearnes) [21:24:03] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup1003.eqiad.wmnet'] ` and were **ALL** successful. 
[21:24:32] (03CR) 10Dzahn: [C: 03+2] DHCP: remove mw2215 through mw2242 [puppet] - 10https://gerrit.wikimedia.org/r/670953 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [21:25:19] (03Merged) 10jenkins-bot: Revert "all wikis to 1.36.0-wmf.34" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670954 (owner: 10Brennen Bearnes) [21:27:05] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2220.codfw.wmnet [21:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:14] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2221.codfw.wmnet [21:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2223.codfw.wmnet [21:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:24] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2222.codfw.wmnet [21:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:15] (03PS1) 10Legoktm: docker_registry_ha: Don't return HTTP 418 for GET/HEAD on restricted/ [puppet] - 10https://gerrit.wikimedia.org/r/670956 [21:33:04] brennen: sorry about that. let's roll back https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/665236. i'll revert. [21:33:41] (03PS1) 10Dzahn: site: remove mw2224 through mw2229 [puppet] - 10https://gerrit.wikimedia.org/r/670957 (https://phabricator.wikimedia.org/T277119) [21:34:12] (03PS1) 10Mholloway: Revert "Fix: Save user options only once when Advanced Mode is toggled" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670877 [21:34:16] mholloway: thanks! 
looks like we might have a client-error related blocker as well, just figuring that out [21:34:28] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28519/console" [puppet] - 10https://gerrit.wikimedia.org/r/670956 (owner: 10Legoktm) [21:36:28] (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker_registry_ha: Don't return HTTP 418 for GET/HEAD on restricted/ [puppet] - 10https://gerrit.wikimedia.org/r/670956 (owner: 10Legoktm) [21:37:52] mholloway: maybe we should instead merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/670958, but reverting is also fine [21:39:44] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) [21:40:07] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) 05Open→03Resolved @jcrepo: pinging you as i resolve this so you are aware! [21:41:32] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Fix: Save user options only once when Advanced Mode is toggled" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670877 (owner: 10Mholloway) [21:42:05] Urbanecm, mholloway: will go ahead with that backport unless i hear otherwise. [21:42:20] brennen: feel free to revert, that will definitely solve it [21:42:30] Jdlrobson: while i'm at it, should this be backported as well before we go to group2? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/670945/ [21:42:32] this is just a proposal of what can we do longer-term [21:42:51] Urbanecm: ack, thanks. [21:43:22] brennen: sounds good, thanks! [21:44:44] Jdlrobson: (and i guess who do we need review from on that one if so) [21:45:14] I've reached out to Jason Lineham to take a look [21:45:26] cool, ty [21:46:34] Urbanecm: Ah, sorry, missed that, might as well go ahead with the revert now. 
I was following the "revert first and investigate later" rule of thumb. [21:46:57] mholloway: yeah, revert first and ask question later is a good rule [21:47:00] so no opposion [21:47:10] (03PS1) 10Razzi: superset: allow analytics networks through staging firewall [puppet] - 10https://gerrit.wikimedia.org/r/670959 (https://phabricator.wikimedia.org/T272390) [21:48:29] (03CR) 10CDanis: P:base: add ability to manage services file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670918 (owner: 10Jbond) [21:48:35] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28520/console" [puppet] - 10https://gerrit.wikimedia.org/r/670959 (https://phabricator.wikimedia.org/T272390) (owner: 10Razzi) [21:49:02] (03CR) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [21:49:14] (03CR) 10Ottomata: [C: 03+1] superset: allow analytics networks through staging firewall [puppet] - 10https://gerrit.wikimedia.org/r/670959 (https://phabricator.wikimedia.org/T272390) (owner: 10Razzi) [21:49:18] (03CR) 10Razzi: [V: 03+1 C: 03+2] superset: allow analytics networks through staging firewall [puppet] - 10https://gerrit.wikimedia.org/r/670959 (https://phabricator.wikimedia.org/T272390) (owner: 10Razzi) [21:49:39] (03PS1) 10Jdlrobson: Do not log script errors without file uri [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670879 (https://phabricator.wikimedia.org/T266517) [21:56:48] ^ brennen [21:57:05] that patch can be backported too and then i think we are good to go [21:57:26] !log run populate pages in cognate (T259360) [21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:33] T259360: Cognate doesn't properly create interwiki links for Shawiya Wiktionary 
(shy.wiktionary.org) - https://phabricator.wikimedia.org/T259360 [21:57:49] (03CR) 10Brennen Bearnes: [C: 03+2] Do not log script errors without file uri [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670879 (https://phabricator.wikimedia.org/T266517) (owner: 10Jdlrobson) [21:58:19] Jdlrobson: thanks, will do. [21:58:35] * brennen waits on zuul [21:59:26] (03PS1) 10Ottomata: Add eventstreams clusters to monitor_services.pp [puppet] - 10https://gerrit.wikimedia.org/r/670960 (https://phabricator.wikimedia.org/T276305) [21:59:33] Jdlrobson: any form of testing needed / possible on that one, or just we roll forward and see whether there's a spike? [22:02:40] (03PS25) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [22:02:42] (03PS1) 10Andrew Bogott: cinderutils::ensure: Gracefully handle lvm legacy cases [puppet] - 10https://gerrit.wikimedia.org/r/670961 (https://phabricator.wikimedia.org/T272114) [22:02:44] (03PS1) 10Andrew Bogott: rake default_facts: add defaults for the new 'cinder_volumes' fact [puppet] - 10https://gerrit.wikimedia.org/r/670962 (https://phabricator.wikimedia.org/T272114) [22:03:11] (03PS2) 10Andrew Bogott: rake default_facts: add defaults for the new 'cinder_volumes' fact [puppet] - 10https://gerrit.wikimedia.org/r/670962 (https://phabricator.wikimedia.org/T272114) [22:03:13] (03PS2) 10Andrew Bogott: cinderutils::ensure: Gracefully handle lvm legacy cases [puppet] - 10https://gerrit.wikimedia.org/r/670961 (https://phabricator.wikimedia.org/T272114) [22:03:15] (03PS26) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [22:04:04] brennen: nope just that spike should go down :) [22:04:21] although because of caching it's still recovering [22:04:27] RECOVERY - SSH on 
mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:04:57] (03CR) 10Andrew Bogott: [C: 03+2] rake default_facts: add defaults for the new 'cinder_volumes' fact [puppet] - 10https://gerrit.wikimedia.org/r/670962 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [22:08:44] (03Merged) 10jenkins-bot: Revert "Fix: Save user options only once when Advanced Mode is toggled" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670877 (owner: 10Mholloway) [22:08:46] (03Merged) 10jenkins-bot: Do not log script errors without file uri [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670879 (https://phabricator.wikimedia.org/T266517) (owner: 10Jdlrobson) [22:09:11] 10SRE, 10OTRS, 10Security: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10MoritzMuehlenhoff) There's now a group of companies related to OTRS which will be collaborating on Znuny: https://www.otter-alliance.de/en/die-allianz.html [22:12:59] (03PS3) 10Ladsgroup: mailman3: Add exim4 configuration [puppet] - 10https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536) [22:13:16] (03CR) 10Ladsgroup: mailman3: Add exim4 configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup) [22:17:44] (03PS4) 10Ladsgroup: mailman3: Add exim4 configuration [puppet] - 10https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536) [22:18:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:20:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: 
All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:22:32] (03PS1) 10Bstorm: paws: add a hiera-controlled ip blocklist [puppet] - 10https://gerrit.wikimedia.org/r/670964 (https://phabricator.wikimedia.org/T276615) [22:24:04] (03PS1) 10Razzi: sre.zookeeper.reboot_workers: add cookbook to reboot zookeeper cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/670966 (https://phabricator.wikimedia.org/T273278) [22:26:30] (03PS8) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [22:27:21] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (owner: 10Jbond) [22:27:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:28:40] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [22:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:17] (03CR) 10Bstorm: [C: 03+1] "Seems like a good idea. I'm also finding myself regretting that we chose "mount_point" instead of "block_device", since the sense seems to" [puppet] - 10https://gerrit.wikimedia.org/r/670961 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [22:29:28] (03PS9) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 [22:30:11] (03CR) 10Andrew Bogott: "This ought to allow existing nodes (provisioned with lvm) to remain untouched and also support creation of new VMs with cinder volumes. 
N" [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [22:30:51] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/MobileFrontend/includes/: Backport: [[gerrit:670877|Revert "Fix: Save user options only once when Advanced Mode is toggled" (T277229)]] (duration: 01m 09s) [22:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:58] T277229: UserOptionsManager: Argument 1 passed to MediaWiki\User\UserOptionsManager::setOption() must implement interface MediaWiki\User\UserIdentity, null given - https://phabricator.wikimedia.org/T277229 [22:32:11] (03CR) 10Andrew Bogott: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/670961 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [22:33:35] (03PS2) 10Bstorm: paws: add a hiera-controlled ip blocklist [puppet] - 10https://gerrit.wikimedia.org/r/670964 (https://phabricator.wikimedia.org/T276615) [22:33:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:58] Jdlrobson, mholloway: both of those should be synced momentarily, then rolling forward again. 
[22:34:18] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: [[gerrit:670879|Do not log script errors without file uri (T266517)]] (duration: 01m 07s) [22:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:24] T266517: Do log errors with no url or file_uri instead of dropped from client side error instrumentation - https://phabricator.wikimedia.org/T266517 [22:35:28] (03PS2) 10Razzi: sre.zookeeper.reboot_workers: add cookbook to reboot zookeeper cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/670966 (https://phabricator.wikimedia.org/T273278) [22:35:38] (03CR) 10Bstorm: "I've not done this with a blank blocklist before, but it is always revertable if my assumption that no match is a good thing." [puppet] - 10https://gerrit.wikimedia.org/r/670964 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [22:36:16] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/670964 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [22:36:30] (03PS3) 10Razzi: sre.zookeeper.reboot_workers: add cookbook to reboot zookeeper cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/670966 (https://phabricator.wikimedia.org/T273278) [22:36:41] !log train status: 1.36.0-wmf.34 (T274938): T277229 and T266517 related issues hopefully resolved, rolling forward to all wikis [22:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:52] T274938: 1.36.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T274938 [22:36:53] T277229: UserOptionsManager: Argument 1 passed to MediaWiki\User\UserOptionsManager::setOption() must implement interface MediaWiki\User\UserIdentity, null given - https://phabricator.wikimedia.org/T277229 [22:39:19] (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670969 [22:39:20] (03CR) 10Brennen 
Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670969 (owner: 10Brennen Bearnes) [22:40:06] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670969 (owner: 10Brennen Bearnes) [22:41:33] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.34 [22:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:43] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:45:10] (03CR) 10Legoktm: [C: 03+1] scap: switch proxy for codfw A3 from mw2215 to mw2300 [puppet] - 10https://gerrit.wikimedia.org/r/670944 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [22:45:32] (03CR) 10Legoktm: [C: 03+1] "Seems a bit weird to have the same host be a scap proxy and a mcrouter proxy, but it makes sense and this is only temporary." [puppet] - 10https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [22:45:48] brennen: LGTM [22:46:24] Jdlrobson: rad, thanks. 
[22:47:41] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2216.codfw.wmnet [22:47:47] !log running DNS cookbook in an attempt to remove mw2216 [22:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:10] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [22:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:49:49] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [22:50:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:29] jouncebot: now [22:52:30] No deployments scheduled for the next 1 hour(s) and 7 minute(s) [22:54:33] (03CR) 10Bstorm: [C: 04-1] "Good news. 
The wikireplicas do not have any accounts using the pattern of \^p\d+\" [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [22:54:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2224.codfw.wmnet [22:54:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2225.codfw.wmnet [22:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:03] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2226.codfw.wmnet [22:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:09] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2227.codfw.wmnet [22:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:14] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2228.codfw.wmnet [22:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:57] !log depooled mw2224 through mw2228 but not removing from DSH groups yet (T277119) [22:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:03] T277119: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 [22:56:10] (03CR) 10Bstorm: [C: 04-1] "Same on toolsdb. So if we *do* make toolsdb accounts, this might be ready to go. 
I'm open to opinions on that, but I'm inclined to make an" [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [22:59:23] (03PS1) 10Ladsgroup: mailman3: Add let's encrypt parts for labs [puppet] - 10https://gerrit.wikimedia.org/r/670971 (https://phabricator.wikimedia.org/T256536) [23:01:38] (03PS2) 10Ladsgroup: mailman3: Add let's encrypt parts for labs [puppet] - 10https://gerrit.wikimedia.org/r/670971 (https://phabricator.wikimedia.org/T256536) [23:04:39] (03CR) 10Ladsgroup: "This is output of LE." [puppet] - 10https://gerrit.wikimedia.org/r/670971 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup) [23:05:38] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [23:12:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:12:55] Jdlrobson: thoughts about that recent wikimedia-client-errors alert? 
[23:17:33] 10SRE, 10Kubernetes: helm test fails in ci namespace - https://phabricator.wikimedia.org/T277252 (10jeena) [23:23:23] (03PS8) 10Legoktm: k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) [23:23:38] (03CR) 10jerkins-bot: [V: 04-1] k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [23:32:36] (03PS1) 10Ladsgroup: mailman3: Enable hyperkitty [puppet] - 10https://gerrit.wikimedia.org/r/670980 [23:34:10] (03PS9) 10Legoktm: k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) [23:34:22] (03CR) 10Legoktm: [C: 03+2] mailman3: Enable hyperkitty [puppet] - 10https://gerrit.wikimedia.org/r/670980 (owner: 10Ladsgroup) [23:34:45] (03PS2) 10Legoktm: mailman3: Enable hyperkitty [puppet] - 10https://gerrit.wikimedia.org/r/670980 (owner: 10Ladsgroup) [23:34:51] (03CR) 10Legoktm: [V: 03+2 C: 03+2] mailman3: Enable hyperkitty [puppet] - 10https://gerrit.wikimedia.org/r/670980 (owner: 10Ladsgroup) [23:36:57] (03CR) 10Legoktm: [C: 03+2] mailman3: Add exim4 configuration [puppet] - 10https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup) [23:39:30] (03PS3) 10Legoktm: mailman3: Add let's encrypt parts for labs [puppet] - 10https://gerrit.wikimedia.org/r/670971 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup) [23:41:50] (03CR) 10Legoktm: [C: 03+2] "This is fine for now, but I'd like to see the two configs unified so we're not duplicating this long-term." 
[puppet] - 10https://gerrit.wikimedia.org/r/670971 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup) [23:45:00] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28522/console" [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [23:48:13] 10SRE, 10Kubernetes: helm test fails in ci namespace - https://phabricator.wikimedia.org/T277252 (10jeena) This isn't urgent since we removed the helm test part from our pipeline as the readiness probe is doing a similar check. [23:51:08] (03PS1) 10Cwhite: logstash: rename logEvent exception into error.message [puppet] - 10https://gerrit.wikimedia.org/r/670986 (https://phabricator.wikimedia.org/T234565)