[00:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T0000).
[00:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:06:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:06:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10Jclark-ctr) @Cmjohnson  all host racked and cabled  netbox updated host port   logstash-be1033 39 logstash-be1034 21 logstash-be1035 7
[00:07:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10Jclark-ctr)
[00:08:05] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2363.codfw.wmnet'] `  an...
[00:09:00] <wikibugs>	 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Legoktm) Is this a problem with icinga?  ` legoktm@mwdebug1003:~$ /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50 OK: opcache is healthy `  Doesn't seem like a permissions issue ei...
[00:09:05] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2365.codfw.wmnet'] `  an...
[00:09:34] <wikibugs>	 (03PS5) 10Mstyles: update flink logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006)
[00:09:58] <wikibugs>	 (03CR) 10Mstyles: update flink logging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[00:10:07] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2367.codfw.wmnet'] `  an...
[00:10:39] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2369.codfw.wmnet'] `  an...
[00:13:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:17:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2363.codfw.wmnet
[00:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:02] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2367.codfw.wmnet
[00:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:18] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2365.codfw.wmnet
[00:18:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:43] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2369.codfw.wmnet
[00:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:38] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2363.codfw.wmnet
[00:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:44] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2365.codfw.wmnet
[00:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:50] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2367.codfw.wmnet
[00:19:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2369.codfw.wmnet
[00:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:10] <wikibugs>	 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) This always happens after reimaging a server and then disappears after it's been running for a while.
[00:22:49] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:22:57] <icinga-wm>	 PROBLEM - Host releases2002 is DOWN: PING CRITICAL - Packet loss = 100%
[00:23:51] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:26:56] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] nfs: set default monitors for 10Gb Ethernet [puppet] - 10https://gerrit.wikimedia.org/r/656269 (https://phabricator.wikimedia.org/T218338) (owner: 10Bstorm)
[00:27:32] <wikibugs>	 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) It's not consistent. For example mw2226 is OK but mw2224 and mw2225 have the alert but all 3 are buster and have been reimaged on the same day, 8 days ago.
[00:27:47] <wikibugs>	 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Legoktm) mwdebug1003 was one of the first servers to be reimaged and it's still critical after over a month though
[00:28:40] <legoktm>	 mutante: am I missing something? ^
[00:30:06] <mutante>	 legoktm: I don't know but it's not consistent within the same type of hardware that was changed on the same day
[00:30:15] <mutante>	 while mwdebug1003 is different in other ways
[00:30:21] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:44] * legoktm checks a non-debug server
[00:31:48] <legoktm>	 mw1265 has the same issue
[00:32:15] <mutante>	 yea, many have it but not ALL of them
[00:32:32] <legoktm>	 right
[00:33:10] <legoktm>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=opcache&servicestatustypes=29
[00:33:39] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:34:31] <mutante>	 as you already said, running the NRPE command locally on an affected host.. WORKS and is OK
[00:34:37] <mutante>	 confirmed that on yet another one
[00:35:57] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:23] <mutante>	 checking if that is REALLY the command that is in NRPE config
[00:36:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:36:44] <legoktm>	 where do you find that config?
[00:36:53] <mutante>	  /etc/nagios/nrpe.d
[00:37:23] <mutante>	   2 command[check_opcache]=/usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
[00:37:45] <mutante>	 OK: opcache is healthy
[00:37:58] <mutante>	 wtf is this :)
[00:38:36] <legoktm>	 🙃
[00:38:40] <mutante>	 it works locally but not from remote, it seems to be caused by buster but also not ALL buster hosts
[00:39:49] <wikibugs>	 (03PS1) 10Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596)
[00:39:56] <legoktm>	 could it be caching the result of an initial run and not updating properly?
[00:40:21] <mutante>	 there is parsing with jq ..twice
[00:40:40] <tabbycat>	 Am I reading https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/mediawiki/maintenance/initsitestats.pp#L4 right that the script just runs twice per month?
[00:42:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:42:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi)
[00:43:00] <legoktm>	 tabbycat: yeah. You can also ask a friendly sysadmin to run it manually for not-large wikis :)
[00:43:20] <mutante>	 tabbycat: it is "1 weeks 4 days" until the next time
[00:43:24] <legoktm>	 mutante: and some awk too
[00:43:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [x] db1166-db1176 (exceptions: db117[01]) have all had their default passwords changed to the idrac mgmt password.  [] Chris is going to check out db117[01] tomorro...
[00:43:39] <tabbycat>	 legoktm: ah, thanks. Well if we could have it run for tr.wikivoyage...
[00:43:52] <tabbycat>	 Scribunto went apesh** on Meta
[00:43:55] <mutante>	 we can start it right now if you want
[00:43:58] <tabbycat>	 due to not being able to fetch stats
[00:44:03] <legoktm>	 done
[00:44:05] <mutante>	 but that isn't specific to one wiki
[00:44:16] <tabbycat>	 it's also missing some wiktionary stats but I'm not sure which ones
[00:44:18] <legoktm>	 !log legoktm@mwmaint1002:~$ mwscript initSiteStats.php --wiki=trwikivoyage --update
[00:44:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:30] <mutante>	 tabbycat: NOT the analytics ones, just assign that to me
[00:44:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:45:16] <tabbycat>	 legoktm: awesome, thanks; now https://tr.wikivoyage.org/wiki/%C3%96zel:%C4%B0statistikler displays some data
[00:45:51] <icinga-wm>	 PROBLEM - PHP opcache health on mw2357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:47:17] <icinga-wm>	 PROBLEM - PHP opcache health on mw2361 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:47:38] <tabbycat>	 meh, scribunto still not fetching https://meta.wikimedia.org/wiki/Talk:Www.wikivoyage.org_template -- I suspect there are few more dependencies
[00:47:46] <tabbycat>	 too late to debug, off to bed now
[00:47:59] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn debugging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:47:59] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn debugging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:47:59] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw2361 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn debugging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:49:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH)
[00:51:30] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:52:32] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:52:59] <wikibugs>	 (03PS1) 10Cwhite: profile: send w3creportingapi logs to indexes with custom schema [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938)
[00:56:34] <icinga-wm>	 PROBLEM - PHP opcache health on mw2338 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:57:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:57:55] <wikibugs>	 (03CR) 10Cwhite: "Overall LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[01:00:04] <jouncebot>	 twentyafterfour: May I have your attention please! Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T0100)
[01:00:06] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1624931824 and 96 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:01:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 682521880 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:01:38] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:02:18] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 105152 and 68 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:34] <wikibugs>	 (03PS1) 10Ryan Kemper: Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444)
[01:02:44] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 146640 and 95 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:44] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 901291448 and 191 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:12] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 237856 and 242 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:15] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/657454 (owner: 10CRusnov)
[01:05:57] <wikibugs>	 (03PS2) 10Ryan Kemper: Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444)
[01:06:13] <wikibugs>	 (03CR) 10CRusnov: "Obviously tests are needed in -next before we deploy this to production. I'll be going through the changelog and looking for any issues wi" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/657454 (owner: 10CRusnov)
[01:07:16] <wikibugs>	 (03CR) 10Ryan Kemper: "Some things I'm unsure about:" [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper)
[01:07:57] <wikibugs>	 (03PS3) 10Ryan Kemper: Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444)
[01:09:14] <wikibugs>	 (03CR) 10Ryan Kemper: "The decommissioning will be done in this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/657453" [puppet] - 10https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper)
[01:10:28] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:12:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:14:04] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:14:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH)
[01:15:58] <wikibugs>	 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) >>! In T270517#6763884, @Legoktm wrote: > Is this a problem with icinga?  Yes! And it's really weird.   I tracked down the NRPE command that is run from Icinga and it behaves different on...
[01:17:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:18:36] <wikibugs>	 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Legoktm)
[01:19:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:22:15] <ryankemper>	 !log [WDQS Deploy] Tests on canary `wdqs1003` passing before start of deploy, proceeding with deploy of wdqs `0.3.60` to canary
[01:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:20] <logmsgbot>	 !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@70f9d37]: 0.3.60
[01:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:50] <ryankemper>	 !log [WDQS Deploy] Automated tests passing on canary `wdqs1003` but manually visiting `http://localhost:9999` (my tunnel to `wdqs1003`) gives `404 Not Found` from nginx; aborting deploy
[01:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:05] <ryankemper>	 !log [WDQS Deploy] Rollback of canary `wdqs1003` initiated
[01:26:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:13] <logmsgbot>	 !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@70f9d37]: 0.3.60 (duration: 02m 53s)
[01:26:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:01] <ryankemper>	 !log [WDQS Deploy] Rollback complete, service health of `wdqs1003` is restored. Need to investigate source of 404 (possibly related to some recent changes we made in the `gui` repo)
[01:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:02] <icinga-wm>	 RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[01:40:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH)
[01:41:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) John was onsite and fixed db117[01] for me, they are now online.  db11[56-65] have had bios and idrac firmware updates, and raid setup.  I've updated the task descr...
[01:42:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH)
[01:43:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:49:50] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:51:56] <icinga-wm>	 PROBLEM - PHP opcache health on mw2367 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:53:48] <icinga-wm>	 PROBLEM - PHP opcache health on mw2369 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:54:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:02:29] <wikibugs>	 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) upon further investigation I realized mw2226 is actually still stretch and I made a mistake to mark it as DONE in the etherpad for appserver upgrades... som...
[02:04:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_proton_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:06:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:10:40] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[02:11:58] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10Papaul) 05Open→03Resolved @herron I am closing this task, please feel free to open a decom task when server is ready for decommission   Thanks
[02:12:34] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[02:13:56] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:19:35] <wikibugs>	 10SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (10Dzahn)
[02:23:04] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:23:54] <icinga-wm>	 PROBLEM - PHP opcache health on mw2355 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:30:42] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:04] <wikibugs>	 (03PS1) 10Legoktm: admin: Update my (legoktm)'s dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/657458
[02:32:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:33:36] <wikibugs>	 (03PS2) 10Legoktm: admin: Update my (legoktm)'s dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/657458
[02:34:11] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] admin: Update my (legoktm)'s dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/657458 (owner: 10Legoktm)
[02:35:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:37:54] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:12] <wikibugs>	 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) I tried installing nagios-nrpe-server 4.0.3-1~bpo10+1 over 3.2.1-2 but that did not fix the issue either.
[02:43:01] <wikibugs>	 (03PS1) 10Legoktm: libraryupgrader: Update celery systemd units [puppet] - 10https://gerrit.wikimedia.org/r/657459
[02:51:10] <icinga-wm>	 PROBLEM - PHP opcache health on mw2359 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:57:42] <icinga-wm>	 RECOVERY - PHP opcache health on mw2225 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:03:55] <wikibugs>	 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) I found the issue. Changing line 28 in /usr/local/lib/nagios/plugins/nrpe_check_opcache to:   ` OUT=$(/usr/local/bin/php7adm /opcache-info | jq . 2>&1) `  f...
[03:09:11] <wikibugs>	 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) a:03Dzahn
[03:21:05] <wikibugs>	 (03PS1) 10Dzahn: nrpe_check_opcache: use full path to php7adm to fix opcache monitor on buster [puppet] - 10https://gerrit.wikimedia.org/r/657460 (https://phabricator.wikimedia.org/T270517)
[03:25:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] nrpe_check_opcache: use full path to php7adm to fix opcache monitor on buster [puppet] - 10https://gerrit.wikimedia.org/r/657460 (https://phabricator.wikimedia.org/T270517) (owner: 10Dzahn)
[03:26:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:27:12] <icinga-wm>	 PROBLEM - PHP opcache health on mw2363 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:27:55] <wikibugs>	 (03CR) 10Dzahn: "[alert1001:~] $ /usr/lib/nagios/plugins/check_nrpe -H mw2225.codfw.wmnet -c check_opcache" [puppet] - 10https://gerrit.wikimedia.org/r/657460 (https://phabricator.wikimedia.org/T270517) (owner: 10Dzahn)
[03:28:06] <icinga-wm>	 PROBLEM - PHP opcache health on mw2365 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:28:54] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:29:03] <mutante>	 legoktm: ^ fixing :)
[03:29:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw1265 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:29:30] <icinga-wm>	 RECOVERY - PHP opcache health on mw2240 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:29:30] <icinga-wm>	 RECOVERY - PHP opcache health on mw2363 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:29:46] <icinga-wm>	 RECOVERY - PHP opcache health on mw2255 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:29:52] <icinga-wm>	 RECOVERY - PHP opcache health on mw2329 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:29:52] <icinga-wm>	 RECOVERY - PHP opcache health on mw2335 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:29:52] <icinga-wm>	 RECOVERY - PHP opcache health on mw2357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:30:20] <icinga-wm>	 RECOVERY - PHP opcache health on mw2234 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:30:40] <wikibugs>	 10SRE, 10Icinga, 10observability, 10serviceops, 10Patch-For-Review: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) ` 03:28 <+icinga-wm> RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Ap...
[03:30:41] <legoktm>	 mutante: woooww nice find! So the PATH was off on the buster hosts? 
[03:30:48] <icinga-wm>	 RECOVERY - PHP opcache health on mw2277 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:31:11] <mutante>	 legoktm: yea, for some reason on stretch it worked without full path 
[03:31:16] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:17] <mutante>	 but not anymore
[03:31:20] <icinga-wm>	 RECOVERY - PHP opcache health on mw2310 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:31:28] <mutante>	 though generally it's recommended to use full path in the plugins
[03:31:41] <mutante>	 php7adm is in the same location
[03:31:42] * legoktm nods
[03:32:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2231 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:32:28] <mutante>	 /usr/local/bin seems to be in $PATH when I echo it .. but .. yea
[03:32:32] <icinga-wm>	 RECOVERY - PHP opcache health on mw2325 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:32:44] <icinga-wm>	 RECOVERY - PHP opcache health on mw2369 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:32:46] <icinga-wm>	 RECOVERY - PHP opcache health on mw2274 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:33:12] <icinga-wm>	 RECOVERY - PHP opcache health on mw2327 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:33:45] <mutante>	 legoktm: and then the little bonus things: it did not exit with an error but claimed the 99.85% when it could not find php7adm .. and I had marked a server as buster that is actually stretch :)
[03:34:04] <mutante>	 alright, with all the recoveries now i'll head out. cya
[03:34:11] <legoktm>	 Bye :))
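Editor's note: the exchange above names two problems with the check plugin. It called its helper without a full path, and it reported a value instead of an error when the helper was not found. A minimal sketch of the defensive pattern follows, assuming a Nagios-style plugin. It is not the actual nrpe_check_opcache code. The HELPER path and the check_helper and main functions are hypothetical names chosen for illustration. Only the helper name php7adm and the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are taken as given.

```python
import os
import sys

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical absolute location; the log only gives the helper's name.
HELPER = "/usr/local/bin/php7adm"


def check_helper(path):
    """Return (exit_code, message) saying whether the helper can be run.

    A relative name depends on the PATH of whatever runs the plugin,
    so it is rejected instead of being looked up implicitly.
    """
    if not os.path.isabs(path):
        return UNKNOWN, f"UNKNOWN: helper path {path!r} is not absolute"
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        return UNKNOWN, f"UNKNOWN: helper {path!r} not found or not executable"
    return OK, f"OK: helper {path!r} is available"


def main(argv=None):
    """Print a one-line status and return the plugin exit code."""
    argv = sys.argv if argv is None else argv
    path = argv[1] if len(argv) > 1 else HELPER
    code, message = check_helper(path)
    print(message)
    return code
```

A real plugin would end with `sys.exit(main())` and would only go on to query the opcache once this check passes, so a missing helper shows up as UNKNOWN instead of a misleading hit rate.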
[03:34:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2315 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:34:26] <icinga-wm>	 RECOVERY - PHP opcache health on mw2233 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:34:30] <icinga-wm>	 RECOVERY - PHP opcache health on mw2303 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:35:12] <icinga-wm>	 RECOVERY - PHP opcache health on mw2316 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:35:40] <icinga-wm>	 RECOVERY - PHP opcache health on mw1276 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:36:56] <icinga-wm>	 RECOVERY - PHP opcache health on mw1277 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:36:58] <icinga-wm>	 RECOVERY - PHP opcache health on mw2313 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:37:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw2331 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:37:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:37:36] <icinga-wm>	 RECOVERY - PHP opcache health on mw2236 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:38:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:38:30] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:48] <icinga-wm>	 RECOVERY - PHP opcache health on mw1267 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:38:52] <wikibugs>	 10SRE, 10Icinga, 10observability, 10serviceops, 10Patch-For-Review: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) 05Open→03Resolved
[03:38:55] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn)
[03:39:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2230 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:39:44] <icinga-wm>	 RECOVERY - PHP opcache health on mw2275 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:40:20] <icinga-wm>	 RECOVERY - PHP opcache health on mw2367 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:40:22] <icinga-wm>	 RECOVERY - PHP opcache health on mw2238 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:41:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2269 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:41:42] <icinga-wm>	 RECOVERY - PHP opcache health on mw2235 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:42:24] <icinga-wm>	 RECOVERY - PHP opcache health on mw2228 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:42:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2312 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:42:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2273 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:43:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2314 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:43:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2311 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:43:46] <icinga-wm>	 RECOVERY - PHP opcache health on parse2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:43:58] <icinga-wm>	 RECOVERY - PHP opcache health on mw2307 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:44:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw2351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:44:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw2339 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:45:30] <icinga-wm>	 RECOVERY - PHP opcache health on mw2268 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:45:30] <icinga-wm>	 RECOVERY - PHP opcache health on mw2227 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:46:24] <icinga-wm>	 RECOVERY - PHP opcache health on mw2338 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:46:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2270 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:46:52] <icinga-wm>	 RECOVERY - PHP opcache health on mw2224 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:47:18] <icinga-wm>	 RECOVERY - PHP opcache health on mw2243 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:47:54] <icinga-wm>	 RECOVERY - PHP opcache health on mw2232 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:47:54] <icinga-wm>	 RECOVERY - PHP opcache health on mw2237 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:48:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2305 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:51:08] <icinga-wm>	 RECOVERY - PHP opcache health on mw2359 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:51:24] <icinga-wm>	 RECOVERY - PHP opcache health on mw1266 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:52:24] <icinga-wm>	 RECOVERY - PHP opcache health on mw2242 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:52:24] <icinga-wm>	 RECOVERY - PHP opcache health on mw2239 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:52:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:52:33] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@57589e7]: Minor typo fix
[03:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:53:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2333 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:53:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2337 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:53:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2361 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:53:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2355 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:53:58] <icinga-wm>	 RECOVERY - PHP opcache health on mw2241 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:53:58] <icinga-wm>	 RECOVERY - PHP opcache health on mw2309 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:54:00] <icinga-wm>	 RECOVERY - PHP opcache health on mw2365 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:54:13] <logmsgbot>	 !log milimetric@deploy1001 deploy aborted: Minor typo fix (duration: 01m 39s)
[03:54:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:44] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:55:52] <icinga-wm>	 RECOVERY - PHP opcache health on mw2229 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:56:32] <wikibugs>	 (03PS5) 10Andrew Bogott: nova vendordata/firstboot: move puppet config into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273)
[03:57:32] <icinga-wm>	 RECOVERY - PHP opcache health on mw2258 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:06:22] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:08:28] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:08:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:08:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata/firstboot: move puppet config into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[04:10:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:15:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:17:48] <wikibugs>	 10Puppet, 10SRE: Unused puppet modules audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup)
[04:18:04] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:02] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:25:56] <wikibugs>	 (03PS1) 10Andrew Bogott: Nova: reload api and api-metadata service when the vendordata source changes [puppet] - 10https://gerrit.wikimedia.org/r/657462 (https://phabricator.wikimedia.org/T271273)
[04:25:58] <wikibugs>	 (03PS1) 10Andrew Bogott: nova firstboot script: remove file updates that are handled by puppet [puppet] - 10https://gerrit.wikimedia.org/r/657463 (https://phabricator.wikimedia.org/T271273)
[04:26:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Nova: reload api and api-metadata service when the vendordata source changes [puppet] - 10https://gerrit.wikimedia.org/r/657462 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[04:32:46] <wikibugs>	 (03PS2) 10Andrew Bogott: Nova: reload api and api-metadata service when the vendordata source changes [puppet] - 10https://gerrit.wikimedia.org/r/657462 (https://phabricator.wikimedia.org/T271273)
[04:32:48] <wikibugs>	 (03PS2) 10Andrew Bogott: nova firstboot script: remove file updates that are handled by puppet [puppet] - 10https://gerrit.wikimedia.org/r/657463 (https://phabricator.wikimedia.org/T271273)
[04:33:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Nova: reload api and api-metadata service when the vendordata source changes [puppet] - 10https://gerrit.wikimedia.org/r/657462 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[04:34:24] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:39:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:41:36] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:49:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova firstboot script: remove file updates that are handled by puppet [puppet] - 10https://gerrit.wikimedia.org/r/657463 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[04:51:16] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01163 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[05:01:46] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:08:10] <wikibugs>	 (03PS1) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273)
[05:11:08] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:18:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:22:25] <wikibugs>	 (03CR) 10Andrew Bogott: "/usr/local/sbin/make-instance-vg: lvm is not active on this host; unable to create a volume." [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[05:30:00] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:32:10] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:06] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:22] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:59:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:16:39] <wikibugs>	 (03PS1) 10Marostegui: production-m2.sql.erb: Add INDEX grant to sockpuppet_import user [puppet] - 10https://gerrit.wikimedia.org/r/657468 (https://phabricator.wikimedia.org/T272533)
[06:19:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) @Cmjohnson unfortunately the server isn't accessible yet - I cannot even reach its idrac :-( ` root@cumin1001:~# ping clouddb1019.eqiad.wmnet -c5 PING clouddb1019.eqiad.wmnet (10.64.48.9) 56(84)...
[06:20:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add INDEX grant to sockpuppet_import user [puppet] - 10https://gerrit.wikimedia.org/r/657468 (https://phabricator.wikimedia.org/T272533) (owner: 10Marostegui)
[06:31:10] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:30] <wikibugs>	 10Puppet, 10SRE: Unused puppet modules audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Joe) Your methodology is not 100% accurate, so before removing anything I'd verify with the authors/service owners as there can be some false positives.  Also: - Let's exclude third-party modules like `stdli...
[06:37:50] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:37:54] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:10] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:42:14] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:44:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:03] <wikibugs>	 (03PS1) 10Marostegui: db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657469
[06:49:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 and pool db1099:3318 into s8 vslow', diff saved to https://phabricator.wikimedia.org/P13860 and previous config saved to /var/cache/conftool/dbconfig/20210121-064903-marostegui.json
[06:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:20] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657469 (owner: 10Marostegui)
[06:53:36] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:54:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087', diff saved to https://phabricator.wikimedia.org/P13861 and previous config saved to /var/cache/conftool/dbconfig/20210121-065408-marostegui.json
[06:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:26] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1087: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657432
[06:55:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P13862 and previous config saved to /var/cache/conftool/dbconfig/20210121-065459-marostegui.json
[06:55:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:07] <wikibugs>	 10Puppet, 10SRE: Unused puppet modules audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) >>! In T272559#6764172, @Joe wrote: > Your methodology is not 100% accurate, so before removing anything I'd verify with the authors/service owners as there can be some false positives. >  > Also:...
[06:55:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1087: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657432 (owner: 10Marostegui)
[06:56:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:16] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:59:11] <wikibugs>	 10Puppet, 10SRE: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup)
[07:01:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host clouddb1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Marostegui T272125
[07:03:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:03:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P13863 and previous config saved to /var/cache/conftool/dbconfig/20210121-070346-marostegui.json
[07:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:34] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:08] <wikibugs>	 (03PS4) 10Effie Mouzeli: scap: enable logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi)
[07:20:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:21:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P13864 and previous config saved to /var/cache/conftool/dbconfig/20210121-072101-marostegui.json
[07:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:35] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:45] <wikibugs>	 10Puppet, 10SRE: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Joe) >>! In T272559#6764178, @Ladsgroup wrote: >> - It could be interesting to audit what is in puppetdb and check it against what is in the puppet tree. I suspect there is more stale stuff in site.pp than...
[07:30:09] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:49] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:36:15] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:36:55] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:10] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] scap: enable logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi)
[07:38:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] scap: disable udp logging [puppet] - 10https://gerrit.wikimedia.org/r/657136 (https://phabricator.wikimedia.org/T227080) (owner: 10Effie Mouzeli)
[07:38:14] <wikibugs>	 (03PS2) 10Effie Mouzeli: scap: disable udp logging [puppet] - 10https://gerrit.wikimedia.org/r/657136 (https://phabricator.wikimedia.org/T227080)
[07:38:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:40:55] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:42:33] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:43:23] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:45:13] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:41] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[07:58:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27561/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[08:00:48] <icinga-wm>	 PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:30] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:08:03] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper)
[08:08:56] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:11:00] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki: reduce the number of cached keys that trigger a restart [puppet] - 10https://gerrit.wikimedia.org/r/657398 (https://phabricator.wikimedia.org/T245183)
[08:11:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove tor::instance [puppet] - 10https://gerrit.wikimedia.org/r/657531 (https://phabricator.wikimedia.org/T272559)
[08:14:20] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:16:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:20:13] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: reduce the number of cached keys that trigger a restart [puppet] - 10https://gerrit.wikimedia.org/r/657398 (https://phabricator.wikimedia.org/T245183) (owner: 10Effie Mouzeli)
[08:22:44] <wikibugs>	 (03PS4) 10Effie Mouzeli: modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy)
[08:22:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:24:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:08] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:25:54] <wikibugs>	 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10fgiunchedi) Thank you @Cmjohnson ! contacting dell SGTM
[08:26:36] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:26:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] swift: decrease object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/656837 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi)
[08:28:19] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[08:29:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[08:30:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove tor::instance [puppet] - 10https://gerrit.wikimedia.org/r/657531 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff)
[08:31:20] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff)
[08:31:56] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:03] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff)
[08:33:37] <wikibugs>	 (03CR) 10Elukey: profile::analytics::refinery::job::hdfs_cleaner Update (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[08:34:18] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:34:47] <godog>	 !log roll-restart swift-object in codfw to apply new concurrency
[08:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:04] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff)
[08:36:21] <wikibugs>	 (03CR) 10Elukey: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[08:36:30] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:37:52] <marostegui>	 !log Silence m1 hosts in preparation for the restart T271540
[08:37:53] <wikibugs>	 (03CR) 10Elukey: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[08:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:56] <stashbot>	 T271540: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540
[08:38:25] <wikibugs>	 (03CR) 10Elukey: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[08:38:40] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:41:42] <elukey>	 is there maintenance for cr2-esams?
[08:42:27] <wikibugs>	 (03PS2) 10Jcrespo: admin: Add wikitrent to the list of privileged LDAP accounts [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489)
[08:42:35] <elukey>	 This is the Lumen link to eqiad 
[08:43:16] <godog>	 !log swift codfw-prod: more weight to ms-be20[58-61] - T269337
[08:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:20] <elukey>	 yes it seems so, they are fixing the link
[08:43:20] <stashbot>	 T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[08:44:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:45:01] <wikibugs>	 (03PS9) 10Effie Mouzeli: varnish: Set debug=1 in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683)
[08:45:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27562/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[08:51:41] <jynus>	 !log stopping puppet and bacula for backup1001 T271540
[08:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:46] <stashbot>	 T271540: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540
[08:52:12] <marostegui>	 akosiaris jynus pre-steps done
[08:52:55] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10fgiunchedi) Thank you @Cmjohnson ! Doesn't look like the host likes the new disk :(  Once ms-be1046 is repaired in T272396 I'll start decom of one host so there will be spare HP 4TB drives.  ` => ld 11 modify reenable...
[08:53:55] <jynus>	 marostegui, I can confirm bacula down
[08:54:06] <marostegui>	 \o/
[08:54:27] <jynus>	 prometheus alert for monitoring may happen
[08:54:43] <marostegui>	 and etherpad alert might too
[08:54:48] <marostegui>	 I am ready to restart etherpad anyways
[08:54:59] <jynus>	 the one that gathers https://grafana.wikimedia.org/d/413r2vbWk/bacula
[08:55:36] <jynus>	 and the one for zarcillo
[08:55:53] <jynus>	 sorry
[08:56:04] <jynus>	 not zarcillo, dbbackups
[08:56:09] <jynus>	 but that does not have an alert
[08:56:35] <jynus>	 the proxy will complain too?
[08:56:59] <marostegui>	 depends on how long it takes; the alert might not happen
[08:57:23] <jynus>	 not worrying, just trying to think of all potential alerts so people don't worry
[08:57:29] <jynus>	 *other
[08:57:57] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) Most of the openstack ones are dynamically imported, see [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/manifests...
[08:58:15] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro)
[08:58:31] <wikibugs>	 (03PS1) 10Ladsgroup: eventlogging: Remove multiple unused modules [puppet] - 10https://gerrit.wikimedia.org/r/657538 (https://phabricator.wikimedia.org/T272559)
[08:59:02] <jynus>	 for the next step, you will dump the buffer pool, disable the automatic pool dump on shutdown, and then reduce the buffer pool ratio; is that what it means?
[08:59:12] <marostegui>	 that is done
[08:59:16] <jynus>	 cool
[08:59:19] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[08:59:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:59:55] <jynus>	 ^this is what I meant before
[09:00:02] <marostegui>	 let's go?
[09:00:07] <jynus>	 +1
[09:00:11] <marostegui>	 !log m1 master restart - T271540
[09:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:18] <stashbot>	 T271540: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540
[09:00:23] <marostegui>	 stopping
[09:00:33] <jynus>	 let us know if errors or success
[09:00:35] <marostegui>	 starting
[09:00:51] <marostegui>	 started
[09:00:52] <marostegui>	 checking
[09:01:09] <marostegui>	 everything should be back to normal
[09:01:10] <marostegui>	 checking etherpad
[09:01:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27563/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[09:01:21] <marostegui>	 I can write fine
[09:01:45] <marostegui>	 etherpad logs looking ok
[09:02:02] <marostegui>	 I can create a new etherpad, so it looks good
[09:02:21] <marostegui>	 reloading the non-active proxy now
[09:03:05] <jynus>	 active proxy didn't failover?
[09:03:13] <marostegui>	 it was too fast :)
[09:03:26] <marostegui>	 librenms looking good
[09:03:33] <jynus>	 akosiaris, just check any alert/anything wrong you can check :-)
[09:03:41] <marostegui>	 rt looking good
[09:03:43] <jynus>	 will wait to reenable bacula
[09:03:57] <akosiaris>	 jynus: ok
[09:03:57] <marostegui>	 racktables looking good
[09:04:04] <marostegui>	 Everything seems to be working fine
[09:04:09] <moritzm>	 IDP is also just fine, just tested access to a U2F token
[09:04:21] <marostegui>	 thank you moritzm 
[09:05:21] <jynus>	 will reenable bacula if everything else looks good and redo missed gerrit backup
[09:07:02] <wikibugs>	 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10JMeybohm) I don't see anything interesting in the 2.7.1 release (https://github.com/docker/distribution/releases/tag/v2.7.1, https://metadata.ftp-master.debian.org/changelogs//main/d/docker-regi...
[09:07:04] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:25] <marostegui>	 jynus: everything looks good yep
[09:09:22] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:11:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:11:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:13:24] <jynus>	 ok, that was the alert I was waiting to recover
[09:14:27] <jynus>	 there are a few puppet wmcs/no resources failures since 4:38, but not related to this
[09:14:43] <jynus>	 arturo ^
[09:15:02] <arturo>	 in a meeting
[09:15:29] <jynus>	 (no rush, just a friendly heads up ping 0:-))
[09:15:38] <jynus>	 doesn't impact our team
[09:20:41] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) > Is there a public API of it? it'd be amazing.  Here are the docs for the [[ https://puppet.com/docs/puppetdb/6.13/api/index.html | puppetdb API ]].  You can run curl commands...
[09:21:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:23:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:26:54] <wikibugs>	 (03Abandoned) 10Thiemo Kreuz (WMDE): [POC] Convert all Wikipedia logos to (true) grayscale [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609584 (https://phabricator.wikimedia.org/T252108) (owner: 10Thiemo Kreuz (WMDE))
[09:30:32] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:06] <wikibugs>	 (03PS1) 10Filippo Giunchedi: debian: add packaging [debs/alertmanager-webhook-logger] - 10https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474)
[09:37:18] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:26] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:38:54] <icinga-wm>	 PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:41:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:44:09] <hoo>	 !log Updated the Wikidata property suggester with data from the 2021-01-11 JSON dump and applied the T132839 workarounds
[09:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:14] <stashbot>	 T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[09:47:25] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "I think there are a couple of bugs to fix, see inline for the details" (034 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov)
[09:49:30] <icinga-wm>	 PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:01] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Diff looks reasonable" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/657454 (owner: 10CRusnov)
[09:52:36] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (10lilients_WMDE) I can now access the event logging metrics. I also got the mail for kerberos. Thank you for the support!
[09:54:13] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10akosiaris)
[09:54:32] <wikibugs>	 (03PS1) 10Kormat: udp2log: Install bsection [puppet] - 10https://gerrit.wikimedia.org/r/657543
[09:55:31] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10akosiaris) I've marked T272111 as a parent of this task for greater visibility. This one seems more generic than the sp...
[09:55:56] <wikibugs>	 (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27564/console" [puppet] - 10https://gerrit.wikimedia.org/r/657543 (owner: 10Kormat)
[09:57:38] <wikibugs>	 (03CR) 10Kormat: [V: 03+1] "Apparently i forgot to do this when i created bsection." [puppet] - 10https://gerrit.wikimedia.org/r/657543 (owner: 10Kormat)
[10:01:27] <wikibugs>	 (03CR) 10Gehel: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper)
[10:03:09] <wikibugs>	 (03CR) 10Gehel: "LGTM (with Moritz comments)." [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper)
[10:19:07] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[10:20:09] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] "LGTM" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[10:20:44] <wikibugs>	 (03Merged) 10jenkins-bot: update flink logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[10:30:48] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:46] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:30] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:41:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
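The cr2-esams alerts above flap repeatedly, so it can help to see how long each episode lasted. The following is a minimal sketch, not a tool used in the channel: it assumes the `[HH:MM:SS] <icinga-wm> PROBLEM|RECOVERY - <check> on <host> is ...` line shape seen in this log, and the names `LINE_RE`, `outage_durations`, and `sample` are hypothetical. It pairs each PROBLEM with the next RECOVERY for the same check and host, and it does not handle a log that crosses midnight.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the icinga-wm messages in this log:
# [08:24:24] <icinga-wm>  PROBLEM - Router interfaces on cr2-esams is CRITICAL: ...
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] <icinga-wm>\s+"
    r"(PROBLEM|RECOVERY) - (.+?) on (\S+) is "
)


def outage_durations(lines):
    """Pair each PROBLEM with the next RECOVERY for the same (check, host).

    Returns a list of (check, host, timedelta) tuples in recovery order.
    Timestamps carry no date, so a log spanning midnight is not handled.
    """
    open_problems = {}
    outages = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not an icinga-wm state change
        stamp, state, check, host = match.groups()
        when = datetime.strptime(stamp, "%H:%M:%S")
        key = (check, host)
        if state == "PROBLEM":
            # Keep the first PROBLEM if the alert repeats before recovering.
            open_problems.setdefault(key, when)
        elif key in open_problems:
            outages.append((check, host, when - open_problems.pop(key)))
    return outages


# Four lines shortened from the log above.
sample = [
    "[08:24:24] <icinga-wm> PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244",
    "[08:26:36] <icinga-wm> RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244",
    "[10:39:30] <icinga-wm> PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244",
    "[10:41:54] <icinga-wm> RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244",
]

result = outage_durations(sample)
for check, host, duration in result:
    print(f"{check} on {host}: down for {duration}")
```

Keying on the (check, host) pair keeps the OSPF and router-interface alerts on the same router from being matched against each other.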
[10:50:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment (but feel free to ignore)" (031 comment) [debs/alertmanager-webhook-logger] - 10https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474) (owner: 10Filippo Giunchedi)
[10:55:14] <wikibugs>	 (03PS3) 10Elukey: varnish: block python-request UA bots for AQS [puppet] - 10https://gerrit.wikimedia.org/r/657288
[10:57:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo)
[10:57:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile: ecs indices to use a weekly rotation [puppet] - 10https://gerrit.wikimedia.org/r/657371 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[10:59:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[11:00:04] <jouncebot>	 mvolz: Your horoscope predicts another unfortunate Services – Citoid /  Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1100).
[11:00:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:01:23] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper)
[11:02:07] <wikibugs>	 (03CR) 10Gehel: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper)
[11:02:28] <wikibugs>	 (03PS2) 10Filippo Giunchedi: debian: add packaging [debs/alertmanager-webhook-logger] - 10https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474)
[11:02:36] <wikibugs>	 (03PS1) 10Hnowlan: similar-users: release new container version with unicode parsing fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657546
[11:02:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: debian: add packaging (031 comment) [debs/alertmanager-webhook-logger] - 10https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474) (owner: 10Filippo Giunchedi)
[11:03:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [debs/alertmanager-webhook-logger] - 10https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474) (owner: 10Filippo Giunchedi)
[11:03:48] <wikibugs>	 (03PS1) 10Elukey: profile::analytics::cluster::users: add analytics to the druid group [puppet] - 10https://gerrit.wikimedia.org/r/657547
[11:06:05] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] similar-users: release new container version with unicode parsing fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657546 (owner: 10Hnowlan)
[11:07:08] <wikibugs>	 (03PS2) 10Elukey: profile::analytics::cluster::users: add analytics to the druid group [puppet] - 10https://gerrit.wikimedia.org/r/657547
[11:07:28] <wikibugs>	 (03Merged) 10jenkins-bot: similar-users: release new container version with unicode parsing fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/657546 (owner: 10Hnowlan)
[11:12:46] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[11:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:53] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[11:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:49] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27566/console" [puppet] - 10https://gerrit.wikimedia.org/r/657547 (owner: 10Elukey)
[11:15:51] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "LGTM and tests are green:" [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey)
[11:18:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] varnish: block python-request UA bots for AQS [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey)
[11:18:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] debian: add packaging [debs/alertmanager-webhook-logger] - 10https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474) (owner: 10Filippo Giunchedi)
[11:19:40] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10ArielGlenn)
[11:20:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr/firewall.conf: cloud-in4: introduce ACL for novafullstack [homer/public] - 10https://gerrit.wikimedia.org/r/657358 (https://phabricator.wikimedia.org/T272486) (owner: 10Arturo Borrero Gonzalez)
[11:21:05] <wikibugs>	 (03Merged) 10jenkins-bot: cr/firewall.conf: cloud-in4: introduce ACL for novafullstack [homer/public] - 10https://gerrit.wikimedia.org/r/657358 (https://phabricator.wikimedia.org/T272486) (owner: 10Arturo Borrero Gonzalez)
[11:28:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085', diff saved to https://phabricator.wikimedia.org/P13867 and previous config saved to /var/cache/conftool/dbconfig/20210121-112849-marostegui.json
[11:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:24] <marostegui>	 !log Stop replication on db1085 to move wiki replicas under the other sanitarium host
[11:29:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:13] <wikibugs>	 (03PS1) 10Ayounsi: Remove unused roles librenms and rancid [puppet] - 10https://gerrit.wikimedia.org/r/657551 (https://phabricator.wikimedia.org/T272559)
[11:31:26] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:46] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - 10https://gerrit.wikimedia.org/r/657439
[11:32:18] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - 10https://gerrit.wikimedia.org/r/657439 (owner: 10Arturo Borrero Gonzalez)
[11:32:51] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - 10https://gerrit.wikimedia.org/r/657439 (https://phabricator.wikimedia.org/T209082)
[11:33:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - 10https://gerrit.wikimedia.org/r/657439 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[11:33:44] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - 10https://gerrit.wikimedia.org/r/657439 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[11:35:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui)
[11:35:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13868 and previous config saved to /var/cache/conftool/dbconfig/20210121-113533-root.json
[11:35:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:22] <wikibugs>	 (03PS2) 10Hnowlan: services: similar-users discovery and LVS component [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837)
[11:38:04] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13870 and previous config saved to /var/cache/conftool/dbconfig/20210121-115036-root.json
[11:50:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:17] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove unused roles librenms and rancid [puppet] - 10https://gerrit.wikimedia.org/r/657551 (https://phabricator.wikimedia.org/T272559) (owner: 10Ayounsi)
[11:54:57] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10ayounsi)
[11:57:55] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Now with the proper exceptions should work ok 😊" [homer/public] - 10https://gerrit.wikimedia.org/r/657439 (https://phabricator.wikimedia.org/T209082) (owner: 10Arturo Borrero Gonzalez)
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1200).
[12:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[12:03:54] <Lucas_WMDE>	 yup, looks like nothing to do
[12:05:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13871 and previous config saved to /var/cache/conftool/dbconfig/20210121-120540-root.json
[12:05:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:21] <wikibugs>	 (03CR) 10Volans: "Nice addition! Minor things inline." (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond)
[12:07:03] <wikibugs>	 (03PS2) 10Matthias Mullie: Add global to indicate that elastic LTR features are available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646663
[12:07:17] <wikibugs>	 (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/657547 (owner: 10Elukey)
[12:20:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13872 and previous config saved to /var/cache/conftool/dbconfig/20210121-122043-root.json
[12:20:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
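The db1085 repool above goes in four steps (25%, 50%, 75%, 100%) that are roughly 15 minutes apart. The following is a minimal sketch of that stepped schedule, only to illustrate the timing: it does not call dbctl, the 15-minute interval is inferred from the timestamps in this log rather than stated anywhere, and `repool_schedule` is a hypothetical name.

```python
from datetime import datetime, timedelta


def repool_schedule(start, percentages=(25, 50, 75, 100),
                    interval=timedelta(minutes=15)):
    """Return (time, percentage) pairs for a stepped repool.

    The default steps and interval are assumptions read off the log,
    not a documented policy.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(percentages)]


# Start time of the first "(re)pooling @ 25%" commit in the log above.
start = datetime(2021, 1, 21, 11, 35, 33)
schedule = repool_schedule(start)
for when, pct in schedule:
    print(f"{when:%H:%M:%S} pool db1085 at {pct}%")
```

The planned times land within a few seconds of the commits actually logged at 11:50:37, 12:05:40, and 12:20:44.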
[12:21:35] <wikibugs>	 (03PS1) 10Marostegui: sys: Add the current version of sys database. [software] - 10https://gerrit.wikimedia.org/r/657558
[12:22:41] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] sys: Add the current version of sys database. [software] - 10https://gerrit.wikimedia.org/r/657558 (owner: 10Marostegui)
[12:27:41] <wikibugs>	 (03PS15) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:29:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:31:18] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:47] <wikibugs>	 (03PS1) 10Ladsgroup: logstash: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953)
[12:33:59] <wikibugs>	 (03CR) 10Hnowlan: start using imposm as OSM sync tool (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:35:02] <icinga-wm>	 PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:35:34] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2021 is CRITICAL: CRITICAL - load average: 84.11, 108.71, 66.41 https://wikitech.wikimedia.org/wiki/Swift
[12:37:58] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:39:18] <icinga-wm>	 RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:39:41] <wikibugs>	 (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27569/" [puppet] - 10https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[12:39:58] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2021 is OK: OK - load average: 22.88, 60.73, 56.86 https://wikitech.wikimedia.org/wiki/Swift
[12:58:34] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:29] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) @jbond  Thanks for the detailed comment. I will definitely use it to redo most of the work the script does but one big problem. Since I'm no SRE, I can't login to puppetdb1...
[13:12:00] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete role [puppet] - 10https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559)
[13:21:25] <wikibugs>	 (03PS1) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[13:21:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond)
[13:22:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[13:23:10] <wikibugs>	 (03PS3) 10JMeybohm: docker_registry_ha: Add "Vary: Accept" to response [puppet] - 10https://gerrit.wikimedia.org/r/650153 (https://phabricator.wikimedia.org/T256762)
[13:24:16] <wikibugs>	 (03PS1) 10A2569875: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612)
[13:24:18] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875)
[13:25:26] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] docker_registry_ha: Add "Vary: Accept" to response [puppet] - 10https://gerrit.wikimedia.org/r/650153 (https://phabricator.wikimedia.org/T256762) (owner: 10JMeybohm)
[13:26:52] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Have you had any luck with those tests yet? Otherwise we’re pretty stuck here :/" [puppet] - 10https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: 10Hoo man)
[13:31:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "The patch looks good to me. I'd suggest to move the system user/group handling to systemd::sysuser to simplify things, but that's unrelate" [puppet] - 10https://gerrit.wikimedia.org/r/657547 (owner: 10Elukey)
[13:32:10] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:12] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) @Ladsgroup managed to nerdsnipe me good on this one :).  I have created a [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/657571 | CR  ]] which is mostly the logic in y...
[13:33:22] <wikibugs>	 (03PS2) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[13:38:13] <XioNoX>	 !log put eqiad/esams lumen link back in service
[13:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:10] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:02] <wikibugs>	 10SRE, 10netops: eqiad-esams link issue - https://phabricator.wikimedia.org/T272524 (10ayounsi) 05Open→03Resolved Back in service.
[13:44:48] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 113.71, 100.11, 70.21 https://wikitech.wikimedia.org/wiki/Swift
[13:48:06] <wikibugs>	 (03PS1) 10Marostegui: *.sql: Add sql_log_bin=0 [software] - 10https://gerrit.wikimedia.org/r/657574
[13:48:09] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) {meme, src="antoine-approve", below="{{done\}\}"}
[13:49:28] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] *.sql: Add sql_log_bin=0 [software] - 10https://gerrit.wikimedia.org/r/657574 (owner: 10Marostegui)
[13:49:30] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2055 is OK: OK - load average: 57.02, 73.63, 66.87 https://wikitech.wikimedia.org/wiki/Swift
[13:49:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] *.sql: Add sql_log_bin=0 [software] - 10https://gerrit.wikimedia.org/r/657574 (owner: 10Marostegui)
[13:50:22] <wikibugs>	 (03Merged) 10jenkins-bot: *.sql: Add sql_log_bin=0 [software] - 10https://gerrit.wikimedia.org/r/657574 (owner: 10Marostegui)
[13:53:08] <icinga-wm>	 PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:42] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast3004.wikimedia.org
[13:54:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:10] <wikibugs>	 (03PS1) 10Mforns: Migrate SuggestedTagsAction to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657579 (https://phabricator.wikimedia.org/T267351)
[13:57:18] <wikibugs>	 (03PS1) 10Kormat: dbtools: Add sys/apply script [software] - 10https://gerrit.wikimedia.org/r/657581
[13:58:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Add comment in site.pp for former bastions [puppet] - 10https://gerrit.wikimedia.org/r/657583
[13:59:00] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] dbtools: Add sys/apply script [software] - 10https://gerrit.wikimedia.org/r/657581 (owner: 10Kormat)
[13:59:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add comment in site.pp for former bastions [puppet] - 10https://gerrit.wikimedia.org/r/657583 (owner: 10Muehlenhoff)
[14:00:04] <jouncebot>	 brennen and liw: How many deployers does it take to do Mediawiki train - American+European Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1400).
[14:00:46] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] dbtools: Add sys/apply script [software] - 10https://gerrit.wikimedia.org/r/657581 (owner: 10Kormat)
[14:03:07] <wikibugs>	 (03CR) 10Volans: "By any chance was https://github.com/camptocamp/puppet-ghostbuster evaluated/discarded for some reason?" [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond)
[14:04:31] * jbond42 doesn't want to look at the comment volans just posted :(
[14:04:51] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Docker registry needs cache to vary on Accept header value - https://phabricator.wikimedia.org/T242200 (10JMeybohm) 05Open→03Resolved The registry now responds properly with `vary: Accept`
[14:04:59] <wikibugs>	 (03CR) 10Ottomata: "Not opposed at all, but I'd expect many many clients to have a UA of 'python-requests', no?  Not just a malicious one?" [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey)
[14:05:05] <volans>	 jbond42: lol
[14:05:07] <wikibugs>	 10SRE, 10Traffic: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez)
[14:06:20] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Migrate SuggestedTagsAction to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657579 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns)
[14:06:59] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3004.wikimedia.org
[14:07:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:00] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast4002.wikimedia.org
[14:08:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:13] <wikibugs>	 10SRE, 10Traffic: Consolidate misc servers at edge sites - https://phabricator.wikimedia.org/T257323 (10MoritzMuehlenhoff)
[14:09:56] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is done.
[14:10:22] <wikibugs>	 (03PS3) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[14:10:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond)
[14:13:32] <wikibugs>	 (03CR) 10Jbond: utils::audit: add puppet audit script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond)
[14:13:46] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:13:49] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4002.wikimedia.org
[14:13:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:24] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast5001.wikimedia.org
[14:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:00] <godog>	 !log roll-restart swift-object in eqiad to apply new concurrency
[14:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:50] <wikibugs>	 (03PS1) 10JMeybohm: Demo - don't merge: Add a new listener to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/657591
[14:20:52] <wikibugs>	 (03PS1) 10JMeybohm: Demo - don't merge: Enable the service-proxy-demo listener for MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/657592
[14:20:54] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:00] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5001.wikimedia.org
[14:22:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:50] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27570/console" [puppet] - 10https://gerrit.wikimedia.org/r/657591 (owner: 10JMeybohm)
[14:25:14] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27571/console" [puppet] - 10https://gerrit.wikimedia.org/r/657592 (owner: 10JMeybohm)
[14:26:41] <godog>	 jouncebot: next
[14:26:41] <jouncebot>	 In 2 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1700)
[14:30:26] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:26] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:44] <icinga-wm>	 RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:27] <wikibugs>	 (03PS1) 10David Caro: config: allow using ~ for cookbook path [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608
[14:51:00] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::analytics::cluster::users: add analytics to the druid group [puppet] - 10https://gerrit.wikimedia.org/r/657547 (owner: 10Elukey)
[14:51:36] <wikibugs>	 (03PS1) 10David Caro: gitignore: add vim swap files [software/spicerack] - 10https://gerrit.wikimedia.org/r/657609
[14:54:18] <wikibugs>	 (03CR) 10Jbond: icinga: add wait_for_optimal function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond)
[14:55:06] <wikibugs>	 (03PS2) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273)
[14:55:08] <wikibugs>	 (03PS1) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273)
[14:55:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] config: allow using ~ for cookbook path [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 (owner: 10David Caro)
[14:56:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[14:58:22] <wikibugs>	 (03PS1) 10JMeybohm: Remove discovery.enabled from services (it's unused) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657613
[14:59:14] <wikibugs>	 (03PS2) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273)
[14:59:16] <wikibugs>	 (03PS3) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273)
[15:00:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[15:01:05] <wikibugs>	 (03CR) 10Volans: "All comments are the outcome of a chat with John" (034 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond)
[15:01:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove discovery.enabled from services (it's unused) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657613 (owner: 10JMeybohm)
[15:03:00] <wikibugs>	 (03CR) 10Elukey: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey)
[15:03:12] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Remove k8s::ssl [puppet] - 10https://gerrit.wikimedia.org/r/657615 (https://phabricator.wikimedia.org/T272559)
[15:06:28] <wikibugs>	 (03PS3) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273)
[15:06:30] <wikibugs>	 (03PS4) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273)
[15:06:57] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Its last usage was removed four years ago I6ad769d0225c4" [puppet] - 10https://gerrit.wikimedia.org/r/657615 (https://phabricator.wikimedia.org/T272559) (owner: 10Alexandros Kosiaris)
[15:08:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[15:10:29] <wikibugs>	 (03PS4) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273)
[15:10:31] <wikibugs>	 (03PS5) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273)
[15:11:22] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[15:11:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[15:12:21] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) @Volans pointed me towards [[ https://github.com/camptocamp/puppet-ghostbuster | puppet-ghostbuster ]].  I have run this locally with a tunnel to the puppetdb server and here a...
[15:12:50] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[15:12:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:17] <moritzm>	 !log installing cairo security updates on stretch
[15:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:24] <wikibugs>	 (03PS5) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273)
[15:13:26] <wikibugs>	 (03PS6) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273)
[15:16:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[15:18:07] <wikibugs>	 (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 (owner: 10David Caro)
[15:19:33] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[15:19:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for cairo [puppet] - 10https://gerrit.wikimedia.org/r/657621
[15:21:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[15:22:11] <wikibugs>	 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10RLazarus) Nice find! Thanks for tracking this down.
[15:25:09] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[15:25:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for cairo [puppet] - 10https://gerrit.wikimedia.org/r/657621 (owner: 10Muehlenhoff)
[15:26:16] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27577/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[15:26:45] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, why not :-)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 (owner: 10David Caro)
[15:27:15] <wikibugs>	 (03PS1) 10Anne Tomasevich: Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548)
[15:27:20] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/657609 (owner: 10David Caro)
[15:28:40] <wikibugs>	 (03PS2) 10Anne Tomasevich: Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548)
[15:29:11] <wikibugs>	 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10CBogen)
[15:31:04] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:34:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/657615 (https://phabricator.wikimedia.org/T272559) (owner: 10Alexandros Kosiaris)
[15:35:45] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10akosiaris)
[15:37:14] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10akosiaris) I've checked off stdlib and lvm classes as they are from external modules that have been imported to the tree as is (aka vendoring).
[15:38:12] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:43:34] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004896 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:45:32] <wikibugs>	 (03CR) 10Eric Gardner: [C: 03+1] Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548) (owner: 10Anne Tomasevich)
[15:45:50] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff)
[15:59:08] <_joe_>	 is someone looking at the puppet failures? I'm in a meeting rn
[15:59:20] <_joe_>	 oh it was a recovery
[15:59:55] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host krb2001.codfw.wmnet
[15:59:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:36] <wikibugs>	 (03PS2) 10Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596)
[16:03:38] <wikibugs>	 (03PS1) 10Ayounsi: Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748)
[16:04:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748) (owner: 10Ayounsi)
[16:04:20] <wikibugs>	 (03PS2) 10Ayounsi: Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748)
[16:04:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi)
[16:05:16] <icinga-wm>	 PROBLEM - Check the last execution of replicate-krb-database on krb1001 is CRITICAL: CRITICAL: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:05:37] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2001.codfw.wmnet
[16:05:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748) (owner: 10Ayounsi)
[16:06:32] <wikibugs>	 (03Merged) 10jenkins-bot: Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748) (owner: 10Ayounsi)
[16:07:14] <icinga-wm>	 PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:09:59] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet
[16:10:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:56] <icinga-wm>	 RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:12:10] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10dancy) Thanks Legoktm.  Small feature request: Can you add "last updated at <blah>" text to the top right corner of the page?
[16:13:36] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Remove discovery.enabled from services (it's unused) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657613 (owner: 10JMeybohm)
[16:14:10] <icinga-wm>	 RECOVERY - Check the last execution of replicate-krb-database on krb1001 is OK: OK: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:14:43] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet
[16:14:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:04] <wikibugs>	 (03Merged) 10jenkins-bot: Remove discovery.enabled from services (it's unused) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657613 (owner: 10JMeybohm)
[16:15:29] <wikibugs>	 (03PS1) 10Mforns: Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347)
[16:16:05] <wikibugs>	 (03PS1) 10Hnowlan: similar-users: remove unused releases, set log to DEBUG in staging, new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657631
[16:22:21] <wikibugs>	 (03CR) 10Gmodena: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657631 (owner: 10Hnowlan)
[16:24:52] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] similar-users: remove unused releases, set log to DEBUG in staging, new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657631 (owner: 10Hnowlan)
[16:26:22] <wikibugs>	 (03Merged) 10jenkins-bot: similar-users: remove unused releases, set log to DEBUG in staging, new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657631 (owner: 10Hnowlan)
[16:26:36] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) >>! In T179696#6765834, @dancy wrote: > Thanks Legoktm.  Small feature request: Can you add "last updated at > <blah>" text to the top righ...
[16:27:18] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[16:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:29:37] <wikibugs>	 (03CR) 10RLazarus: "LGTM -- please test with httpbb on one host before deploying everywhere, either before or after merging" [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[16:29:43] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[16:31:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:32:08] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:12] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m2 on db2133 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1344.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:32:26] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1358.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:32:59] <marostegui>	 checking
[16:35:15] <wikibugs>	 (03PS3) 10Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596)
[16:35:26] <jynus>	 must be something on 2133
[16:35:37] <marostegui>	 jynus: yes, I am on it
[16:37:12] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: m2 on db2133 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1061, Errmsg: Error Duplicate key name ix_user_user_text on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:37:25] <jynus>	 ^there you have it
[16:37:32] <marostegui>	 jynus: yes, I am on it
[16:39:12] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:45:37] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10bd808) > that are not used anywhere (including WMCS, you can use it too (an example).  Be aware that the puppet class/role reporting for Cloud VPS instances **//only//** reports the o...
[16:53:43] <wikibugs>	 (03PS1) 10Ottomata: Install python3-snappy for webperf navtiming [puppet] - 10https://gerrit.wikimedia.org/r/657639 (https://phabricator.wikimedia.org/T272613)
[16:54:18] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 4.904e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:54:41] <wikibugs>	 (03PS2) 10Cwhite: logstash: enable curator to accept custom age filters [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565)
[16:55:52] <wikibugs>	 (03CR) 10Gilles: [C: 03+1] Install python3-snappy for webperf navtiming [puppet] - 10https://gerrit.wikimedia.org/r/657639 (https://phabricator.wikimedia.org/T272613) (owner: 10Ottomata)
[16:57:11] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Install python3-snappy for webperf navtiming [puppet] - 10https://gerrit.wikimedia.org/r/657639 (https://phabricator.wikimedia.org/T272613) (owner: 10Ottomata)
[16:59:34] <wikibugs>	 (03CR) 10Ottomata: "Could we make this index non 'w3creportingapi' specific, and instead use it for any/all events that use Event Platform based event schemas" [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite)
[17:00:04] <jouncebot>	 jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1700).
[17:05:45] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) icinga::nsca::client is an example for something used in fundraising. the server is in production and the clients are in frack and that does not use the same puppetmaster
[17:09:02] <wikibugs>	 (03PS3) 10Jcrespo: admin: Add wikitrent to the list of privileged LDAP accounts [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489)
[17:10:10] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] admin: Add wikitrent to the list of privileged LDAP accounts [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo)
[17:12:52] <wikibugs>	 (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite)
[17:15:53] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "Ah ok, I think I had forgotten that.  We'll just have to figure out how to reconcile that `http` field between our event schemas and ECS. " [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite)
[17:16:54] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m2 on db2133 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:17:12] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: m2 on db2133 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:18:17] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) 05Open→03Resolved The extra wmf privileges have been deployed on LDAP for wikitrent. Reopen if you find any issues while using gerrit because of that.
[17:19:01] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Tchanders) Thanks @jcrespo
[17:23:57] <wikibugs>	 (03PS4) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[17:24:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond)
[17:28:29] <wikibugs>	 (03PS5) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[17:30:26] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond)
[17:31:02] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:31:03] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul)
[17:33:20] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] "It seems like the comments have been addressed/I saw a +1 in them so I'll merge this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876) (owner: 10Mstyles)
[17:33:40] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) have finished hacking with the audit script, this is the list produced by that script ` lines=5 alternatives::install apparmor::hardlink apt::noupgrade arclamp::profiler bacula...
[17:34:06] <wikibugs>	 (03PS1) 10Andrew Bogott: mwopenstackclients3.py: apply 70bade8f82a505b25e5cc1a09449dc6e0ebc34b6 to py3 [puppet] - 10https://gerrit.wikimedia.org/r/657668 (https://phabricator.wikimedia.org/T272553)
[17:34:08] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs-wikireplica-dns.py: Add support for db.svc.wikimedia.cloud. entries [puppet] - 10https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553)
[17:34:33] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) Tchanders, small followup- I understand the process may not be trivial for newcomers, but the simplification, before I edited, on the Engineering's handbook made us unable to proceed with th...
[17:34:52] <wikibugs>	 (03Merged) 10jenkins-bot: update flink config with swift and other values [deployment-charts] - 10https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876) (owner: 10Mstyles)
[17:34:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients3.py: apply 70bade8f82a505b25e5cc1a09449dc6e0ebc34b6 to py3 [puppet] - 10https://gerrit.wikimedia.org/r/657668 (https://phabricator.wikimedia.org/T272553) (owner: 10Andrew Bogott)
[17:34:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs-wikireplica-dns.py: Add support for db.svc.wikimedia.cloud. entries [puppet] - 10https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553) (owner: 10Andrew Bogott)
[17:35:28] <ryankemper>	 !log [wdqs] Depooled `wdqs1013` to allow it to catch up on lag
[17:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:39] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond)
[17:36:41] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[17:36:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:10] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:42] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10jcrespo) Hey, @JTannerWMF,  I tried to search on my own for your LDAP/Developer account, but the one you provided (JTanner (WMF)) doesn't exist. I am adding @ggellerman to the ticket (ap...
[17:39:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:40:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Tchanders) >>! In T272489#6766201, @jcrespo wrote: > Tchanders, small followup- I understand the process may not be trivial for newcomers, but the simplification, before I edited, on the Engineering'...
[17:41:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:42:07] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:42] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS high update lag on wdqs1013 is CRITICAL: 4.386e+04 ge 4.32e+04 Ryan Kemper Affected node has been depooled while it catches up on 12h of update lag: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1611196826628&to=1611251346414 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wiki
[17:44:42] <icinga-wm>	 e?orgId=1&panelId=8&fullscreen
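Note on the numbers in the WDQS alerts above: a minimal sketch, assuming the alert values are update lag in seconds and that "ge" means "greater than or equal to" (neither is stated in the alert text itself). Under that assumption the 4.32e+04 threshold is exactly the 12h mentioned in the acknowledgement. The helper name `seconds_to_hours` is made up for illustration.

```python
# Convert the lag figures quoted in the icinga-wm messages to hours.
# Assumption: the raw values are seconds of update lag.

def seconds_to_hours(seconds: float) -> float:
    """Return the given duration in hours."""
    return seconds / 3600.0

threshold_s = 4.32e4     # critical threshold quoted in both alerts
first_alert_s = 4.904e4  # value in the 16:54:18 PROBLEM message
ack_s = 4.386e4          # value in the 17:44:42 ACKNOWLEDGEMENT message

for label, value in [
    ("threshold", threshold_s),
    ("first alert", first_alert_s),
    ("at acknowledgement", ack_s),
]:
    state = "CRITICAL" if value >= threshold_s else "OK"
    print(f"{label}: {value:.0f} s = {seconds_to_hours(value):.2f} h ({state})")
```

Read this way, the lag dropped between the two messages but was still above the threshold when the alert was acknowledged.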
[17:45:38] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs-wikireplica-dns.py: Add support for db.svc.wikimedia.cloud. entries [puppet] - 10https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553)
[17:45:40] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs-wikireplica-dns.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/657671 (https://phabricator.wikimedia.org/T272553)
[17:46:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:48:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:56:02] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul)
[18:00:04] <jouncebot>	 chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1800).
[18:02:55] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[18:02:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:03] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10elukey) `statistics-privatedata-users` is deprecated, let's use `analytics-privatedata-users` (need @Ottomata's approval)
[18:08:12] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:32] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[18:08:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:45] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) a:05Ottomata→03elukey
[18:12:51] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:18] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:14:26] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[18:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:13] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:15:56] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:18:54] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "This looks good, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite)
[18:19:48] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:21:13] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:54] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10KFrancis) @jcrespo The NDA is out for signatures.  I will confirm when it's complete.  Thanks!
[18:23:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10Ottomata) Approved.
[18:26:10] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) Oo we'll also want eventstreams-internal.svc.* LVS set up too.
[18:28:03] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo)
[18:29:02] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) >>! In T272559#6766020, @Dzahn wrote: > icinga::nsca::client is an example for something used in fundraising. the server is in production and the clients are in frack and that...
[18:30:26] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:33:21] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2371.codfw.wmnet with reason: REIMAGE
[18:33:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2373.codfw.wmnet with reason: REIMAGE
[18:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10ggellerman) Thanks, @jcrespo  !  I have added @JKatzWMF   who is Jazmin's manager now.  Would you please let me know which records still list me as @JTannerWMF  's manager so that I can...
[18:34:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2375.codfw.wmnet with reason: REIMAGE
[18:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:22] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2371.codfw.wmnet with reason: REIMAGE
[18:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:28] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2375.codfw.wmnet with reason: REIMAGE
[18:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:13] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:37:10] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Andrew)
[18:37:27] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2373.codfw.wmnet with reason: REIMAGE
[18:37:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:35] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove the 'letsencrypt' module [puppet] - 10https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: 10Andrew Bogott)
[18:38:55] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[18:42:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) No approvals needed, this is just an ssh change (no permission changes) I only need to verify identity of requester and we should be done.  @calbon @ACraze can we...
[18:43:12] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) Thanks for the update, looking forward to the process being complete. Thanks to you!
[18:44:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:44:58] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10jcrespo) @ggellerman Apologies for the mistake, I checked corporate ldap records, the one used for Google account authentication. Not sure if it is also used for some of the other hr too...
[18:46:58] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo)
[18:47:06] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder: comment out the memcached servers for keystone authtoken [puppet] - 10https://gerrit.wikimedia.org/r/657673 (https://phabricator.wikimedia.org/T272113)
[18:48:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cinder: comment out the memcached servers for keystone authtoken [puppet] - 10https://gerrit.wikimedia.org/r/657673 (https://phabricator.wikimedia.org/T272113) (owner: 10Andrew Bogott)
[18:48:18] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27578/console" [puppet] - 10https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[18:48:55] <wikibugs>	 (03PS1) 10Jcrespo: admin: Update ssh key for accraze [puppet] - 10https://gerrit.wikimedia.org/r/657674 (https://phabricator.wikimedia.org/T272541)
[18:49:06] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2226.codfw.wmnet with reason: REIMAGE
[18:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:07] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2226.codfw.wmnet with reason: REIMAGE
[18:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:32] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes - razzi@cumin1001
[18:53:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:11] <icinga-wm>	 PROBLEM - Host mw2375 is DOWN: PING CRITICAL - Packet loss = 100%
[18:54:29] <wikibugs>	 (03PS5) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560)
[18:55:14] <wikibugs>	 (03CR) 10Joal: "Thanks for the explanation elukey - should be ok now :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[18:55:49] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2371.codfw.wmnet'] `  an...
[18:56:05] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2375 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:56:07] <icinga-wm>	 RECOVERY - Host mw2375 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms
[18:56:44] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2373.codfw.wmnet'] `  an...
[18:56:55] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) p:05Triage→03High
[18:58:21] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2375.codfw.wmnet'] `  an...
[18:59:07] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1900).
[19:00:05] <jouncebot>	 ottomata, hmonroy, and mforns: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:22] <mforns>	 here :]
[19:00:26] <Urbanecm>	 I can deploy today :)
[19:00:34] <mforns>	 I also represent ottomata from my team
[19:00:44] <Urbanecm>	 ack
[19:00:48] <Urbanecm>	 hmonroy: are you here?
[19:00:57] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:01:43] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[19:01:44] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "B&C" [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657391 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata)
[19:01:47] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[19:01:47] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:02:01] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "B&C" [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657392 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata)
[19:02:13] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:02:34] <Urbanecm>	 mforns: do the config patches depend on the backports (ie. can I deploy them before)?
[19:02:42] <wikibugs>	 (03PS1) 10Jeena Huneidi: rdf-streaming-updater: Increment version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657676
[19:02:45] <hmonroy>	 yes, I'm here
[19:02:51] <mforns>	 Urbanecm: you can deploy them before
[19:02:59] <Urbanecm>	 ack, thanks
[19:03:06] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns)
[19:03:43] <wikibugs>	 (03PS1) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677
[19:03:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Migrate SuggestedTagsAction to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657579 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns)
[19:03:51] <wikibugs>	 (03CR) 10Urbanecm: Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns)
[19:04:48] <wikibugs>	 (03Merged) 10jenkins-bot: Migrate SuggestedTagsAction to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657579 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns)
[19:05:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm)
[19:05:17] <Urbanecm>	 mforns: please test 657579: Migrate SuggestedTagsAction to Event Platform on all wikis | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/657579 at mwdebug1001
[19:05:38] <Urbanecm>	 hi hmonroy, will ping you once your patches are ready :)
[19:05:48] <hmonroy>	 urbanecm: thank you!
[19:05:52] <mforns>	 Urbanecm: doing
[19:05:59] <mforns>	 thanks a lot Urbanecm 
[19:06:07] <wikibugs>	 (03PS2) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677
[19:06:17] <Urbanecm>	 np :)
[19:06:51] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2371.codfw.wmnet
[19:06:56] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2373.codfw.wmnet
[19:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2375.codfw.wmnet
[19:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:35] <wikibugs>	 (03CR) 10Dzahn: "This wouldn't have been true for me personally, fwiw." [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm)
[19:11:15] <Urbanecm>	 mforns: how is it going? anything i can help with?
[19:11:48] <mforns>	 Urbanecm: I was looking at Kafka to see if events are flowing in; I cannot see them, but that might be because the stream is low-throughput
[19:12:06] <mforns>	 I have never used mwdebug1001 to test, will try now
[19:12:20] <Urbanecm>	 mforns: ah, that's because the change is not yet deployed
[19:12:27] <mforns>	 oh, ok ok
[19:13:17] <Urbanecm>	 you need to install an extension from https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions to your browser, enable it, pick mwdebug1001 as your server, and do something on-wiki to make the server send the event, and then you can see it in Kafka
[19:13:25] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "cinder: comment out the memcached servers for keystone authtoken" [puppet] - 10https://gerrit.wikimedia.org/r/657650
[19:13:56] <Urbanecm>	 mforns: it's a way to test a change before actually pushing it live and affecting everyone else, to make sure it doesn't bring us down or cause other bad things
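Note on the testing steps described above: a minimal sketch of what routing a single request to the debug server could look like outside the browser. It assumes the browser extension works by sending an `X-Wikimedia-Debug` request header with a `backend=<host>` value; the header format, the fully qualified host name `mwdebug1001.eqiad.wmnet`, the target URL, and the helper name `wikimedia_debug_headers` are all assumptions for illustration, not taken from the log. The request is only built, not sent, so the sketch runs offline.

```python
import urllib.request

def wikimedia_debug_headers(backend: str) -> dict:
    """Build the (assumed) header that selects a debug backend."""
    return {"X-Wikimedia-Debug": f"backend={backend}"}

# Hypothetical target: any page on a wiki where the change should apply.
url = "https://test.wikipedia.org/wiki/Special:BlankPage"

request = urllib.request.Request(
    url,
    headers=wikimedia_debug_headers("mwdebug1001.eqiad.wmnet"),
)

# Inspect what would be sent; urllib.request.urlopen(request) would send it.
for name, value in request.header_items():
    print(f"{name}: {value}")
```

The browser extension remains the route described in the conversation; a scripted request like this is only useful for repeatable checks.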
[19:14:06] <mforns>	 of course
[19:14:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "cinder: comment out the memcached servers for keystone authtoken" [puppet] - 10https://gerrit.wikimedia.org/r/657650 (owner: 10Andrew Bogott)
[19:14:35] <mforns>	 I wasn't aware this was a requirement
[19:15:01] <mforns>	 Urbanecm: please feel free to cancel those patches if they are blocking the window
[19:15:24] <Urbanecm>	 does that mean there's an issue with testing them mforns ?
[19:15:28] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) @Lea_WMDE To speed up access, could you come back to me about questions at T271725#6755696. Interns and researchers, in our best practices, have a time-bound...
[19:16:12] <ottomata>	 Urbanecm:  btw i will be available in 15 mins and can help with testing both of these things
[19:16:16] <ottomata>	 mforns:  ^
[19:16:20] <mforns>	 Urbanecm: I assume it will take a while, but if it's not blocking the window I'm trying
[19:16:30] <mforns>	 ok, thanks!
[19:17:21] <Urbanecm>	 mforns: it's all right, we have time :). Sorry, I thought you were familiar with the process; I would have made it clearer otherwise :)
[19:17:35] <mforns>	 no problem, thanks!
[19:18:42] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "docs-only change, no-op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651890 (https://phabricator.wikimedia.org/T255790) (owner: 10Samwilson)
[19:18:47] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] rdf-streaming-updater: Increment version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657676 (owner: 10Jeena Huneidi)
[19:19:39] <wikibugs>	 (03Merged) 10jenkins-bot: Add notes about load order of Wikisource and Collection extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651890 (https://phabricator.wikimedia.org/T255790) (owner: 10Samwilson)
[19:19:45] <Urbanecm>	 (and also ack to otto.mata's msg)
[19:19:53] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I verified new key on a videocall, with Calbon confirming identity of Andy and Andy confirming the key's hash." [puppet] - 10https://gerrit.wikimedia.org/r/657674 (https://phabricator.wikimedia.org/T272541) (owner: 10Jcrespo)
[19:20:25] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: Increment version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657676 (owner: 10Jeena Huneidi)
[19:20:27] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10ggellerman) Thanks, @jcrespo !  I'll ask IT if they can update ldap records to reflect @JTannerWMF 's current manager.
[19:21:06] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) I don't know the answer to that question. Things may have changed over time.  We'd have to ask frack people like Jeff Green.
[19:21:10] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/657674 (https://phabricator.wikimedia.org/T272541) (owner: 10Jcrespo)
[19:21:48] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 0b46c9f1f75fc773f57bfa70521c9eaf20410b9e: [no-op] Add notes about load order of Wikisource and Collection extensions (T255790) (duration: 01m 11s)
[19:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:52] <stashbot>	 T255790: Wikisource: Replace ElectronPDF with WSExport PDF support - https://phabricator.wikimedia.org/T255790
[19:21:56] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] admin: Update ssh key for accraze [puppet] - 10https://gerrit.wikimedia.org/r/657674 (https://phabricator.wikimedia.org/T272541) (owner: 10Jcrespo)
[19:22:09] <Urbanecm>	 hmonroy: fyi, your docs-only change is merged
[19:22:22] <hmonroy>	 cool!
[19:24:43] <Urbanecm>	 mforns: I tried to submit an event via mwdebug1001 to help you test it; it sent a POST request to intake-analytics.wikimedia.org
[19:25:13] <mforns>	 Urbanecm: trying to do the same here
[19:25:34] <mforns>	 Urbanecm: I think I managed now :]
[19:25:40] <Urbanecm>	 mforns: great :)
[19:26:37] <mforns>	 Urbanecm: yes, the event did come in to kafka :]
[19:26:44] <Urbanecm>	 great!
[19:26:45] <Urbanecm>	 syncing it then :)
[19:26:48] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10jcrespo) @ggellerman I just saw another mistake on the corporate ldap not being up to date in terms of management, so I will stop using it to locate managers and use some of the hr tools...
[19:27:11] <mforns>	 Urbanecm: cool! thanks a lot for the patience
[19:27:38] <Urbanecm>	 no problem :)
[19:28:26] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 62c9c35a76e2d065922f8c9f5a58672240dea7de: Migrate SuggestedTagsAction to Event Platform on all wikis (T267351) (duration: 01m 03s)
[19:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:30] <stashbot>	 T267351: SuggestedTagsAction Event Platform Migration - https://phabricator.wikimedia.org/T267351
[19:29:01] <Urbanecm>	 mforns: should be live :)
[19:29:12] <mforns>	 Urbanecm: ok! checking
[19:29:35] <Urbanecm>	 mforns: I'll deploy hmonroy's patch, which appears to be simpler, now :)
[19:29:42] <mforns>	 ok
[19:29:43] <wikibugs>	 (03PS3) 10Urbanecm: Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) (owner: 10Tpt)
[19:29:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) (owner: 10Tpt)
[19:29:48] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27580/console" [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[19:30:39] <wikibugs>	 (03Merged) 10jenkins-bot: Refactor EventLogging Event Platform PHP integration [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657391 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata)
[19:31:00] <wikibugs>	 (03Merged) 10jenkins-bot: Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) (owner: 10Tpt)
[19:31:49] <Urbanecm>	 hmonroy: your patch is available at mwdebug1001 for testing
[19:31:59] <hmonroy>	 checking
[19:32:08] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10ggellerman) No need to apologize, @jcrespo  - you surfaced something that I did not know about that needs to be fixed.  I thank you for that :)
[19:33:21] <ottomata>	 OoOOk ! hello!
[19:33:25] <ottomata>	 mforns:  where we at?  
[19:33:44] <mforns>	 in the middle
[19:33:46] <mforns>	 :
[19:33:47] <mforns>	 :]
[19:33:53] <Urbanecm>	 ottomata:  657579 Migrate SuggestedTagsAction to Event Platform on all wikis is deployed, currently deploying some other (unrelated) patch to give you time to appear :)
[19:34:02] <wikibugs>	 (03Merged) 10jenkins-bot: Fix possible undefined index warning in arg checking in EventServiceClient [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657392 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata)
[19:34:07] <Urbanecm>	 backports are merged, I'll pull them to mwdebug1002 so you can test
[19:34:11] <ottomata>	 nice
[19:34:12] <ottomata>	 ok
[19:35:04] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) >>! In T272559#6766195, @jbond wrote: > have finished hacking with the audit script, this is the list produced by that script  > diamond > diamond::collector > diamond::collect...
[19:35:07] <Urbanecm>	 ottomata: mforns: ok, your backports are at mwdebug1002 for testing :)
[19:35:17] <Urbanecm>	 (both of them)
[19:35:20] <ottomata>	 testing on mwdebug1002
[19:35:57] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) ^ Those are just the ones that stood out to me from the list. I have not gone through the others. But it seems to me there are a lot of false positives here. Please don't delet...
[19:36:18] <Urbanecm>	 hmonroy: how is it going with your patch? :)
[19:37:32] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2226.codfw.wmnet'] `  an...
[19:37:57] <ottomata>	 Urbanecm:  tested, works perfect.  
[19:38:04] <Urbanecm>	 thanks, syncing it out then :)
[19:38:08] <hmonroy>	 Urbanecm: Looks good! I just checked with the team to make sure it is working as expected.
[19:38:13] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2226.codfw.wmnet
[19:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:19] <Urbanecm>	 thanks hmonroy, will sync too :)
[19:38:30] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2371.codfw.wmnet
[19:38:36] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2373.codfw.wmnet
[19:38:42] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2375.codfw.wmnet
[19:38:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:27] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2226.codfw.wmnet
[19:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:58] <icinga-wm>	 PROBLEM - Host es2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:40:02] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 9.913 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[19:40:12] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/EventLogging/: ee830a5ec2051fa970084e89b477a44c384e309c: f7152a74e00404fc561c44d1c2e37d7f882e2f52: EventLogging backport, see commits for details (T253121) (duration: 01m 05s)
[19:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:19] <stashbot>	 T253121: MEP Client MediaWiki PHP - https://phabricator.wikimedia.org/T253121
[19:40:20] <Urbanecm>	 ottomata: mforns: backports deployed
[19:41:51] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 4bb9e5d13be702516368774732a9e1711bec42e5: Enables the Wikisource extension on oldwikisource (T272163) (duration: 01m 04s)
[19:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:55] <stashbot>	 T272163: Install the Wikisource extension on oldwikisource - https://phabricator.wikimedia.org/T272163
[19:42:00] <Urbanecm>	 hmonroy: and deployed :)
[19:42:24] <wikibugs>	 (03PS2) 10Urbanecm: Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns)
[19:42:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns)
[19:42:49] <ottomata>	 Urbanecm:  woohoo thank you
[19:42:59] <Urbanecm>	 no problem :)
[19:43:18] <wikibugs>	 (03Merged) 10jenkins-bot: Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns)
[19:44:19] <Urbanecm>	 ottomata: mforns: 657630: Migrate WebUIActionsTracking schemas to Event Platform on testwiki | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/657630 is at mwdebug1002 for testing :)
[19:44:27] <mforns>	 Urbanecm: ok! on it
[19:45:03] <Urbanecm>	 mforns: thanks, let me know if there's something I can help you with :)
[19:45:11] <mforns>	 ok :]
[19:45:48] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[19:45:58] <icinga-wm>	 RECOVERY - Host es2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.96 ms
[19:47:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:47:19] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) >>! In T179696#6765834, @dancy wrote: > Thanks Legoktm.  Small feature request: Can you add "last updated at > <blah>" text to the top...
[19:47:19] <mforns>	 1/2 schemas tested
[19:47:31] <Urbanecm>	 ack
[19:49:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:49:01] <wikibugs>	 (03PS1) 10Legoktm: docker_registry_ha: Add timestamp to build-homepage output [puppet] - 10https://gerrit.wikimedia.org/r/657678 (https://phabricator.wikimedia.org/T179696)
[19:49:02] <ottomata>	 (afk for a bit!  )
[19:49:18] <Urbanecm>	 ack
[19:49:59] <mforns>	 Urbanecm: both schemas tested, working!
[19:50:07] <Urbanecm>	 mforns: great, syncing :)
[19:51:42] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ac99da75f9507e19472ab3020be638262857ec07: Migrate WebUIActionsTracking schemas to Event Platform on testwiki (T267347; T271164) (duration: 01m 03s)
[19:51:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:49] <stashbot>	 T267347: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347
[19:51:49] <stashbot>	 T271164: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164
[19:51:53] <Urbanecm>	 mforns: that should be all :). Anything else?
[19:52:16] <mforns>	 Urbanecm: don't think so! thanks a lot for showing me how to test :]
[19:52:23] <Urbanecm>	 happy to help :)
[19:53:14] <hmonroy>	 Urbanecm: Thank you!
[19:53:20] <Urbanecm>	 no problem :)
[20:00:04] <jouncebot>	 brennen and liw: (Dis)respected human, time to deploy Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T2000). Please do the needful.
[20:03:35] <wikibugs>	 (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657679
[20:03:37] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657679 (owner: 10Brennen Bearnes)
[20:04:17] <wikibugs>	 (03PS6) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[20:04:28] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657679 (owner: 10Brennen Bearnes)
[20:04:41] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes - razzi@cumin1001
[20:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:02] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.27
[20:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:36] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jcrespo) Comments for persistence-related modules: but please @Marostegui @Kormat comment too.  * profile::proxysql I wrote this for deployment of proxysql. While it is basic, it is f...
[20:13:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo)
[20:15:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) 05Open→03Resolved a:03jcrespo Change was merged and should have been applied to all servers now. Reopen if you find any issues accessing the production cluster.
[20:19:50] <wikibugs>	 (03PS7) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[20:20:12] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 1.645e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:22:58] <wikibugs>	 (03PS1) 10Ottomata: Finalize QuickSurvey* Event Platform migration [puppet] - 10https://gerrit.wikimedia.org/r/657681 (https://phabricator.wikimedia.org/T271165)
[20:24:23] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) Thanks for the review @Dzahn this is helping get rid of some false positives in the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/657571 | audit script ]].  i went th...
[20:25:20] <wikibugs>	 (03PS2) 10Ottomata: Finalize QuickSurvey* Event Platform migration [puppet] - 10https://gerrit.wikimedia.org/r/657681 (https://phabricator.wikimedia.org/T271165)
[20:33:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:34:17] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/657678 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm)
[20:36:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:38:22] <wikibugs>	 (03CR) 10RLazarus: "I think this is a good idea -- just minor comments on the implementation." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm)
[20:40:47] <wikibugs>	 (03PS8) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[20:46:26] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Finalize QuickSurvey* Event Platform migration [puppet] - 10https://gerrit.wikimedia.org/r/657681 (https://phabricator.wikimedia.org/T271165) (owner: 10Ottomata)
[20:56:34] <wikibugs>	 (03PS2) 10Ottomata: Remove wgEventLoggingSchemas ContentTranslationAbuseFilter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639579 (https://phabricator.wikimedia.org/T259163)
[20:57:52] <wikibugs>	 (03Abandoned) 10Ottomata: Remove wgEventLoggingSchemas ContentTranslationAbuseFilter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639579 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata)
[21:01:17] <wikibugs>	 10SRE, 10Traffic, 10serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (10Dzahn) `hieradata/role/common/cache/text.yaml` has:  `   60   helm-charts.wikimedia.org:  61     caching: 'normal' `  That should confirm that it is indeed the 24...
[21:01:28] <chrisalbon>	 is there a reason bast4002.wikimedia.org is unreachable to me?
[21:01:53] <chrisalbon>	 https://www.irccloud.com/pastebin/k8iiOPV2/
[21:02:27] <rzl>	 chrisalbon: bast4003 is the new hotness
[21:03:10] <rzl>	 should be a quick update to your ssh config, https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast4003.wikimedia.org has the correct fingerprint
[21:03:56] <chrisalbon>	 okay whew cool
[21:04:18] <chrisalbon>	 I thought it was because I upgraded to Ubuntu 20 or something and was like "NoooooooOOOooOoooo"
[21:04:22] <chrisalbon>	 Thanks rzl
[21:04:33] <rzl>	 👍
[21:08:13] <wikibugs>	 (03PS4) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712)
[21:08:43] <wikibugs>	 (03CR) 10Jforrester: "Good to go?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[21:09:59] <wikibugs>	 10SRE, 10Traffic, 10serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (10CDanis) >>! In T272633#6766881, @Dzahn wrote: > An easy way to do this would be to just switch 'normal' to 'pass' here. Then there would be no caching at all.  We...
[21:24:30] <brennen>	 jouncebot now
[21:24:30] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T2000)
[21:24:35] <brennen>	 rollin' back.
[21:25:41] <rzl>	 brennen: sorry to hear, shout if you need anything from SRE
[21:25:50] <brennen>	 rzl: thanks, will do.
[21:27:11] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.36.0-wmf.26
[21:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:17] <wikibugs>	 (03PS1) 10Ottomata: Remove migrated EventLoggingSchemas overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657688 (https://phabricator.wikimedia.org/T259163)
[21:28:47] <wikibugs>	 (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.36.0-wmf.27" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657689
[21:28:49] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.36.0-wmf.27" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657689 (owner: 10Brennen Bearnes)
[21:29:48] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "all wikis to 1.36.0-wmf.27" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657689 (owner: 10Brennen Bearnes)
[21:30:20] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:25] <brennen>	 hrm, looks like the error spike i was just seeing probably isn't train-related, but i will dig a bit before rolling back to group2.
[21:32:53] <wikibugs>	 (03CR) 10Ottomata: "To be deployed on Monday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657688 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata)
[21:38:40] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:48] <wikibugs>	 (03CR) 10Daimona Eaytoy: "> Patch Set 4:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[21:47:15] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] wmcs-wikireplica-dns.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/657671 (https://phabricator.wikimedia.org/T272553) (owner: 10Andrew Bogott)
[21:48:07] <wikibugs>	 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) @elukey it works!  I realized that since this service is not proxied via...
[21:49:48] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "Definitely a hack-ish way to handle it, but this script is 99% for the wikireplicas and not even 1% a few other things that we thought of." [puppet] - 10https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553) (owner: 10Andrew Bogott)
[21:56:36] <wikibugs>	 (03PS1) 10DLynch: Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191)
[21:57:00] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.867 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[21:58:11] <brennen>	 Jdlrobson: about?
[22:05:08] <wikibugs>	 (03PS1) 10Aklapper: mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - 10https://gerrit.wikimedia.org/r/657692
[22:06:05] <wikibugs>	 (03PS2) 10Aklapper: mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - 10https://gerrit.wikimedia.org/r/657692
[22:06:57] <wikibugs>	 (03PS9) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[22:07:58] <wikibugs>	 (03PS2) 10A2569875: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612)
[22:09:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-wikireplica-dns.py: Add support for db.svc.wikimedia.cloud. entries [puppet] - 10https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553) (owner: 10Andrew Bogott)
[22:10:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-wikireplica-dns.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/657671 (https://phabricator.wikimedia.org/T272553) (owner: 10Andrew Bogott)
[22:10:08] <brennen>	 !log 1.36.0-wmf.27 train status: for avoidance of doubt, no deploys until further notice - sorting out T272638
[22:10:10] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs-wikireplica-dns.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/657671 (https://phabricator.wikimedia.org/T272553)
[22:10:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:12] <stashbot>	 T272638: TypeError: null is not an object (evaluating 't[e.title]')  on mobile domain - https://phabricator.wikimedia.org/T272638
[22:17:23] <wikibugs>	 (03PS10) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[22:23:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 149 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:23:51] <wikibugs>	 (03CR) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm)
[22:23:59] <wikibugs>	 (03PS3) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677
[22:25:08] <wikibugs>	 (03PS11) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[22:25:36] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:30:54] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:32:24] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[22:32:38] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[22:36:55] <wikibugs>	 (03PS12) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[22:37:48] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:29] <James_F>	 brennen: OK for me to sling out a beta config patch?
[22:39:57] <brennen>	 James_F: that's fine, but give me one sec to clear things
[22:40:23] <brennen>	 James_F: you're clear
[22:42:12] <wikibugs>	 (03PS5) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712)
[22:42:19] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[22:43:05] <wikibugs>	 (03Merged) 10jenkins-bot: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[22:45:57] <James_F>	 brennen: All done, thanks.
[22:46:57] <brennen>	 James_F: ack, thanks.
[22:46:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:49:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:50:08] <wikibugs>	 (03PS1) 10Urbanecm: wgAbuseFilterAflFilterMigrationStage: Set READ_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657694 (https://phabricator.wikimedia.org/T269712)
[22:50:15] <wikibugs>	 (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657695 (https://phabricator.wikimedia.org/T269712)
[22:50:17] <wikibugs>	 (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712)
[22:50:19] <wikibugs>	 (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712)
[22:51:23] <Urbanecm>	 sorry, didn't know you were uploading the patches James_F
[22:51:24] <James_F>	 Urbanecm: Pah. :-)
[22:52:30] <wikibugs>	 (03Abandoned) 10Urbanecm: wgAbuseFilterAflFilterMigrationStage: Set READ_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657694 (https://phabricator.wikimedia.org/T269712) (owner: 10Urbanecm)
[22:53:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "Do not merge until it's actually changed in the AbuseFilter's repo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[22:53:53] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "do not merge until we're sure the new schema doesn't cause any issues" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[22:54:09] <James_F>	 Thanks.
[22:54:15] <James_F>	 Was looking for the AF patch.
[22:54:19] <Urbanecm>	 just placed a procedural -2 to avoid bad things happening
[22:54:24] <Urbanecm>	 I don't think there is any
[22:54:34] <James_F>	 Yeah, will write one.
[22:57:05] <wikibugs>	 (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712)
[22:57:16] <wikibugs>	 (03CR) 10Volans: "Quick first pass" (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[22:57:30] <wikibugs>	 (03CR) 1020after4: [C: 03+1] "+1 because I can't +2" [puppet] - 10https://gerrit.wikimedia.org/r/657692 (owner: 10Aklapper)
[23:01:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "this sounds like a good idea" [puppet] - 10https://gerrit.wikimedia.org/r/657692 (owner: 10Aklapper)
[23:02:56] <wikibugs>	 (03CR) 10RLazarus: "Thanks! Almost there, IMO." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm)
[23:05:46] <legoktm>	 rzl: will icinga handle extra lines as long as the first one starts with UNKNOWN?
[23:06:33] <legoktm>	 or is that determination solely exit code based?
[23:06:45] <rzl>	 I *think* it's just the exit code but I'm not positive
[23:07:09] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] Update to Netbox 2.10.3-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/657454 (owner: 10CRusnov)
[23:07:12] <wikibugs>	 10SRE, 10DBA, 10Phabricator: Grant phstats user SELECT rights to phstats user - https://phabricator.wikimedia.org/T272654 (10Urbanecm)
[23:07:47] <wikibugs>	 (03PS5) 10Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476)
[23:08:15] <wikibugs>	 (03CR) 10Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm)
[23:08:22] <wikibugs>	 10SRE, 10DBA, 10Phabricator: Grant phstats user SELECT rights to phstats user - https://phabricator.wikimedia.org/T272654 (10Urbanecm)
[23:08:45] <wikibugs>	 (03PS3) 10Urbanecm: mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - 10https://gerrit.wikimedia.org/r/657692 (https://phabricator.wikimedia.org/T272654) (owner: 10Aklapper)
[23:10:36] <wikibugs>	 (03PS1) 10CDanis: tweak User-Agent for bot_posts_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/657700 (https://phabricator.wikimedia.org/T272330)
[23:10:58] <wikibugs>	 10SRE, 10DBA, 10Phabricator, 10Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (10Urbanecm)
[23:12:17] <legoktm>	 https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/pluginapi.html suggests it's just exit code and multiple lines are OK
[23:13:24] <rzl>	 yeah -- no guarantee icinga does exactly the same thing, and for some reason I can't find anything about it in the icinga docs, but I think it's likely correct
[23:13:39] <rzl>	 I mean, I'm sure the answer to this is known, I just don't know it :D
[23:15:02] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "0 tests failed, 0 tests skipped, 26 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/657700 (https://phabricator.wikimedia.org/T272330) (owner: 10CDanis)
[23:15:22] <rzl>	 oh cdanis is about, I bet he knows
[23:15:37] <rzl>	 I figured he'd be done for the day but now he's outed himself, the fool
[23:20:01] <cdanis>	 oh no
[23:20:10] <wikibugs>	 (03PS6) 10Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476)
[23:20:13] <cdanis>	 re: icinga, yes, it is just the exit code that matters
[23:20:31] <cdanis>	 I don't believe the textual output matters at all, aside from being shown in the UI
[23:20:45] <legoktm>	 thanks
[23:20:50] <rzl>	 👍
[23:21:01] <wikibugs>	 (03PS4) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677
[23:26:09] <Jdlrobson>	 brennen: thcipriani hey
[23:26:14] <Jdlrobson>	 sorry about the delay
[23:26:36] <brennen>	 Jdlrobson: hey, wb.  sorry for the dental appointment interruption.
[23:26:49] <Jdlrobson>	 so the issue is only an issue if we rollback
[23:26:54] <Jdlrobson>	 which hopefully we won't do
[23:26:57] <Jdlrobson>	 i can prepare a fix now
[23:27:05] <Jdlrobson>	 but maybe it's too late to roll the train forward?
[23:27:35] <brennen>	 we're technically past the cutoff, but this is always a judgment call.  in practice i'd rather it be fully deployed than left in a split state over the weekend.
[23:28:02] <brennen>	 i _would_ like to avoid having to roll back after some window of time and then having things in a much more broken state than they are currently, though.
[23:29:09] <brennen>	 i don't think that's super likely, but if a fix is quick i think i'm ok slinging it out and then rolling forward yet this afternoon.
[23:29:26] <brennen>	 ...otherwise i guess i welcome advice.
[23:29:46] <Jdlrobson>	 i think it's okay to roll forward
[23:29:52] <Jdlrobson>	 the patch i need to write is going to be super trivial
[23:30:06] <Jdlrobson>	 if we need to roll back, and i'll be around for the next 3 hrs, we can apply my patch
[23:30:39] <wikibugs>	 (03CR) 10Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm)
[23:31:16] <wikibugs>	 (03PS3) 10Bstorm: wikireplicas: add a multiinstance role for the dedicated analytics host [puppet] - 10https://gerrit.wikimedia.org/r/654558 (https://phabricator.wikimedia.org/T269211)
[23:31:19] <brennen>	 Jdlrobson: k.  let's go ahead and give it a shot.
[23:31:56] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:54] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] "I'll merge this, since it isn't connected to any hosts now. Whenever we want to add the needed hiera and connect this to the host, I'll le" [puppet] - 10https://gerrit.wikimedia.org/r/654558 (https://phabricator.wikimedia.org/T269211) (owner: 10Bstorm)
[23:33:05] <wikibugs>	 (03CR) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm)
[23:33:06] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:33:09] <Jdlrobson>	 brennen: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/657702 is the patch
[23:33:24] <Jdlrobson>	 probably best to backport that now
[23:33:31] <Jdlrobson>	 so that if we do rollback it's straightforward
[23:34:01] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:34:52] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:34:58] <brennen>	 Jdlrobson: is that testable on an mwdebug?
[23:35:00] <Jdlrobson>	 On the plus side this is the biggest test of our error logging tracking at 105,965 errors in the last 12hrs
[23:35:09] <Jdlrobson>	 brennen: yes
[23:35:14] <Jdlrobson>	 i can test it on mwdebug
[23:35:31] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:35:36] <wikibugs>	 (03PS1) 10Brennen Bearnes: Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638)
[23:35:48] <wikibugs>	 (03PS1) 10DLynch: A/B test output when a specific feature is being tested [extensions/DiscussionTools] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657653 (https://phabricator.wikimedia.org/T268191)
[23:35:52] <brennen>	 Jdlrobson: cool, i'll sync out the backport
[23:36:26] <brennen>	 once merged, that is.  then test and go ahead to group2.
[23:37:11] <Jdlrobson>	 brennen: it looks like we'd be in a worse state by not deploying so definitely want to do this :)
[23:37:22] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638) (owner: 10Brennen Bearnes)
[23:37:53] <brennen>	 Jdlrobson: heh, yeah.  this is sort of the inverse of the typical train blocker.
[23:37:53] <Jdlrobson>	 https://logstash.wikimedia.org/goto/dbb3c95c431a5d301fd6f2cc32cd8fe0 not looking healthy
[23:38:21] <Jdlrobson>	 usually the top error is 1000 in 12hrs :/
[23:38:32] <brennen>	 oof, yeah.
[23:39:08] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:23] <wikibugs>	 (03PS13) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[23:43:18] <brennen>	 Jdlrobson: https://integration.wikimedia.org/ci/job/mwgate-node10-docker/200571/console
[23:43:30] <Jdlrobson>	 ahghh
[23:43:42] <brennen>	 my sentiments exactly
[23:44:12] <Jdlrobson>	 linting issue fixed
[23:44:15] <Jdlrobson>	 new patch up
[23:45:31] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm)
[23:46:10] <wikibugs>	 (03CR) 10Brennen Bearnes: Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638) (owner: 10Brennen Bearnes)
[23:47:40] <wikibugs>	 (03PS2) 10Brennen Bearnes: Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638)
[23:48:28] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638) (owner: 10Brennen Bearnes)
[23:49:21] <brennen>	 hrm.  is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/657652 going to need a recheck?  i don't think i've actually ever gotten myself into this situation with gerrit before.
[23:50:13] <Jdlrobson>	 brennen: think it should be fine
[23:50:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH)
[23:50:55] <brennen>	 ah, yeah, there we go.  started gate-and-submit again.
[23:51:14] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] `  Of...
[23:53:04] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2374.codfw.wmnet with reason: REIMAGE
[23:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2372.codfw.wmnet with reason: REIMAGE
[23:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:32] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2370.codfw.wmnet with reason: REIMAGE
[23:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:06] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2374.codfw.wmnet with reason: REIMAGE
[23:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:13] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2370.codfw.wmnet with reason: REIMAGE
[23:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:08] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] `  Of...
[23:57:10] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2372.codfw.wmnet with reason: REIMAGE
[23:57:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:55] <Daimona_>	 jouncebot now
[23:58:55] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 1 minute(s)
[23:59:02] <Daimona_>	 jouncebot next
[23:59:02] <jouncebot>	 In 0 hour(s) and 0 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210122T0000)