[00:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:06:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:06:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10Jclark-ctr) @Cmjohnson all host racked and cabled netbox updated host port logstash-be1033 39 logstash-be1034 21 logstash-be1035 7 [00:07:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash-be103[345] - https://phabricator.wikimedia.org/T267666 (10Jclark-ctr) [00:08:05] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2363.codfw.wmnet'] ` an... [00:09:00] 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Legoktm) Is this a problem with icinga? ` legoktm@mwdebug1003:~$ /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50 OK: opcache is healthy ` Doesn't seem like a permissions issue ei... [00:09:05] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2365.codfw.wmnet'] ` an... 
[00:09:34] (03PS5) 10Mstyles: update flink logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) [00:09:58] (03CR) 10Mstyles: update flink logging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [00:10:07] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2367.codfw.wmnet'] ` an... [00:10:39] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2369.codfw.wmnet'] ` an... [00:13:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:17:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2363.codfw.wmnet [00:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:02] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2367.codfw.wmnet [00:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:18] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2365.codfw.wmnet [00:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:43] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2369.codfw.wmnet [00:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:38] !log dzahn@cumin1001 conftool action 
: set/pooled=yes; selector: name=mw2363.codfw.wmnet [00:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2365.codfw.wmnet [00:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2367.codfw.wmnet [00:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:58] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2369.codfw.wmnet [00:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:10] 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) This always happens after reimaging a server and then disappears after it's been running for a while. [00:22:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:57] PROBLEM - Host releases2002 is DOWN: PING CRITICAL - Packet loss = 100% [00:23:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:26:56] (03CR) 10Bstorm: [C: 03+2] nfs: set default monitors for 10Gb Ethernet [puppet] - 10https://gerrit.wikimedia.org/r/656269 (https://phabricator.wikimedia.org/T218338) (owner: 10Bstorm) [00:27:32] 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) It's not consistent. For example mw2226 is OK but mw2224 and mw2225 have the alert but all 3 are buster and have been reimaged on the same day, 8 days ago. 
[00:27:47] 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Legoktm) mwdebug1003 was one of the first servers to be reimaged and it's still critical after over a month though [00:28:40] mutante: am I missing something? ^ [00:30:06] legoktm: I don't know but it's not consistent within the same type of hardware that was changed on the same day [00:30:15] while mwdebug1003 is different in other ways [00:30:21] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:44] * legoktm checks a non-debug server [00:31:48] mw1265 has the same issue [00:32:15] yea, many have it but not ALL of them [00:32:32] right [00:33:10] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=opcache&servicestatustypes=29 [00:33:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:34:31] as you already said, running the NRPE command locally on an affected host.. WORKS and is OK [00:34:37] confirmed that on yet another one [00:35:57] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:23] checking if that is REALLY the command that is in NRPE config [00:36:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:36:44] where do you find that config?
[00:36:53] /etc/nagios/nrpe.d [00:37:23] 2 command[check_opcache]=/usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50 [00:37:45] OK: opcache is healthy [00:37:58] wtf is this :) [00:38:36] 🙃 [00:38:40] it works locally but not from remote, it seems to be caused by buster but also not ALL buster hosts [00:39:49] (03PS1) 10Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) [00:39:56] could it be caching the result of an initial run and not updating properly? [00:40:21] there is parsing with jq ..twice [00:40:40] Am I reading https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/mediawiki/maintenance/initsitestats.pp#L4 right that the script just runs twice per month? [00:42:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:42:43] (03CR) 10jerkins-bot: [V: 04-1] sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [00:43:00] tabbycat: yeah. You can also ask a friendly sysadmin to run it manually for not-large wikis :) [00:43:20] tabbycat: it is "1 weeks 4 days" until the next time [00:43:24] mutante: and some awk too [00:43:25] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [x] db1166-db1176 (exceptions: db117[01]) have all had their default passwords changed to the idrac mgmt password. [] Chris is going to check out db117[01] tomorro... [00:43:39] legoktm: ah, thanks. Well if we could have it run for tr.wikivoyage... 
[00:43:52] Scribunto went apesh** on Meta [00:43:55] we can start it right now if you want [00:43:58] due to not being able to fetch stats [00:44:03] done [00:44:05] but that isnt specific to one wiki [00:44:16] it's also missing some wiktionary stats but I'm not sure which ones [00:44:18] !log legoktm@mwmaint1002:~$ mwscript initSiteStats.php --wiki=trwikivoyage --update [00:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:30] tabbycat: NOT the analytics ones, just assign that to me [00:44:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:45:16] legoktm: awesome, thanks; now https://tr.wikivoyage.org/wiki/%C3%96zel:%C4%B0statistikler displays some data [00:45:51] PROBLEM - PHP opcache health on mw2357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:47:17] PROBLEM - PHP opcache health on mw2361 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:47:38] meh, scribunto still not fetching https://meta.wikimedia.org/wiki/Talk:Www.wikivoyage.org_template -- I suspect there are few more dependencies [00:47:46] too late to debug, off to bed now [00:47:59] ACKNOWLEDGEMENT - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn debugging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:47:59] ACKNOWLEDGEMENT - PHP opcache health on mw2357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn debugging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:47:59] ACKNOWLEDGEMENT - PHP opcache health 
on mw2361 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn debugging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:49:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [00:51:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:52:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:52:59] (03PS1) 10Cwhite: profile: send w3creportingapi logs to indexes with custom schema [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) [00:56:34] PROBLEM - PHP opcache health on mw2338 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:57:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:57:55] (03CR) 10Cwhite: "Overall LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [01:00:04] twentyafterfour: May I have your attention please! Phabricator update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T0100) [01:00:06] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1624931824 and 96 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:01:00] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 682521880 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:01:38] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:02:18] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 105152 and 68 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:02:34] (03PS1) 10Ryan Kemper: Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) [01:02:44] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 146640 and 95 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:03:44] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 901291448 and 191 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:05:12] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 237856 and 242 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:05:15] (03CR) 10CRusnov: "This change is ready for review." 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/657454 (owner: 10CRusnov) [01:05:57] (03PS2) 10Ryan Kemper: Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) [01:06:13] (03CR) 10CRusnov: "Obviously tests are needed in -next before we deploy this to production. I'll be going through the changelog and looking for any issues wi" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/657454 (owner: 10CRusnov) [01:07:16] (03CR) 10Ryan Kemper: "Some things I'm unsure about:" [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper) [01:07:57] (03PS3) 10Ryan Kemper: Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) [01:09:14] (03CR) 10Ryan Kemper: "The decommissioning will be done in this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/657453" [puppet] - 10https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [01:10:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:12:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:14:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:14:52] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [01:15:58] 10SRE, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) >>! 
In T270517#6763884, @Legoktm wrote: > Is this a problem with icinga? Yes! And it's really weird. I tracked down the NRPE command that is run from Icinga and it behaves different on... [01:17:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:18:36] 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Legoktm) [01:19:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:22:15] !log [WDQS Deploy] Tests on canary `wdqs1003` passing before start of deploy, proceeding with deploy of wdqs `0.3.60` to canary [01:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:23:20] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@70f9d37]: 0.3.60 [01:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:50] !log [WDQS Deploy] Automated tests passing on canary`wdqs1003` but manually visiting `http://localhost:9999` (my tunnel to `wdqs1003`) gives `404 Not Found`from nginx; aborting deploy [01:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:05] !log [WDQS Deploy] Rollback of canary `wdqs1003` initiated [01:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:13] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@70f9d37]: 0.3.60 (duration: 02m 53s) [01:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:01] !log [WDQS Deploy] Rollback complete, service health of `wdqs1003` is restored. 
Need to investigate source of 404 (possibly related to some recent changes we made in the `gui` repo) [01:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:02] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [01:40:01] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [01:41:44] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) John was onsite and fixed db117[01] for me, they are now online. db11[56-65] have had bios and idrac firmware updates, and raid setup. I've updated the task descr... [01:42:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [01:43:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:49:50] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:51:56] PROBLEM - PHP opcache health on mw2367 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:53:48] PROBLEM - PHP opcache health on mw2369 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:54:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:02:29] 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) upon further investigation I realized mw2226 is actually still stretch and I made a mistake to mark it as DONE in the etherpad for appserver upgrades... som... [02:04:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_proton_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:06:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:10:40] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [02:11:58] 10SRE, 10ops-codfw: Degraded RAID on logstash2022 - https://phabricator.wikimedia.org/T269552 (10Papaul) 05Open→03Resolved @herron I am closing this task, please feel free to open a decom task when server is ready for decommission Thanks [02:12:34] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [02:13:56] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:19:35] 10SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (10Dzahn) [02:23:04] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:23:54] PROBLEM - PHP opcache health on mw2355 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [02:30:42] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:04] (03PS1) 10Legoktm: admin: Update my (legoktm)'s dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/657458 [02:32:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:33:36] (03PS2) 10Legoktm: admin: Update my (legoktm)'s dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/657458 [02:34:11] (03CR) 10Legoktm: [C: 03+2] admin: Update my (legoktm)'s dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/657458 (owner: 10Legoktm) [02:35:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:37:54] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:38:12] 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) I tried installing nagios-nrpe-server 4.0.3-1~bpo10+1 over 3.2.1-2 but that did not fix the issue either. [02:43:01] (03PS1) 10Legoktm: libraryupgrader: Update celery systemd units [puppet] - 10https://gerrit.wikimedia.org/r/657459 [02:51:10] PROBLEM - PHP opcache health on mw2359 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [02:57:42] RECOVERY - PHP opcache health on mw2225 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:03:55] 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) I found the issue. Changing line 28 in /usr/local/lib/nagios/plugins/nrpe_check_opcache to: ` OUT=$(/usr/local/bin/php7adm /opcache-info | jq . 2>&1) ` f... 
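The fix reported above swaps a bare `php7adm` call for its absolute path. A minimal sketch of the same idea, assuming a Nagios-style plugin written in shell: it refers to the helper by absolute path and returns UNKNOWN (exit code 3) when the helper is missing, rather than falling through to a threshold message. The function name `check_helper` and the overridable `PHP7ADM` variable are hypothetical and are not taken from the real `nrpe_check_opcache` script.

```shell
#!/bin/bash
# Hypothetical sketch, NOT the real nrpe_check_opcache plugin.
# Standard Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.

# Absolute path as quoted in the log; the override exists only for testing.
PHP7ADM="${PHP7ADM:-/usr/local/bin/php7adm}"

check_helper() {
    # Fail loudly with UNKNOWN if the helper cannot be executed,
    # so a lookup problem is not mistaken for a bad metric.
    if [ ! -x "$PHP7ADM" ]; then
        echo "UNKNOWN: $PHP7ADM not found or not executable"
        return 3
    fi
    echo "OK: helper found at $PHP7ADM"
    return 0
}
```

A plugin built this way would call `check_helper || exit $?` before it parses any output, so a missing binary surfaces as UNKNOWN in the monitoring UI.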
[03:09:11] 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) a:03Dzahn [03:21:05] (03PS1) 10Dzahn: nrpe_check_opcache: use full path to php7adm to fix opcache monitor on buster [puppet] - 10https://gerrit.wikimedia.org/r/657460 (https://phabricator.wikimedia.org/T270517) [03:25:00] (03CR) 10Dzahn: [C: 03+2] nrpe_check_opcache: use full path to php7adm to fix opcache monitor on buster [puppet] - 10https://gerrit.wikimedia.org/r/657460 (https://phabricator.wikimedia.org/T270517) (owner: 10Dzahn) [03:26:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:27:12] PROBLEM - PHP opcache health on mw2363 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:27:55] (03CR) 10Dzahn: "[alert1001:~] $ /usr/lib/nagios/plugins/check_nrpe -H mw2225.codfw.wmnet -c check_opcache" [puppet] - 10https://gerrit.wikimedia.org/r/657460 (https://phabricator.wikimedia.org/T270517) (owner: 10Dzahn) [03:28:06] PROBLEM - PHP opcache health on mw2365 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:28:54] RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:29:03] legoktm: ^ fixing :) [03:29:16] RECOVERY - PHP opcache health on mw1265 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:29:30] RECOVERY - PHP opcache health on mw2240 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:29:30] RECOVERY - PHP opcache health on mw2363 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:29:46] RECOVERY - PHP opcache health on mw2255 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:29:52] RECOVERY - PHP opcache health on mw2329 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:29:52] RECOVERY - PHP opcache health on mw2335 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:29:52] RECOVERY - PHP opcache health on mw2357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:30:20] RECOVERY - PHP opcache health on mw2234 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:30:40] 10SRE, 10Icinga, 10observability, 10serviceops, 10Patch-For-Review: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) ` 03:28 <+icinga-wm> RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Ap... [03:30:41] mutante: woooww nice find! So the PATH was off on the buster hosts? 
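One way to test the PATH question is to check whether a command name resolves under a specific search path, such as the one a daemon might run with. This is only a sketch: the function name `resolves_under_path` and the example PATH values are assumptions, and the log does not establish which PATH the NRPE daemon actually used.

```shell
#!/bin/bash
# Hypothetical helper for comparing command lookup under different PATH values.

resolves_under_path() {
    # $1 = command name, $2 = PATH value to test with.
    # The subshell keeps the caller's PATH untouched.
    # Note: shell builtins and functions resolve regardless of PATH.
    ( PATH="$2"; command -v "$1" >/dev/null 2>&1 )
}
```

For example, `resolves_under_path php7adm /usr/bin:/bin && echo found || echo missing` would show whether the name resolves without `/usr/local/bin` in the search path.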
[03:30:48] RECOVERY - PHP opcache health on mw2277 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:31:11] legoktm: yea, for some reason on stretch it worked without full path [03:31:16] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:31:17] but not anymore [03:31:20] RECOVERY - PHP opcache health on mw2310 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:31:28] though generally it's recommended to use full path in the plugins [03:31:41] php7adm is in the same location [03:31:42] * legoktm nods [03:32:16] RECOVERY - PHP opcache health on mw2231 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:32:28] /usr/local/bin seems to be in $PATH when i echo it .. but ..yea [03:32:32] RECOVERY - PHP opcache health on mw2325 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:32:44] RECOVERY - PHP opcache health on mw2369 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:32:46] RECOVERY - PHP opcache health on mw2274 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:33:12] RECOVERY - PHP opcache health on mw2327 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:33:45] legoktm: and then the little bonus things like that it did not exit with an error but claimed the 99.85% in the case it does not find php7adm .. and that I marked a server as buster that is stretch :) [03:34:04] alright, with all the recoveries now i'll head out. 
cya [03:34:11] Bye :)) [03:34:16] RECOVERY - PHP opcache health on mw2315 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:34:26] RECOVERY - PHP opcache health on mw2233 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:34:30] RECOVERY - PHP opcache health on mw2303 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:35:12] RECOVERY - PHP opcache health on mw2316 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:35:40] RECOVERY - PHP opcache health on mw1276 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:36:56] RECOVERY - PHP opcache health on mw1277 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:36:58] RECOVERY - PHP opcache health on mw2313 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:37:04] RECOVERY - PHP opcache health on mw2331 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:37:04] RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:37:36] RECOVERY - PHP opcache health on mw2236 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:38:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:38:30] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL 
- degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:38:48] RECOVERY - PHP opcache health on mw1267 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:38:52] 10SRE, 10Icinga, 10observability, 10serviceops, 10Patch-For-Review: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10Dzahn) 05Open→03Resolved [03:38:55] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [03:39:16] RECOVERY - PHP opcache health on mw2230 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:39:44] RECOVERY - PHP opcache health on mw2275 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:40:20] RECOVERY - PHP opcache health on mw2367 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:40:22] RECOVERY - PHP opcache health on mw2238 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:41:16] RECOVERY - PHP opcache health on mw2269 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:41:42] RECOVERY - PHP opcache health on mw2235 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:42:24] RECOVERY - PHP opcache health on mw2228 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:42:28] RECOVERY - PHP opcache health on mw2312 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:42:28] RECOVERY - PHP opcache health on mw2273 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:43:16] RECOVERY - PHP opcache health on mw2314 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:43:16] RECOVERY - PHP opcache health on mw2311 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:43:46] RECOVERY - PHP opcache health on parse2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:43:58] RECOVERY - PHP opcache health on mw2307 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:44:04] RECOVERY - PHP opcache health on mw2351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:44:04] RECOVERY - PHP opcache health on mw2339 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:45:30] RECOVERY - PHP opcache health on mw2268 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:45:30] RECOVERY - PHP opcache health on mw2227 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:46:24] RECOVERY - PHP opcache health on mw2338 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:46:28] RECOVERY - PHP opcache health on mw2270 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:46:52] RECOVERY - PHP opcache health on mw2224 is OK: OK: opcache is 
healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:47:18] RECOVERY - PHP opcache health on mw2243 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:47:54] RECOVERY - PHP opcache health on mw2232 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:47:54] RECOVERY - PHP opcache health on mw2237 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:48:28] RECOVERY - PHP opcache health on mw2305 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:51:08] RECOVERY - PHP opcache health on mw2359 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:51:24] RECOVERY - PHP opcache health on mw1266 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:52:24] RECOVERY - PHP opcache health on mw2242 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:52:24] RECOVERY - PHP opcache health on mw2239 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:52:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:52:33] !log milimetric@deploy1001 Started deploy [analytics/refinery@57589e7]: Minor typo fix [03:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:28] RECOVERY - PHP opcache health on mw2333 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:53:28] RECOVERY - PHP opcache health on mw2337 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:53:28] RECOVERY - PHP opcache health on mw2361 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:53:28] RECOVERY - PHP opcache health on mw2355 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:53:58] RECOVERY - PHP opcache health on mw2241 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:53:58] RECOVERY - PHP opcache health on mw2309 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:54:00] RECOVERY - PHP opcache health on mw2365 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:54:13] !log milimetric@deploy1001 deploy aborted: Minor typo fix (duration: 01m 39s) [03:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:55:52] RECOVERY - PHP opcache health on mw2229 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [03:56:32] (03PS5) 10Andrew Bogott: nova vendordata/firstboot: move puppet config into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273) [03:57:32] RECOVERY - PHP opcache health on mw2258 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:06:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:08:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:08:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:08:53] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata/firstboot: move puppet config into cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657401 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [04:10:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:15:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:17:48] 10Puppet, 10SRE: Unused puppet modules audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) [04:18:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:25:02] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:25:56] (03PS1) 10Andrew Bogott: Nova: reload api and api-metadata service when the vendordata source changes [puppet] - 10https://gerrit.wikimedia.org/r/657462 (https://phabricator.wikimedia.org/T271273) [04:25:58] (03PS1) 10Andrew Bogott: nova firstboot script: remove file updates that are handled by puppet [puppet] - 10https://gerrit.wikimedia.org/r/657463 (https://phabricator.wikimedia.org/T271273) [04:26:59] (03CR) 10jerkins-bot: [V: 04-1] Nova: reload api and api-metadata service when the vendordata source changes [puppet] - 10https://gerrit.wikimedia.org/r/657462 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [04:32:46] (03PS2) 10Andrew Bogott: Nova: reload api and api-metadata service when the vendordata source changes [puppet] - 10https://gerrit.wikimedia.org/r/657462 (https://phabricator.wikimedia.org/T271273) [04:32:48] (03PS2) 10Andrew Bogott: nova firstboot script: remove file updates that are handled by puppet [puppet] - 10https://gerrit.wikimedia.org/r/657463 (https://phabricator.wikimedia.org/T271273) [04:33:57] (03CR) 10Andrew Bogott: [C: 03+2] Nova: reload api and api-metadata service when the vendordata source changes [puppet] - 10https://gerrit.wikimedia.org/r/657462 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [04:34:24] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:39:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:41:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:49:05] (03CR) 10Andrew Bogott: [C: 03+2] nova firstboot script: remove file updates that are handled by puppet [puppet] - 10https://gerrit.wikimedia.org/r/657463 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [04:51:16] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01163 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [05:01:46] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:08:10] (03PS1) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) [05:11:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:18:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:22:25] (03CR) 10Andrew Bogott: "/usr/local/sbin/make-instance-vg: lvm is not active on this host; unable to create a volume." 
[puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [05:30:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:32:10] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:39:06] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:59:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:16:39] (03PS1) 10Marostegui: production-m2.sql.erb: Add INDEX grant to sockpuppet_import user [puppet] - 10https://gerrit.wikimedia.org/r/657468 (https://phabricator.wikimedia.org/T272533) [06:19:44] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) @Cmjohnson unfortunately the server isn't accessible yet - I cannot even reach its idrac :-( ` root@cumin1001:~# ping clouddb1019.eqiad.wmnet -c5 PING clouddb1019.eqiad.wmnet (10.64.48.9) 56(84)... 
[06:20:22] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add INDEX grant to sockpuppet_import user [puppet] - 10https://gerrit.wikimedia.org/r/657468 (https://phabricator.wikimedia.org/T272533) (owner: 10Marostegui) [06:31:10] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:30] 10Puppet, 10SRE: Unused puppet modules audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Joe) Your methodology is not 100% accurate, so before removing anything I'd verify with the authors/service owners as there can be some false positives. Also: - Let's exclude third-party modules like `stdli... [06:37:50] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:37:54] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:10] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:42:14] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:44:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:47:03] (03PS1) 10Marostegui: db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657469 [06:49:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 and pool db1099:3318 into s8 vslow', diff saved to https://phabricator.wikimedia.org/P13860 and previous config saved to /var/cache/conftool/dbconfig/20210121-064903-marostegui.json [06:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:20] (03CR) 10Marostegui: [C: 03+2] db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/657469 (owner: 10Marostegui) [06:53:36] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:54:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087', diff saved to https://phabricator.wikimedia.org/P13861 and previous config saved to /var/cache/conftool/dbconfig/20210121-065408-marostegui.json [06:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:26] (03PS1) 10Marostegui: Revert "db1087: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657432 [06:55:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repoool db1099:3318', diff saved to https://phabricator.wikimedia.org/P13862 and previous config saved to /var/cache/conftool/dbconfig/20210121-065459-marostegui.json [06:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:07] 10Puppet, 10SRE: Unused puppet modules audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) >>! In T272559#6764172, @Joe wrote: > Your methodology is not 100% accurate, so before removing anything I'd verify with the authors/service owners as there can be some false positives. > > Also:... 
[06:55:09] (03CR) 10Marostegui: [C: 03+2] Revert "db1087: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657432 (owner: 10Marostegui) [06:56:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:58:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:59:11] 10Puppet, 10SRE: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Ladsgroup) [07:01:04] ACKNOWLEDGEMENT - Host clouddb1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Marostegui T272125 [07:03:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:03:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repoool db1099:3318', diff saved to https://phabricator.wikimedia.org/P13863 and previous config saved to /var/cache/conftool/dbconfig/20210121-070346-marostegui.json [07:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:10:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:10:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:08] (03PS4) 10Effie Mouzeli: scap: enable logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi) [07:20:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:21:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P13864 and previous config saved to /var/cache/conftool/dbconfig/20210121-072101-marostegui.json [07:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:29:45] 10Puppet, 10SRE: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Joe) >>! In T272559#6764178, @Ladsgroup wrote: >> - It could be interesting to audit what is in puppetdb and check it against what is in the puppet tree. I suspect there is more stale stuff in site.pp than... 
[07:30:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:30:49] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:36:15] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:36:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:37:10] (03CR) 10Effie Mouzeli: [C: 03+2] scap: enable logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/574485 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi) [07:38:00] (03CR) 10Effie Mouzeli: [C: 03+2] scap: disable udp logging [puppet] - 10https://gerrit.wikimedia.org/r/657136 (https://phabricator.wikimedia.org/T227080) (owner: 10Effie Mouzeli) [07:38:14] (03PS2) 10Effie Mouzeli: scap: disable udp logging [puppet] - 10https://gerrit.wikimedia.org/r/657136 (https://phabricator.wikimedia.org/T227080) [07:38:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:40:55] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:42:33] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:43:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:45:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:56:41] (03PS3) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [07:58:55] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27561/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [08:00:48] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:08:03] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper) [08:08:56] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:11:00] (03PS2) 10Effie Mouzeli: mediawiki: reduce the number of cached keys that trigger a restart [puppet] - 10https://gerrit.wikimedia.org/r/657398 (https://phabricator.wikimedia.org/T245183) [08:11:03] (03PS1) 10Muehlenhoff: Remove tor::instance [puppet] - 10https://gerrit.wikimedia.org/r/657531 (https://phabricator.wikimedia.org/T272559) [08:14:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:16:22] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:20:13] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: reduce the number of cached keys that trigger a restart [puppet] - 10https://gerrit.wikimedia.org/r/657398 (https://phabricator.wikimedia.org/T245183) (owner: 10Effie Mouzeli) [08:22:44] (03PS4) 10Effie Mouzeli: modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy) [08:22:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:24:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:25:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:25:54] 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10fgiunchedi) Thank you @Cmjohnson ! contacting dell SGTM [08:26:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:26:56] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: decrease object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/656837 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [08:28:19] (03PS4) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [08:29:00] (03CR) 10Elukey: [C: 03+2] refinery: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/657363 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [08:30:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove tor::instance [puppet] - 10https://gerrit.wikimedia.org/r/657531 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff) [08:31:20] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff) [08:31:56] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[08:33:03] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (MoritzMuehlenhoff)
[08:33:37] (CR) Elukey: profile::analytics::refinery::job::hdfs_cleaner Update (2 comments) [puppet] - https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[08:34:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:34:47] !log roll-restart swift-object in codfw to apply new concurrency
[08:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:04] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (MoritzMuehlenhoff)
[08:36:21] (CR) Elukey: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey)
[08:36:30] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:37:52] !log Silence m1 hosts in preparation for the restart T271540
[08:37:53] (CR) Elukey: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey)
[08:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:56] T271540: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540
[08:38:25] (CR) Elukey: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey)
[08:38:40] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:41:42] is there maintenance for cr2-esams?
[08:42:27] (PS2) Jcrespo: admin: Add wikitrent to the list of privileged LDAP accounts [puppet] - https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489)
[08:42:35] This is the Lumen link to eqiad
[08:43:16] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337
[08:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:20] yes it seems so, they are fixing the link
[08:43:20] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[08:44:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:45:01] (PS9) Effie Mouzeli: varnish: Set debug=1 in X-Analytics header [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683)
[08:45:58] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27562/console" [puppet] - https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[08:51:41] !log stopping puppet and bacula for backup1001 T271540
[08:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:46] T271540: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540
[08:52:12] akosiaris jynus pre-steps done
[08:52:55] SRE, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (fgiunchedi) Thank you @Cmjohnson ! Doesn't look like the host likes the new disk :( Once ms-be1046 is repaired in T272396 I'll start decom of one host so there will be spare HP 4TB drives. ` => ld 11 modify reenable...
[08:53:55] marostegui, I can confirm bacula down
[08:54:06] \o/
[08:54:27] prometheus alert for monitoring may happen
[08:54:43] and etherpad alert might too
[08:54:48] I am ready to restart etherpad anyways
[08:54:59] the one that gathers https://grafana.wikimedia.org/d/413r2vbWk/bacula
[08:55:36] and the one for zarcillo
[08:55:53] sorry
[08:56:04] not zarcillo, dbbackups
[08:56:09] but that has not an alert
[08:56:35] the proxy will complain too?
[08:56:59] depends on how long it takes the alert might not happen
[08:57:23] not worrying, just trying to think all potential alerts so people don't wory
[08:57:29] *other
[08:57:57] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (dcaro) Most of the openstack ones are dynamically imported, see [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/manifests...
[08:58:15] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[08:58:31] (PS1) Ladsgroup: eventlogging: Remove multiple unused modules [puppet] - https://gerrit.wikimedia.org/r/657538 (https://phabricator.wikimedia.org/T272559)
[08:59:02] for next step, you will dump buffer pool, disable automatic pool on shutdown and then reduce the buffer pool ratio, is that what it means?
[08:59:12] that is done
[08:59:16] cool
[08:59:19] (PS5) Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[08:59:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:59:55] ^this is what I meant before
[09:00:02] let's go?
[09:00:07] +1
[09:00:11] !log m1 master restart - T271540
[09:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:18] T271540: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540
[09:00:23] stopping
[09:00:33] let us know if errors or success
[09:00:35] starting
[09:00:51] started
[09:00:52] checking
[09:01:09] everything should be back as normal
[09:01:10] checking etherpad
[09:01:21] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27563/console" [puppet] - https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[09:01:21] I can write fine
[09:01:45] etherpad logs looking ok
[09:02:02] I can create a new etherpad, so it looks good
[09:02:21] reloading now the non active proxy
[09:03:05] active proxy didn't failover?
[09:03:13] it was too fast :)
[09:03:26] librenms looking good
[09:03:33] akosiaris, just check any alert/anything wrong you can check :-)
[09:03:41] rt looking good
[09:03:43] will wait to reenable bacula
[09:03:57] jynus: ok
[09:03:57] racktables looking good
[09:04:04] Everything seems to be working fine
[09:04:09] IDP is also just fine, just tested access to a U2F token
[09:04:21] thank you moritzm
[09:05:21] will reenable bacula if everything else looks good and redo missed gerrit backup
[09:07:02] SRE, serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (JMeybohm) I don't see anything interesting in the 2.7.1 release (https://github.com/docker/distribution/releases/tag/v2.7.1, https://metadata.ftp-master.debian.org/changelogs//main/d/docker-regi...
[09:07:04] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:25] jynus: everything looks good yep
[09:09:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:11:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:11:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:13:24] ok, that was the alert I was waiting to recover
[09:14:27] there is a few puppet wmcs/no resources failures since 4:38, but not related to this
[09:14:43] arturo ^
[09:15:02] in a meeting
[09:15:29] (no rush, just a friendly heads up ping 0:-))
[09:15:38] doesn't impact our team
[09:20:41] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (jbond) > Is there a public API of it? it'd be amazing. Here are the docs for the [[ https://puppet.com/docs/puppetdb/6.13/api/index.html | puppetdb API ]]. You can run curl commands...
[09:21:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:23:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:26:54] (Abandoned) Thiemo Kreuz (WMDE): [POC] Convert all Wikipedia logos to (true) grayscale [mediawiki-config] - https://gerrit.wikimedia.org/r/609584 (https://phabricator.wikimedia.org/T252108) (owner: Thiemo Kreuz (WMDE))
[09:30:32] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:06] (PS1) Filippo Giunchedi: debian: add packaging [debs/alertmanager-webhook-logger] - https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474)
[09:37:18] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:26] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:38:54] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:41:58] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:44:09] !log Updated the Wikidata property suggester with data from the 2021-01-11 JSON dump and applied the T132839 workarounds
[09:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:14] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[09:47:25] (CR) Volans: [C: -1] "I think there are a couple of bugs to fix, see inline for the details" (4 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov)
[09:49:30] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:01] (CR) Volans: [C: +1] "Diff looks reasonable" [software/netbox-deploy] - https://gerrit.wikimedia.org/r/657454 (owner: CRusnov)
[09:52:36] SRE, SRE-Access-Requests: Requesting access to analytics for lilients - https://phabricator.wikimedia.org/T272264 (lilients_WMDE) I can now access the event logging metrics. I also got the mail for kerberos. Thank you for the support!
[09:54:13] SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (akosiaris)
[09:54:32] (PS1) Kormat: udp2log: Install bsection [puppet] - https://gerrit.wikimedia.org/r/657543
[09:55:31] SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (akosiaris) I 've marked T272111 as a parent of this task for greater visibility. This one seems more generic than the sp...
[09:55:56] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27564/console" [puppet] - https://gerrit.wikimedia.org/r/657543 (owner: Kormat)
[09:57:38] (CR) Kormat: [V: +1] "Apparently i forgot to do this when i created bsection." [puppet] - https://gerrit.wikimedia.org/r/657543 (owner: Kormat)
[10:01:27] (CR) Gehel: "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/656369 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[10:03:09] (CR) Gehel: "LGTM (with Moritz comments)." [puppet] - https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: Ryan Kemper)
[10:19:07] (CR) Gehel: [C: +2] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[10:20:09] (CR) Gehel: [C: +2] "LGTM" (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[10:20:44] (Merged) jenkins-bot: update flink logging [deployment-charts] - https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[10:30:48] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:46] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:41:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:50:05] (CR) Muehlenhoff: [C: +1] "Looks good, one comment (but feel free to ignore)" (1 comment) [debs/alertmanager-webhook-logger] - https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[10:55:14] (PS3) Elukey: varnish: block python-request UA bots for AQS [puppet] - https://gerrit.wikimedia.org/r/657288
[10:57:37] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: Jcrespo)
[10:57:52] (CR) Filippo Giunchedi: [C: +1] profile: ecs indices to use a weekly rotation [puppet] - https://gerrit.wikimedia.org/r/657371 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[10:59:32] (CR) Filippo Giunchedi: "LGTM overall, see inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[11:00:04] mvolz: Your horoscope predicts another unfortunate Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1100).
[11:00:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:01:23] (CR) Muehlenhoff: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: Ryan Kemper)
[11:02:07] (CR) Gehel: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: Ryan Kemper)
[11:02:28] (PS2) Filippo Giunchedi: debian: add packaging [debs/alertmanager-webhook-logger] - https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474)
[11:02:36] (PS1) Hnowlan: similar-users: release new container version with unicode parsing fixes [deployment-charts] - https://gerrit.wikimedia.org/r/657546
[11:02:41] (CR) Filippo Giunchedi: debian: add packaging (1 comment) [debs/alertmanager-webhook-logger] - https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[11:03:42] (CR) Muehlenhoff: [C: +1] "Ship it :-)" [debs/alertmanager-webhook-logger] - https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[11:03:48] (PS1) Elukey: profile::analytics::cluster::users: add analytics to the druid group [puppet] - https://gerrit.wikimedia.org/r/657547
[11:06:05] (CR) Hnowlan: [C: +2] similar-users: release new container version with unicode parsing fixes [deployment-charts] - https://gerrit.wikimedia.org/r/657546 (owner: Hnowlan)
[11:07:08] (PS2) Elukey: profile::analytics::cluster::users: add analytics to the druid group [puppet] - https://gerrit.wikimedia.org/r/657547
[11:07:28] (Merged) jenkins-bot: similar-users: release new container version with unicode parsing fixes [deployment-charts] - https://gerrit.wikimedia.org/r/657546 (owner: Hnowlan)
[11:12:46] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[11:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:53] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[11:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:49] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27566/console" [puppet] - https://gerrit.wikimedia.org/r/657547 (owner: Elukey)
[11:15:51] (CR) Ema: [C: +1] "LGTM and tests are green:" [puppet] - https://gerrit.wikimedia.org/r/657288 (owner: Elukey)
[11:18:07] (CR) Elukey: [C: +2] varnish: block python-request UA bots for AQS [puppet] - https://gerrit.wikimedia.org/r/657288 (owner: Elukey)
[11:18:47] (CR) Filippo Giunchedi: [V: +2 C: +2] debian: add packaging [debs/alertmanager-webhook-logger] - https://gerrit.wikimedia.org/r/657541 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[11:19:40] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (ArielGlenn)
[11:20:12] (CR) Arturo Borrero Gonzalez: [C: +2] cr/firewall.conf: cloud-in4: introduce ACL for novafullstack [homer/public] - https://gerrit.wikimedia.org/r/657358 (https://phabricator.wikimedia.org/T272486) (owner: Arturo Borrero Gonzalez)
[11:21:05] (Merged) jenkins-bot: cr/firewall.conf: cloud-in4: introduce ACL for novafullstack [homer/public] - https://gerrit.wikimedia.org/r/657358 (https://phabricator.wikimedia.org/T272486) (owner: Arturo Borrero Gonzalez)
[11:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085', diff saved to https://phabricator.wikimedia.org/P13867 and previous config saved to /var/cache/conftool/dbconfig/20210121-112849-marostegui.json
[11:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:24] !log Stop replication on db1085 to move wiki replicas under the other sanitarium host
[11:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:13] (PS1) Ayounsi: Remove unused roles librenms and rancid [puppet] - https://gerrit.wikimedia.org/r/657551 (https://phabricator.wikimedia.org/T272559)
[11:31:26] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:46] (PS1) Arturo Borrero Gonzalez: Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - https://gerrit.wikimedia.org/r/657439
[11:32:18] (CR) Ayounsi: [C: +1] Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - https://gerrit.wikimedia.org/r/657439 (owner: Arturo Borrero Gonzalez)
[11:32:51] (PS2) Arturo Borrero Gonzalez: Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - https://gerrit.wikimedia.org/r/657439 (https://phabricator.wikimedia.org/T209082)
[11:33:12] (CR) Arturo Borrero Gonzalez: [C: +2] Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - https://gerrit.wikimedia.org/r/657439 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez)
[11:33:44] (Merged) jenkins-bot: Revert "Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"" [homer/public] - https://gerrit.wikimedia.org/r/657439 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez)
[11:35:16] SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui)
[11:35:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13868 and previous config saved to /var/cache/conftool/dbconfig/20210121-113533-root.json
[11:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:22] (PS2) Hnowlan: services: similar-users discovery and LVS component [puppet] - https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837)
[11:38:04] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13870 and previous config saved to /var/cache/conftool/dbconfig/20210121-115036-root.json
[11:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:17] (CR) Ayounsi: [C: +2] Remove unused roles librenms and rancid [puppet] - https://gerrit.wikimedia.org/r/657551 (https://phabricator.wikimedia.org/T272559) (owner: Ayounsi)
[11:54:57] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (ayounsi)
[11:57:55] (CR) David Caro: [C: +1] "Now with the proper exceptions should work ok 😊" [homer/public] - https://gerrit.wikimedia.org/r/657439 (https://phabricator.wikimedia.org/T209082) (owner: Arturo Borrero Gonzalez)
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1200).
[12:00:04] No GERRIT patches in the queue for this window AFAICS.
[12:03:54] yup, looks like nothing to do
[12:05:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13871 and previous config saved to /var/cache/conftool/dbconfig/20210121-120540-root.json
[12:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:21] (CR) Volans: "Nice addition! Minor things inline." (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond)
[12:07:03] (PS2) Matthias Mullie: Add global to indicate that elastic LTR features are available [mediawiki-config] - https://gerrit.wikimedia.org/r/646663
[12:07:17] (CR) Mforns: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/657547 (owner: Elukey)
[12:20:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: After moving wikireplicas to another host', diff saved to https://phabricator.wikimedia.org/P13872 and previous config saved to /var/cache/conftool/dbconfig/20210121-122043-root.json
[12:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:35] (PS1) Marostegui: sys: Add the current version of sys database. [software] - https://gerrit.wikimedia.org/r/657558
[12:22:41] (CR) Marostegui: [C: +2] sys: Add the current version of sys database. [software] - https://gerrit.wikimedia.org/r/657558 (owner: Marostegui)
[12:27:41] (PS15) Hnowlan: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[12:29:23] (CR) jerkins-bot: [V: -1] start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[12:31:18] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:47] (PS1) Ladsgroup: logstash: Migrate hiera() to lookup() and setting datatype [puppet] - https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953)
[12:33:59] (CR) Hnowlan: start using imposm as OSM sync tool (2 comments) [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[12:35:02] PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:35:34] PROBLEM - very high load average likely xfs on ms-be2021 is CRITICAL: CRITICAL - load average: 84.11, 108.71, 66.41 https://wikitech.wikimedia.org/wiki/Swift
[12:37:58] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:39:18] RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:39:41] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27569/" [puppet] - https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[12:39:58] RECOVERY - very high load average likely xfs on ms-be2021 is OK: OK - load average: 22.88, 60.73, 56.86 https://wikitech.wikimedia.org/wiki/Swift
[12:58:34] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:29] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (Ladsgroup) @jbond Thanks for the detailed comment. I will definitely use it to redo most of the work the script does but one big problem. Since I'm no SRE, I can't login to puppetdb1...
[13:12:00] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:31] (PS1) Muehlenhoff: Remove obsolete role [puppet] - https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559)
[13:21:25] (PS1) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[13:21:54] (CR) jerkins-bot: [V: -1] utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: Jbond)
[13:22:52] (CR) Filippo Giunchedi: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[13:23:10] (PS3) JMeybohm: docker_registry_ha: Add "Vary: Accept" to response [puppet] - https://gerrit.wikimedia.org/r/650153 (https://phabricator.wikimedia.org/T256762)
[13:24:16] (PS1) A2569875: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612)
[13:24:18] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: A2569875)
[13:25:26] (CR) JMeybohm: [C: +2] docker_registry_ha: Add "Vary: Accept" to response [puppet] - https://gerrit.wikimedia.org/r/650153 (https://phabricator.wikimedia.org/T256762) (owner: JMeybohm)
[13:26:52] (CR) Lucas Werkmeister (WMDE): "Have you had any luck with those tests yet? Otherwise we’re pretty stuck here :/" [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)
[13:31:59] (CR) Muehlenhoff: [C: +1] "The patch looks good to me. I'd suggest to move the system user/group handling to systemd::sysuser to simplify things, but that's unrelate" [puppet] - https://gerrit.wikimedia.org/r/657547 (owner: Elukey)
[13:32:10] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:12] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (jbond) @Ladsgroup managed to nerdsnipe be good on this one :). I have created a [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/657571 | CR ]] which is mostly the logic in y...
[13:33:22] (PS2) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[13:38:13] !log put eqiad/esams lumen link back in service
[13:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:10] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:02] SRE, netops: eqiad-esams link issue - https://phabricator.wikimedia.org/T272524 (ayounsi) Open→Resolved Back in service.
[13:44:48] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 113.71, 100.11, 70.21 https://wikitech.wikimedia.org/wiki/Swift
[13:48:06] (PS1) Marostegui: *.sql: Add sql_log_bin=0 [software] - https://gerrit.wikimedia.org/r/657574
[13:48:09] SRE, MediaWiki-Containers, serviceops, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Joe) {meme, src="antoine-approve", below="{{done\}\}"}
[13:49:28] (CR) Kormat: [C: +1] *.sql: Add sql_log_bin=0 [software] - https://gerrit.wikimedia.org/r/657574 (owner: Marostegui)
[13:49:30] RECOVERY - very high load average likely xfs on ms-be2055 is OK: OK - load average: 57.02, 73.63, 66.87 https://wikitech.wikimedia.org/wiki/Swift
[13:49:38] (CR) Marostegui: [C: +2] *.sql: Add sql_log_bin=0 [software] - https://gerrit.wikimedia.org/r/657574 (owner: Marostegui)
[13:50:22] (Merged) jenkins-bot: *.sql: Add sql_log_bin=0 [software] - https://gerrit.wikimedia.org/r/657574 (owner: Marostegui)
[13:53:08] PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast3004.wikimedia.org
[13:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:10] (PS1) Mforns: Migrate SuggestedTagsAction to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/657579 (https://phabricator.wikimedia.org/T267351)
[13:57:18] (PS1) Kormat: dbtools: Add sys/apply script [software] - https://gerrit.wikimedia.org/r/657581
[13:58:16] (PS1) Muehlenhoff: Add comment in site.pp for former bastions [puppet] - https://gerrit.wikimedia.org/r/657583
[13:59:00] (CR) Marostegui: [C: +1] dbtools: Add sys/apply script [software] - https://gerrit.wikimedia.org/r/657581 (owner: Kormat)
[13:59:33] (CR) Muehlenhoff: [C: +2] Add comment in site.pp for former bastions [puppet] - https://gerrit.wikimedia.org/r/657583 (owner: Muehlenhoff)
[14:00:04] brennen and liw: How many deployers does it take to do Mediawiki train - American+European Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1400).
[14:00:46] (CR) Kormat: [C: +2] dbtools: Add sys/apply script [software] - https://gerrit.wikimedia.org/r/657581 (owner: Kormat)
[14:03:07] (CR) Volans: "By any chance was https://github.com/camptocamp/puppet-ghostbuster evaluated/discarded for some reason?" [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: Jbond)
[14:04:31] * jbond42 dosn;t want to look at the comment volans just posted :(
[14:04:51] SRE, Traffic, Patch-For-Review: Docker registry needs cache to vary on Accept header value - https://phabricator.wikimedia.org/T242200 (JMeybohm) Open→Resolved The registry now responds properly with `vary: Accept`
[14:04:59] (CR) Ottomata: "Not opposed at all, but I'd expect many many clients to have a UA of 'python-requests', no? Not just a malicious one?" [puppet] - https://gerrit.wikimedia.org/r/657288 (owner: Elukey)
[14:05:05] jbond42: lol
[14:05:07] SRE, Traffic: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Vgutierrez)
[14:06:20] (CR) Ottomata: [C: +1] Migrate SuggestedTagsAction to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/657579 (https://phabricator.wikimedia.org/T267351) (owner: Mforns)
[14:06:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3004.wikimedia.org
[14:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast4002.wikimedia.org
[14:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:13] SRE, Traffic: Consolidate misc servers at edge sites - https://phabricator.wikimedia.org/T257323 (MoritzMuehlenhoff)
[14:09:56] SRE, Traffic, Patch-For-Review: Consolidate edge bastion server into ganeti - https://phabricator.wikimedia.org/T257324 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is done.
[14:10:22] (03PS3) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) [14:10:49] (03CR) 10jerkins-bot: [V: 04-1] utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond) [14:13:32] (03CR) 10Jbond: utils::audit: add puppet audit script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond) [14:13:46] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4002.wikimedia.org [14:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast5001.wikimedia.org [14:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:00] !log roll-restart swift-object in eqiad to apply new concurrency [14:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:50] (03PS1) 10JMeybohm: Demo - don't merge: Add a new listener to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/657591 [14:20:52] (03PS1) 10JMeybohm: Demo - don't merge: Enable the service-proxy-demo listener for MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/657592 [14:20:54] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5001.wikimedia.org [14:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:24:50] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27570/console" [puppet] - 10https://gerrit.wikimedia.org/r/657591 (owner: 10JMeybohm) [14:25:14] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27571/console" [puppet] - 10https://gerrit.wikimedia.org/r/657592 (owner: 10JMeybohm) [14:26:41] jouncebot: next [14:26:41] In 2 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1700) [14:30:26] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:26] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:44] RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:27] (03PS1) 10David Caro: config: allow using ~ for cookbook path [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 [14:51:00] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::analytics::cluster::users: add analytics to the druid group [puppet] - 10https://gerrit.wikimedia.org/r/657547 (owner: 10Elukey) [14:51:36] (03PS1) 10David Caro: gitignore: add vim swap files [software/spicerack] - 10https://gerrit.wikimedia.org/r/657609 [14:54:18] (03CR) 10Jbond: icinga: add wait_for_optimal function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond) [14:55:06] (03PS2) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) [14:55:08] (03PS1) 
10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) [14:55:45] (03CR) 10jerkins-bot: [V: 04-1] config: allow using ~ for cookbook path [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 (owner: 10David Caro) [14:56:45] (03CR) 10jerkins-bot: [V: 04-1] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [14:58:22] (03PS1) 10JMeybohm: Remove discovery.enabled from services (it's unused) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657613 [14:59:14] (03PS2) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) [14:59:16] (03PS3) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) [15:00:47] (03CR) 10jerkins-bot: [V: 04-1] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [15:01:05] (03CR) 10Volans: "All comments are the outcome of a chat with John" (034 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [15:01:45] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove discovery.enabled from services (it's unused) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657613 (owner: 10JMeybohm) [15:03:00] (03CR) 10Elukey: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/657288 (owner: 10Elukey) [15:03:12] (03PS1) 10Alexandros Kosiaris: Remove k8s::ssl [puppet] - 10https://gerrit.wikimedia.org/r/657615 (https://phabricator.wikimedia.org/T272559) [15:06:28] (03PS3) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 
10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) [15:06:30] (03PS4) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) [15:06:57] (03CR) 10Ladsgroup: [C: 03+1] "Its last usage was removed four years ago I6ad769d0225c4" [puppet] - 10https://gerrit.wikimedia.org/r/657615 (https://phabricator.wikimedia.org/T272559) (owner: 10Alexandros Kosiaris) [15:08:09] (03CR) 10jerkins-bot: [V: 04-1] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [15:10:29] (03PS4) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) [15:10:31] (03PS5) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) [15:11:22] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [15:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:10] (03CR) 10jerkins-bot: [V: 04-1] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [15:12:21] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) @Volans pointed me towards [[ https://github.com/camptocamp/puppet-ghostbuster | puppet-ghostbuster ]]. I have run this locally with a tunnel to the puppetdb server and here a... [15:12:50] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[15:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:17] !log installing cairo security updates on stretch [15:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:24] (03PS5) 10Andrew Bogott: Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) [15:13:26] (03PS6) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) [15:16:24] (03CR) 10Andrew Bogott: [C: 03+2] Nova: move vendordata handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/657610 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [15:18:07] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 (owner: 10David Caro) [15:19:33] (03PS6) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [15:19:39] (03PS1) 10Muehlenhoff: Add library hint for cairo [puppet] - 10https://gerrit.wikimedia.org/r/657621 [15:21:01] (03CR) 10jerkins-bot: [V: 04-1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [15:22:11] 10SRE, 10Icinga, 10observability, 10serviceops: Investigate opcache hit rate on Buster appserver - https://phabricator.wikimedia.org/T270517 (10RLazarus) Nice find! Thanks for tracking this down. 
[15:25:09] (03PS7) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [15:25:25] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for cairo [puppet] - 10https://gerrit.wikimedia.org/r/657621 (owner: 10Muehlenhoff) [15:26:16] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27577/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [15:26:45] (03CR) 10Volans: [C: 03+1] "LGTM, why not :-)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 (owner: 10David Caro) [15:27:15] (03PS1) 10Anne Tomasevich: Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548) [15:27:20] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/657609 (owner: 10David Caro) [15:28:40] (03PS2) 10Anne Tomasevich: Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548) [15:29:11] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10CBogen) [15:31:04] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/657615 (https://phabricator.wikimedia.org/T272559) (owner: 10Alexandros Kosiaris) [15:35:45] 10Puppet, 10SRE, 
10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10akosiaris) [15:37:14] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10akosiaris) I 've checked off stdlib and lvm classes as they are from external modules that have been imported to the tree as is (aka vendoring). [15:38:12] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:34] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004896 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:45:32] (03CR) 10Eric Gardner: [C: 03+1] Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548) (owner: 10Anne Tomasevich) [15:45:50] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff) [15:59:08] <_joe_> is someone looking at the puppet failures? 
I'm in a meeting rn [15:59:20] <_joe_> oh it was a recovery [15:59:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host krb2001.codfw.wmnet [15:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:36] (03PS2) 10Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) [16:03:38] (03PS1) 10Ayounsi: Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748) [16:04:10] (03CR) 10jerkins-bot: [V: 04-1] Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748) (owner: 10Ayounsi) [16:04:20] (03PS2) 10Ayounsi: Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748) [16:04:53] (03CR) 10jerkins-bot: [V: 04-1] sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [16:05:16] PROBLEM - Check the last execution of replicate-krb-database on krb1001 is CRITICAL: CRITICAL: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:05:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2001.codfw.wmnet [16:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:52] (03CR) 10Ayounsi: [C: 03+2] Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748) (owner: 10Ayounsi) [16:06:32] (03Merged) 10jenkins-bot: Add Lumen transit in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657627 (https://phabricator.wikimedia.org/T271748) (owner: 10Ayounsi) 
[16:07:14] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:59] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet [16:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:56] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:10] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10dancy) Thanks Legoktm. Small feature request: Can you add "last updated at " text to the top right corner of the page? [16:13:36] (03CR) 10JMeybohm: [C: 03+2] Remove discovery.enabled from services (it's unused) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657613 (owner: 10JMeybohm) [16:14:10] RECOVERY - Check the last execution of replicate-krb-database on krb1001 is OK: OK: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:14:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet [16:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:04] (03Merged) 10jenkins-bot: Remove discovery.enabled from services (it's unused) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657613 (owner: 10JMeybohm) [16:15:29] (03PS1) 10Mforns: Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) [16:16:05] (03PS1) 10Hnowlan: similar-users: remove unused releases, set log to DEBUG in staging, new version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/657631 [16:22:21] (03CR) 10Gmodena: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657631 (owner: 10Hnowlan) [16:24:52] (03CR) 10Hnowlan: [C: 03+2] similar-users: remove unused releases, set log to DEBUG in staging, new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657631 (owner: 10Hnowlan) [16:26:22] (03Merged) 10jenkins-bot: similar-users: remove unused releases, set log to DEBUG in staging, new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657631 (owner: 10Hnowlan) [16:26:36] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) >>! In T179696#6765834, @dancy wrote: > Thanks Legoktm. Small feature request: Can you add "last updated at > " text to the top righ... [16:27:18] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [16:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:29:37] (03CR) 10RLazarus: "LGTM -- please test with httpbb on one host before deploying everywhere, either before or after merging" [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [16:29:43] (03CR) 10RLazarus: [C: 03+1] mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [16:31:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:32:08] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:12] PROBLEM - MariaDB Replica Lag: m2 on db2133 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1344.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:32:26] PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1358.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:32:59] checking [16:35:15] (03PS3) 10Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) [16:35:26] must be something on 2133 [16:35:37] jynus: yes, I am on it [16:37:12] PROBLEM - MariaDB Replica SQL: m2 on db2133 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1061, Errmsg: Error Duplicate key name ix_user_user_text on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:37:25] ^there you have it [16:37:32] jynus: yes, I am on it [16:39:12] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:37] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10bd808) > that are not used anywhere (including WMCS, you can use it too (an example). Be aware that the puppet class/role reporting for Cloud VPS instances **//only//** reports the o... 
[16:53:43] (03PS1) 10Ottomata: Install python3-snappy for webperf navtiming [puppet] - 10https://gerrit.wikimedia.org/r/657639 (https://phabricator.wikimedia.org/T272613) [16:54:18] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 4.904e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:54:41] (03PS2) 10Cwhite: logstash: enable curator to accept custom age filters [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) [16:55:52] (03CR) 10Gilles: [C: 03+1] Install python3-snappy for webperf navtiming [puppet] - 10https://gerrit.wikimedia.org/r/657639 (https://phabricator.wikimedia.org/T272613) (owner: 10Ottomata) [16:57:11] (03CR) 10Ottomata: [C: 03+2] Install python3-snappy for webperf navtiming [puppet] - 10https://gerrit.wikimedia.org/r/657639 (https://phabricator.wikimedia.org/T272613) (owner: 10Ottomata) [16:59:34] (03CR) 10Ottomata: "Could we make this index non 'w3creportingapi' specific, and instead use it for any/all events that use Event Platform based event schemas" [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite) [17:00:04] jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1700). [17:05:45] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) icinga::nsca::client is an example for something used in fundraising. 
the server is in production and the clients are in frack and that does not use the same puppetmaster [17:09:02] (03PS3) 10Jcrespo: admin: Add wikitrent to the list of privileged LDAP accounts [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) [17:10:10] (03CR) 10Jcrespo: [C: 03+2] admin: Add wikitrent to the list of privileged LDAP accounts [puppet] - 10https://gerrit.wikimedia.org/r/657378 (https://phabricator.wikimedia.org/T272489) (owner: 10Jcrespo) [17:12:52] (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite) [17:15:53] (03CR) 10Ottomata: [C: 03+1] "Ah ok, I think I had forgotten that. We'll just have to figure out how to reconcile that `http` field between our event schemas and ECS. " [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite) [17:16:54] RECOVERY - MariaDB Replica Lag: m2 on db2133 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:17:12] RECOVERY - MariaDB Replica SQL: m2 on db2133 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:18:17] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) 05Open→03Resolved The extra wmf privileges have been deployed on LDAP for wikitrent. Reopen if you find any issues while using gerrit because of that. 
[17:19:01] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Tchanders) Thanks @jcrespo [17:23:57] (03PS4) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) [17:24:24] (03CR) 10jerkins-bot: [V: 04-1] utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond) [17:28:29] (03PS5) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) [17:30:26] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond) [17:31:02] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:03] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul) [17:33:20] (03CR) 10Jeena Huneidi: [C: 03+2] "It seems like the comments have been addressed/I saw a +1 in them so I'll merge this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876) (owner: 10Mstyles) [17:33:40] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) have finished hacking with the audit script, this is the list produced by that script ` lines=5 alternatives::install apparmor::hardlink apt::noupgrade arclamp::profiler bacula... 
[17:34:06] (03PS1) 10Andrew Bogott: mwopenstackclients3.py: apply 70bade8f82a505b25e5cc1a09449dc6e0ebc34b6 to py3 [puppet] - 10https://gerrit.wikimedia.org/r/657668 (https://phabricator.wikimedia.org/T272553) [17:34:08] (03PS1) 10Andrew Bogott: wmcs-wikireplica-dns.py: Add support for db.svc.wikimedia.cloud. entries [puppet] - 10https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553) [17:34:33] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10jcrespo) Tchanders, small followup- I understand the process may not be trivial for newcomers, but the simplification, before I edited, on the Engineering's handbook made us unable to proceed with th... [17:34:52] (03Merged) 10jenkins-bot: update flink config with swift and other values [deployment-charts] - 10https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876) (owner: 10Mstyles) [17:34:56] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients3.py: apply 70bade8f82a505b25e5cc1a09449dc6e0ebc34b6 to py3 [puppet] - 10https://gerrit.wikimedia.org/r/657668 (https://phabricator.wikimedia.org/T272553) (owner: 10Andrew Bogott) [17:34:58] (03CR) 10jerkins-bot: [V: 04-1] wmcs-wikireplica-dns.py: Add support for db.svc.wikimedia.cloud. 
entries [puppet] - 10https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553) (owner: 10Andrew Bogott) [17:35:28] !log [wdqs] Depooled `wdqs1013` to allow it to catch up on lag [17:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:39] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) [17:36:41] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:10] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:42] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10jcrespo) Hey, @JTannerWMF, I tried to search on my own for your LDAP/Developer account, but the one you provided (JTanner (WMF)) doesn't exist. I am adding @ggellerman to the ticket (ap... [17:39:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:40:38] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Tchanders) >>! In T272489#6766201, @jcrespo wrote: > Tchanders, small followup- I understand the process may not be trivial for newcomers, but the simplification, before I edited, on the Engineering'... [17:41:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:42:07] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:42] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1013 is CRITICAL: 4.386e+04 ge 4.32e+04 Ryan Kemper Affected node has been depooled while it catches up on 12h of update lag: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1611196826628&to=1611251346414 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wiki [17:44:42] e?orgId=1&panelId=8&fullscreen [17:45:38] (03PS2) 10Andrew Bogott: wmcs-wikireplica-dns.py: Add support for db.svc.wikimedia.cloud. entries [puppet] - 10https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553) [17:45:40] (03PS1) 10Andrew Bogott: wmcs-wikireplica-dns.py: format with black [puppet] - 10https://gerrit.wikimedia.org/r/657671 (https://phabricator.wikimedia.org/T272553) [17:46:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:48:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:56:02] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul) [18:00:04] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. 
You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1800). [18:02:55] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:03] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10elukey) `statistics-privatedata-users` is deprecated, let's use `analytics-privatedata-users` (need @Ottomata's approval) [18:08:12] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:32] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:45] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) a:05Ottomata→03elukey [18:12:51] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:18] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:14:26] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:15:56] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:18:54] (03CR) 10CDanis: [C: 03+1] "This looks good, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite) [18:19:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:21:13] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:54] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10KFrancis) @jcrespo The NDA is out for signatures. I will confirm when it's complete. Thanks! [18:23:40] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10Ottomata) Approved. 
[18:26:10] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) Oo we'll also want eventstreams-internal.svc.* LVS set up too. [18:28:03] 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) [18:29:02] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) >>! In T272559#6766020, @Dzahn wrote: > icinga::nsca::client is an example for something used in fundraising. the server is in production and the clients are in frack and that... [18:30:26] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2371.codfw.wmnet with reason: REIMAGE [18:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2373.codfw.wmnet with reason: REIMAGE [18:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:23] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10ggellerman) Thanks, @jcrespo ! I have added @JKatzWMF who is Jazmin's manager now. Would you please let me know which records still list me as @JTannerWMF 's manager so that I can... 
[18:34:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2375.codfw.wmnet with reason: REIMAGE [18:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2371.codfw.wmnet with reason: REIMAGE [18:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:28] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2375.codfw.wmnet with reason: REIMAGE [18:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:13] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:10] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Andrew) [18:37:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2373.codfw.wmnet with reason: REIMAGE [18:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:35] (03CR) 10Andrew Bogott: [C: 03+2] Remove the 'letsencrypt' module [puppet] - 10https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: 10Andrew Bogott) [18:38:55] (03Abandoned) 10Andrew Bogott: Nova cloud-init: rework logic for initial volume setup [puppet] - 10https://gerrit.wikimedia.org/r/657464 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [18:42:24] 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) No approvals needed, this is just an ssh change (no permission changes) I only need to verify identity of requester and we should be done. @calbon @ACraze can we... 
[18:43:12] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) Thank for the update, looking forward for the process to be complete. Thanks to you! [18:44:18] PROBLEM - PHP7 rendering on mw2375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:44:58] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10jcrespo) @ggellerman Apologies for the mistake, I checked corporate ldap records, the one used for Google account authentication. Not sure if it is also used for some of the other hr too... [18:46:58] 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) [18:47:06] (03PS1) 10Andrew Bogott: cinder: comment out the memcached servers for keystone authtoken [puppet] - 10https://gerrit.wikimedia.org/r/657673 (https://phabricator.wikimedia.org/T272113) [18:48:08] (03CR) 10Andrew Bogott: [C: 03+2] cinder: comment out the memcached servers for keystone authtoken [puppet] - 10https://gerrit.wikimedia.org/r/657673 (https://phabricator.wikimedia.org/T272113) (owner: 10Andrew Bogott) [18:48:18] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27578/console" [puppet] - 10https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [18:48:55] (03PS1) 10Jcrespo: admin: Update ssh key for accraze [puppet] - 10https://gerrit.wikimedia.org/r/657674 (https://phabricator.wikimedia.org/T272541) [18:49:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2226.codfw.wmnet with reason: REIMAGE [18:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:07] !log 
dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2226.codfw.wmnet with reason: REIMAGE [18:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:32] !log razzi@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes - razzi@cumin1001 [18:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:11] PROBLEM - Host mw2375 is DOWN: PING CRITICAL - Packet loss = 100% [18:54:29] (03PS5) 10Joal: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) [18:55:14] (03CR) 10Joal: "Thanks for the explanation elukey - should be ok now :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [18:55:49] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2371.codfw.wmnet'] ` an... [18:56:05] RECOVERY - PHP7 rendering on mw2375 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:56:07] RECOVERY - Host mw2375 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms [18:56:44] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2373.codfw.wmnet'] ` an... 
[18:56:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) p:05Triage→03High [18:58:21] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2375.codfw.wmnet'] ` an... [18:59:07] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:00:05] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T1900). [19:00:05] ottomata, hmonroy, and mforns: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:22] here :] [19:00:26] I can deploy today :) [19:00:34] I also represent ottomata from my team [19:00:44] ack [19:00:48] hmonroy: are you here? 
[19:00:57] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:01:43] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:01:44] (03CR) 10Urbanecm: [C: 03+2] "B&C" [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657391 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata) [19:01:47] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [19:01:47] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:02:01] (03CR) 10Urbanecm: [C: 03+2] "B&C" [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657392 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata) [19:02:13] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:02:34] mforns: do the config patches depend on the backports (ie. can I deploy them before)? 
[19:02:42] (03PS1) 10Jeena Huneidi: rdf-streaming-updater: Increment version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657676 [19:02:45] yes, I'm here [19:02:51] Urbanecm: you can deploy them before [19:02:59] ack, thanks [19:03:06] (03CR) 10Urbanecm: [C: 03+2] Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns) [19:03:43] (03PS1) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677 [19:03:48] (03CR) 10Urbanecm: [C: 03+2] Migrate SuggestedTagsAction to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657579 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns) [19:03:51] (03CR) 10Urbanecm: Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns) [19:04:48] (03Merged) 10jenkins-bot: Migrate SuggestedTagsAction to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657579 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns) [19:05:16] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm) [19:05:17] mforns: please test 657579: Migrate SuggestedTagsAction to Event Platform on all wikis | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/657579 at mwdebug1001 [19:05:38] hi hmonroy, will ping you once your patches are ready :) [19:05:48] urbanecm: thank you! 
[19:05:52] Urbanecm: doing [19:05:59] thanks a lot Urbanecm [19:06:07] (03PS2) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677 [19:06:17] np :) [19:06:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2371.codfw.wmnet [19:06:56] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2373.codfw.wmnet [19:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2375.codfw.wmnet [19:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:35] (03CR) 10Dzahn: "This wouldn't have been true for me personally, fwiw." [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm) [19:11:15] mforns: how is it going? anything i can help with? [19:11:48] Urbanecm: I was looking at Kafka to see if events are flowing in, can not see them, but might be because the stream is low throughput [19:12:06] I have never used mwdebug1001 to test, will try now [19:12:20] mforns: ah, that's because the change is not yet deployed [19:12:27] oh, ok ok [19:13:17] you need to install an extension from https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions to your browser, enable it, pick mwdebug1001 as your server, and do sth on-wiki to make the server send the event, and then you can see it in Kafka [19:13:25] (03PS1) 10Andrew Bogott: Revert "cinder: comment out the memcached servers for keystone authtoken" [puppet] - 10https://gerrit.wikimedia.org/r/657650 [19:13:56] mforns: it's a way how to test a change before actually pushing it live, affecting everyone else, to make sure it doesn't bring us down, or cause other bad things [19:14:06] of course [19:14:30] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cinder: comment out the memcached servers for 
keystone authtoken" [puppet] - 10https://gerrit.wikimedia.org/r/657650 (owner: 10Andrew Bogott) [19:14:35] I wasn't aware this was a requirement [19:15:01] Urbanecm: please feel free to cancel those patches if they are blocking the window [19:15:24] does that mean there's an issue with testing them mforns ? [19:15:28] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) @Lea_WMDE To speed up access, could you come back to me about questions at T271725#6755696. Interns and researchers, in our best practices, have a time-bound... [19:16:12] Urbanecm: btw i will be available in 15 mins and can help with testing both of these things [19:16:16] mforns: ^ [19:16:20] Urbanecm: I assume it will take a while, but if it's not blocking the window I'm trying [19:16:30] ok, thanks! [19:17:21] mforns: it's all right, we have time :). Sorry, I thought you're familiar with the process, would make it more clear otherwise :) [19:17:35] no problem, thanks! [19:18:42] (03CR) 10Urbanecm: [C: 03+2] "docs-only change, no-op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651890 (https://phabricator.wikimedia.org/T255790) (owner: 10Samwilson) [19:18:47] (03CR) 10Jeena Huneidi: [C: 03+2] rdf-streaming-updater: Increment version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657676 (owner: 10Jeena Huneidi) [19:19:39] (03Merged) 10jenkins-bot: Add notes about load order of Wikisource and Collection extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/651890 (https://phabricator.wikimedia.org/T255790) (owner: 10Samwilson) [19:19:45] (and also ack to otto.mata's msg) [19:19:53] (03CR) 10Jcrespo: [C: 03+1] "I verified new key on a videocall, with Calbon confirming identity of Andy and Andy confirming the key's hash."
[puppet] - 10https://gerrit.wikimedia.org/r/657674 (https://phabricator.wikimedia.org/T272541) (owner: 10Jcrespo) [19:20:25] (03Merged) 10jenkins-bot: rdf-streaming-updater: Increment version [deployment-charts] - 10https://gerrit.wikimedia.org/r/657676 (owner: 10Jeena Huneidi) [19:20:27] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10ggellerman) Thanks, @jcrespo ! I'll ask IT if they can update ldap records to reflect @JTannerWMF 's current manager. [19:21:06] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) I don't know the answer to that question. Things may have changed over time. We'd have to ask frack people like Jeff Green. [19:21:10] (03CR) 10Jcrespo: [C: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/657674 (https://phabricator.wikimedia.org/T272541) (owner: 10Jcrespo) [19:21:48] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 0b46c9f1f75fc773f57bfa70521c9eaf20410b9e: [no-op] Add notes about load order of Wikisource and Collection extensions (T255790) (duration: 01m 11s) [19:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:52] T255790: Wikisource: Replace ElectronPDF with WSExport PDF support - https://phabricator.wikimedia.org/T255790 [19:21:56] (03CR) 10Jcrespo: [C: 03+2] admin: Update ssh key for accraze [puppet] - 10https://gerrit.wikimedia.org/r/657674 (https://phabricator.wikimedia.org/T272541) (owner: 10Jcrespo) [19:22:09] hmonroy: fyi, your docs-only change is merged [19:22:22] cool! 
[19:24:43] mforns: I tried to submit an event via mwdebug1001 to help you testing it, it sent a post-request to intake-analytics.wikimedia.org [19:25:13] Urbanecm: trying to do the same here [19:25:34] Urbanecm: I think I managed now :] [19:25:40] mforns: great :) [19:26:37] Urbanecm: yes, the event did come in to kafka :] [19:26:44] great! [19:26:45] syncing it then :) [19:26:48] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10jcrespo) @ggellerman I just saw another mistake on the corporate ldap not being up to date in terms of management, so I will stop using it to locate managers and use some of the hr tools... [19:27:11] Urbanecm: cool! thanks a lot for the patience [19:27:38] no problem :) [19:28:26] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 62c9c35a76e2d065922f8c9f5a58672240dea7de: Migrate SuggestedTagsAction to Event Platform on all wikis (T267351) (duration: 01m 03s) [19:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:30] T267351: SuggestedTagsAction Event Platform Migration - https://phabricator.wikimedia.org/T267351 [19:29:01] mforns: should be live :) [19:29:12] Urbanecm: ok! 
checking [19:29:35] mforns: I'll deploy hmonroy's patch, which appears to be simpler, now :) [19:29:42] ok [19:29:43] (03PS3) 10Urbanecm: Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) (owner: 10Tpt) [19:29:46] (03CR) 10Urbanecm: [C: 03+2] Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) (owner: 10Tpt) [19:29:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27580/console" [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [19:30:39] (03Merged) 10jenkins-bot: Refactor EventLogging Event Platform PHP integration [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657391 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata) [19:31:00] (03Merged) 10jenkins-bot: Enables the Wikisource extension on oldwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657407 (https://phabricator.wikimedia.org/T272163) (owner: 10Tpt) [19:31:49] hmonroy: your patch is available at mwdebug1001 for testing [19:31:59] checking [19:32:08] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10ggellerman) No need to apologize, @jcrespo - you surfaced something that I did not know about that needs to be fixed. I thank you for that :) [19:33:21] OoOOk ! hello! [19:33:25] mforns: where we at? 
[19:33:44] in the middle [19:33:46] : [19:33:47] :] [19:33:53] ottomata: 657579 Migrate SuggestedTagsAction to Event Platform on all wikis is deployed, currently deploying some other (unrelated) patch to give you time to appear :) [19:34:02] (03Merged) 10jenkins-bot: Fix possible undefined index warning in arg checking in EventServiceClient [extensions/EventLogging] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657392 (https://phabricator.wikimedia.org/T253121) (owner: 10Ottomata) [19:34:07] backports are merged, I'll pull them to mwdebug1002 so you can test [19:34:11] nice [19:34:12] ok [19:35:04] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) >>! In T272559#6766195, @jbond wrote: > have finished hacking with the audit script, this is the list produced by that script > diamond > diamond::collector > diamond::collect... [19:35:07] ottomata: mforns: ok, your backports are at mwdebug1002 for testing :) [19:35:17] (both of them) [19:35:20] testing on mwdebug1002 [19:35:57] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) ^ Those are just the ones that stood out to me from the list. I have not gone through the others. But it seems to me there are a lot of false positives here. Please don't delet... [19:36:18] hmonroy: how is it going with your patch? :) [19:37:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2226.codfw.wmnet'] ` an... [19:37:57] Urbanecm: tested, works perfect. [19:38:04] thanks, syncing it out then :) [19:38:08] Urbanecm: Looks good! I just checked with the team to make sure it is working as expected. 
[19:38:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2226.codfw.wmnet [19:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:19] thanks hmonroy, will sync too :) [19:38:30] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2371.codfw.wmnet [19:38:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2373.codfw.wmnet [19:38:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2375.codfw.wmnet [19:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:27] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2226.codfw.wmnet [19:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:58] PROBLEM - Host es2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:40:02] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 9.913 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:40:12] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/EventLogging/: ee830a5ec2051fa970084e89b477a44c384e309c: f7152a74e00404fc561c44d1c2e37d7f882e2f52: EventLogging backport, see commits for details (T253121) (duration: 01m 05s) [19:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:19] T253121: MEP Client MediaWiki PHP - https://phabricator.wikimedia.org/T253121 [19:40:20] ottomata: mforns: backports deployed [19:41:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 4bb9e5d13be702516368774732a9e1711bec42e5: Enables the Wikisource extension on oldwikisource 
(T272163) (duration: 01m 04s) [19:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:55] T272163: Install the Wikisource extension on oldwikisource - https://phabricator.wikimedia.org/T272163 [19:42:00] hmonroy: and deployed :) [19:42:24] (03PS2) 10Urbanecm: Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns) [19:42:29] (03CR) 10Urbanecm: [C: 03+2] Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns) [19:42:49] Urbanecm: woohoo thank you [19:42:59] no problem :) [19:43:18] (03Merged) 10jenkins-bot: Migrate WebUIActionsTracking schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657630 (https://phabricator.wikimedia.org/T267347) (owner: 10Mforns) [19:44:19] ottomata: mforns: 657630: Migrate WebUIActionsTracking schemas to Event Platform on testwiki | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/657630 is at mwdebug1002 for testing :) [19:44:27] Urbanecm: ok!
on it [19:45:03] mforns: thanks, let me know if there's something I can help you with :) [19:45:11] ok :] [19:45:48] (03CR) 10Legoktm: [C: 03+1] mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [19:45:58] RECOVERY - Host es2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.96 ms [19:47:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:47:19] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) >>! In T179696#6765834, @dancy wrote: > Thanks Legoktm. Small feature request: Can you add "last updated at > " text to the top... [19:47:19] 1/2 schemas tested [19:47:31] ack [19:49:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:49:01] (03PS1) 10Legoktm: docker_registry_ha: Add timestamp to build-homepage output [puppet] - 10https://gerrit.wikimedia.org/r/657678 (https://phabricator.wikimedia.org/T179696) [19:49:02] (afk for a bit! ) [19:49:18] ack [19:49:59] Urbanecm: both schemas tested, working! 
[19:50:07] mforns: great, syncing :) [19:51:42] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ac99da75f9507e19472ab3020be638262857ec07: Migrate WebUIActionsTracking schemas to Event Platform on testwiki (T267347; T271164) (duration: 01m 03s) [19:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:49] T267347: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 [19:51:49] T271164: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 [19:51:53] mforns: that should be all :). Anything else? [19:52:16] Urbanecm: don't think so! thanks a lot for showing me how to test :] [19:52:23] happy to help :) [19:53:14] Urbanecm: Thank you! [19:53:20] no problem :) [20:00:04] brennen and liw: (Dis)respected human, time to deploy Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T2000). Please do the needful. 
[20:03:35] (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657679 [20:03:37] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657679 (owner: 10Brennen Bearnes) [20:04:17] (03PS6) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) [20:04:28] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657679 (owner: 10Brennen Bearnes) [20:04:41] !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes - razzi@cumin1001 [20:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:02] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.27 [20:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:36] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jcrespo) Comments for persistence-related modules: but please @Marostegui @Kormat comment too. * profile::proxysql I wrote this for deployment of proxysql. While it is basic, it is f... [20:13:28] 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) [20:15:40] 10SRE, 10SRE-Access-Requests: Requesting ssh key change for production shell for Andy Craze - https://phabricator.wikimedia.org/T272541 (10jcrespo) 05Open→03Resolved a:03jcrespo Change was merged and should have been applied to all servers now. Reopen if you find any issues accessing the production cluster. 
[20:19:50] (PS7) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[20:20:12] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 1.645e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:22:58] (PS1) Ottomata: Finalize QuickSurvey* Event Platform migration [puppet] - https://gerrit.wikimedia.org/r/657681 (https://phabricator.wikimedia.org/T271165)
[20:24:23] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (jbond) Thanks for the review @Dzahn this is helping get rid of some false positives in the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/657571 | audit script ]]. i went th...
[20:25:20] (PS2) Ottomata: Finalize QuickSurvey* Event Platform migration [puppet] - https://gerrit.wikimedia.org/r/657681 (https://phabricator.wikimedia.org/T271165)
[20:33:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:34:17] (CR) Ahmon Dancy: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/657678 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[20:36:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:38:22] (CR) RLazarus: "I think this is a good idea -- just minor comments on the implementation." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/657677 (owner: Legoktm)
[20:40:47] (PS8) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[20:46:26] (CR) Ottomata: [C: +2] Finalize QuickSurvey* Event Platform migration [puppet] - https://gerrit.wikimedia.org/r/657681 (https://phabricator.wikimedia.org/T271165) (owner: Ottomata)
[20:56:34] (PS2) Ottomata: Remove wgEventLoggingSchemas ContentTranslationAbuseFilter override [mediawiki-config] - https://gerrit.wikimedia.org/r/639579 (https://phabricator.wikimedia.org/T259163)
[20:57:52] (Abandoned) Ottomata: Remove wgEventLoggingSchemas ContentTranslationAbuseFilter override [mediawiki-config] - https://gerrit.wikimedia.org/r/639579 (https://phabricator.wikimedia.org/T259163) (owner: Ottomata)
[21:01:17] SRE, Traffic, serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (Dzahn) `hieradata/role/common/cache/text.yaml` has: ` 60 helm-charts.wikimedia.org: 61 caching: 'normal' ` That should confirm that it is indeed the 24...
[21:01:28] is there a reason bast4002.wikimedia.org is unreachable to me?
[21:01:53] https://www.irccloud.com/pastebin/k8iiOPV2/
[21:02:27] chrisalbon: bast4003 is the new hotness
[21:03:10] should be a quick update to your ssh config, https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast4003.wikimedia.org has the correct fingerprint
[21:03:56] okay whew cool
[21:04:18] I thought it was because I upgraded to Ubuntu 20 or something and was like "NoooooooOOOooOoooo"
[21:04:22] Thanks rzl
[21:04:33] 👍
[21:08:13] (PS4) Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712)
[21:08:43] (CR) Jforrester: "Good to go?" [mediawiki-config] - https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[21:09:59] SRE, Traffic, serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (CDanis) >>! In T272633#6766881, @Dzahn wrote: > An easy way to do this would be to just switch 'normal' to 'pass' here. Then there would be no caching at all. We...
[21:24:30] jouncebot now
[21:24:30] For the next 0 hour(s) and 35 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210121T2000)
[21:24:35] rollin' back.
[21:25:41] brennen: sorry to hear, shout if you need anything from SRE
[21:25:50] rzl: thanks, will do.
[21:27:11] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.36.0-wmf.26
[21:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:17] (PS1) Ottomata: Remove migrated EventLoggingSchemas overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/657688 (https://phabricator.wikimedia.org/T259163)
[21:28:47] (PS1) Brennen Bearnes: Revert "all wikis to 1.36.0-wmf.27" [mediawiki-config] - https://gerrit.wikimedia.org/r/657689
[21:28:49] (CR) Brennen Bearnes: [C: +2] Revert "all wikis to 1.36.0-wmf.27" [mediawiki-config] - https://gerrit.wikimedia.org/r/657689 (owner: Brennen Bearnes)
[21:29:48] (Merged) jenkins-bot: Revert "all wikis to 1.36.0-wmf.27" [mediawiki-config] - https://gerrit.wikimedia.org/r/657689 (owner: Brennen Bearnes)
[21:30:20] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:25] hrm, looks like the error spike i was just seeing probably isn't train-related, but i will dig a bit before rolling back to group2.
[21:32:53] (CR) Ottomata: "To be deployed on Monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/657688 (https://phabricator.wikimedia.org/T259163) (owner: Ottomata)
[21:38:40] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:48] (CR) Daimona Eaytoy: "> Patch Set 4:" [mediawiki-config] - https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[21:47:15] (CR) Bstorm: [C: +1] wmcs-wikireplica-dns.py: format with black [puppet] - https://gerrit.wikimedia.org/r/657671 (https://phabricator.wikimedia.org/T272553) (owner: Andrew Bogott)
[21:48:07] SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (Ottomata) @elukey it works! I realized that since this service is not proxied via...
[21:49:48] (CR) Bstorm: [C: +1] "Definitely a hack-ish way to handle it, but this script is 99% for the wikireplicas and not even 1% a few other things that we thought of." [puppet] - https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553) (owner: Andrew Bogott)
[21:56:36] (PS1) DLynch: Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191)
[21:57:00] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.867 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[21:58:11] Jdlrobson: about?
[22:05:08] (PS1) Aklapper: mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - https://gerrit.wikimedia.org/r/657692
[22:06:05] (PS2) Aklapper: mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - https://gerrit.wikimedia.org/r/657692
[22:06:57] (PS9) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[22:07:58] (PS2) A2569875: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612)
[22:09:50] (CR) Andrew Bogott: [C: +2] wmcs-wikireplica-dns.py: Add support for db.svc.wikimedia.cloud. entries [puppet] - https://gerrit.wikimedia.org/r/657669 (https://phabricator.wikimedia.org/T272553) (owner: Andrew Bogott)
[22:10:01] (CR) Andrew Bogott: [C: +2] wmcs-wikireplica-dns.py: format with black [puppet] - https://gerrit.wikimedia.org/r/657671 (https://phabricator.wikimedia.org/T272553) (owner: Andrew Bogott)
[22:10:08] !log 1.36.0-wmf.27 train status: for avoidance of doubt, no deploys until further notice - sorting out T272638
[22:10:10] (PS2) Andrew Bogott: wmcs-wikireplica-dns.py: format with black [puppet] - https://gerrit.wikimedia.org/r/657671 (https://phabricator.wikimedia.org/T272553)
[22:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:12] T272638: TypeError: null is not an object (evaluating 't[e.title]') on mobile domain - https://phabricator.wikimedia.org/T272638
[22:17:23] (PS10) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[22:23:22] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 149 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:23:51] (CR) Legoktm: mediawiki: Port nrpe_check_opcache to Python (5 comments) [puppet] - https://gerrit.wikimedia.org/r/657677 (owner: Legoktm)
[22:23:59] (PS3) Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - https://gerrit.wikimedia.org/r/657677
[22:25:08] (PS11) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[22:25:36] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:30:54] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:32:24] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[22:32:38] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[22:36:55] (PS12) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[22:37:48] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:29] brennen: OK for me to sling out a beta config patch?
[22:39:57] James_F: that's fine, but give me one sec to clear things
[22:40:23] James_F: you're clear
[22:42:12] (PS5) Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712)
[22:42:19] (CR) Jforrester: [C: +2] wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[22:43:05] (Merged) jenkins-bot: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[22:45:57] brennen: All done, thanks.
[22:46:57] James_F: ack, thanks.
[22:46:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:49:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:50:08] (PS1) Urbanecm: wgAbuseFilterAflFilterMigrationStage: Set READ_NEW everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/657694 (https://phabricator.wikimedia.org/T269712)
[22:50:15] (PS1) Jforrester: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in production [mediawiki-config] - https://gerrit.wikimedia.org/r/657695 (https://phabricator.wikimedia.org/T269712)
[22:50:17] (PS1) Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in production [mediawiki-config] - https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712)
[22:50:19] (PS1) Jforrester: wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default [mediawiki-config] - https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712)
[22:51:23] sorry, didn't know you're uploading the patches James_F
[22:51:24] Urbanecm: Pah. :-)
[22:52:30] (Abandoned) Urbanecm: wgAbuseFilterAflFilterMigrationStage: Set READ_NEW everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/657694 (https://phabricator.wikimedia.org/T269712) (owner: Urbanecm)
[22:53:31] (CR) Urbanecm: [C: -2] "Do not merge until it's actually changed in the AbuseFilter's repo" [mediawiki-config] - https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[22:53:53] (CR) Urbanecm: [C: -2] "do not merge until we're sure the new schema doesn't cause any issues" [mediawiki-config] - https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester)
[22:54:09] Thanks.
[22:54:15] Was looking for the AF patch.
[22:54:19] just placed a procedural -2 to avoid bad things happening
[22:54:24] I don't think there is any
[22:54:34] Yeah, will write one.
[22:57:05] (PS2) Jforrester: wgAbuseFilterAflFilterMigrationStage: Stop setting, COMPAT_NEW is default [mediawiki-config] - https://gerrit.wikimedia.org/r/657697 (https://phabricator.wikimedia.org/T269712)
[22:57:16] (CR) Volans: "Quick first pass" (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[22:57:30] (CR) 20after4: [C: +1] "+1 because I can't +2" [puppet] - https://gerrit.wikimedia.org/r/657692 (owner: Aklapper)
[23:01:29] (CR) Urbanecm: [C: +1] "this sounds like a good idea" [puppet] - https://gerrit.wikimedia.org/r/657692 (owner: Aklapper)
[23:02:56] (CR) RLazarus: "Thanks! Almost there, IMO." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657677 (owner: Legoktm)
[23:05:46] rzl: will icinga handle extra lines as long as the first one starts with UNKNOWN?
[23:06:33] or is that determination solely exit code based?
[23:06:45] I *think* it's just the exit code but I'm not positive
[23:07:09] (CR) CRusnov: [V: +2 C: +2] Update to Netbox 2.10.3-wmf [software/netbox-deploy] - https://gerrit.wikimedia.org/r/657454 (owner: CRusnov)
[23:07:12] SRE, DBA, Phabricator: Grant phstats user SELECT rights to phstats user - https://phabricator.wikimedia.org/T272654 (Urbanecm)
[23:07:47] (PS5) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476)
[23:08:15] (CR) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[23:08:22] SRE, DBA, Phabricator: Grant phstats user SELECT rights to phstats user - https://phabricator.wikimedia.org/T272654 (Urbanecm)
[23:08:45] (PS3) Urbanecm: mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - https://gerrit.wikimedia.org/r/657692 (https://phabricator.wikimedia.org/T272654) (owner: Aklapper)
[23:10:36] (PS1) CDanis: tweak User-Agent for bot_posts_blocked_nets [puppet] - https://gerrit.wikimedia.org/r/657700 (https://phabricator.wikimedia.org/T272330)
[23:10:58] SRE, DBA, Phabricator, Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (Urbanecm)
[23:12:17] https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/pluginapi.html suggests it's just exit code and multiple lines are OK
[23:13:24] yeah -- no guarantee icinga does exactly the same thing, and for some reason I can't find anything about it in the icinga docs, but I think it's likely correct
[23:13:39] I mean, I'm sure the answer to this is known, I just don't know it :D
[23:15:02] (CR) CDanis: [C: +2] "0 tests failed, 0 tests skipped, 26 tests passed" [puppet] - https://gerrit.wikimedia.org/r/657700 (https://phabricator.wikimedia.org/T272330) (owner: CDanis)
[23:15:22] oh cdanis is about, I bet he knows
[23:15:37] I figured he'd be done for the day but now he's outed himself, the fool
[23:20:01] oh no
[23:20:10] (PS6) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476)
[23:20:13] re: icinga, yes, it is just the exit code that matters
[23:20:31] I don't believe the textual output matters at all, aside from it is shown in the UI
[23:20:45] thanks
[23:20:50] 👍
[23:21:01] (PS4) Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - https://gerrit.wikimedia.org/r/657677
[23:26:09] brennen: thcipriani hey
[23:26:14] sorry about the delay
[23:26:36] Jdlrobson: hey, wb. sorry for the dental appointment interruption.
[23:26:49] so the issue is only an issue if we rollback
[23:26:54] which hopefully we wont do
[23:26:57] i can prepare a fix now
[23:27:05] but maybe it's too late to roll the train forward?
[23:27:35] we're technically past the cutoff, but this is always a judgment call. in practice i'd rather it be fully deployed than left in a split state over the weekend.
[23:28:02] i _would_ like to avoid having to roll back after some window of time and then having things in a much more broken state than they are currently, though.
[23:29:09] i don't think that's super likely, but if a fix is quick i think i'm ok slinging it out and then rolling forward yet this afternoon.
[23:29:26] ...otherwise i guess i welcome advice.
[23:29:46] i think it's okay to roll forward
[23:29:52] the patch i need to write is going to be super trivial
[23:30:06] if we need to roll back, and ill be around for next 3 hrs, we can apply my patch
[23:30:39] (CR) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[23:31:16] (PS3) Bstorm: wikireplicas: add a multiinstance role for the dedicated analytics host [puppet] - https://gerrit.wikimedia.org/r/654558 (https://phabricator.wikimedia.org/T269211)
[23:31:19] Jdlrobson: k. let's go ahead and give it a shot.
[23:31:56] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:54] (CR) Bstorm: [C: +2] "I'll merge this, since it isn't connected to any hosts now. Whenever we want to add the needed hiera and connect this to the host, I'll le" [puppet] - https://gerrit.wikimedia.org/r/654558 (https://phabricator.wikimedia.org/T269211) (owner: Bstorm)
[23:33:05] (CR) Legoktm: mediawiki: Port nrpe_check_opcache to Python (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657677 (owner: Legoktm)
[23:33:06] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:33:09] brennen: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/657702 is the patch
[23:33:24] probably best to backport that now
[23:33:31] so that if we do rollback it's straightforward
[23:34:01] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:34:52] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:34:58] Jdlrobson: is that testable on an mwdebug?
[23:35:00] On the plus side this is the biggest test of our error logging tracking at 105,965 errors in the last 12hrs
[23:35:09] brennen: yes
[23:35:14] i can test it on mwdebug
[23:35:31] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[23:35:36] (PS1) Brennen Bearnes: Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638)
[23:35:48] (PS1) DLynch: A/B test output when a specific feature is being tested [extensions/DiscussionTools] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657653 (https://phabricator.wikimedia.org/T268191)
[23:35:52] Jdlrobson: cool, i'll sync out the backport
[23:36:26] once merged, that is. then test and go ahead to group2.
[23:37:11] brennen: it looks like we'd be in a worse state by not deploying so definitely want to do this :)
[23:37:22] (CR) Brennen Bearnes: [C: +2] Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638) (owner: Brennen Bearnes)
[23:37:53] Jdlrobson: heh, yeah. this is sort of the inverse of the typical train blocker.
[23:37:53] https://logstash.wikimedia.org/goto/dbb3c95c431a5d301fd6f2cc32cd8fe0 not looking healthy
[23:38:21] usally the top error is 1000 in 12hrs :/
[23:38:32] oof, yeah.
[23:39:08] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:23] (PS13) Jbond: utils::audit: add puppet audit script [puppet] - https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559)
[23:43:18] Jdlrobson: https://integration.wikimedia.org/ci/job/mwgate-node10-docker/200571/console
[23:43:30] ahghh
[23:43:42] my sentiments exactly
[23:44:12] linting issue fixed
[23:44:15] new patch up
[23:45:31] (CR) RLazarus: [C: +1] mediawiki: Port nrpe_check_opcache to Python [puppet] - https://gerrit.wikimedia.org/r/657677 (owner: Legoktm)
[23:46:10] (CR) Brennen Bearnes: Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638) (owner: Brennen Bearnes)
[23:47:40] (PS2) Brennen Bearnes: Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638)
[23:48:28] (CR) Brennen Bearnes: [C: +2] Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638) (owner: Brennen Bearnes)
[23:49:21] hrm. is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/657652 going to need a recheck? i don't think i've actually ever gotten myself into this situation with gerrit before.
[23:50:13] brennen: think it should be fine
[23:50:32] SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)
[23:50:55] ah, yeah, there we go. started gate-and-submit again.
[23:51:14] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] ` Of...
[23:53:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2374.codfw.wmnet with reason: REIMAGE
[23:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2372.codfw.wmnet with reason: REIMAGE
[23:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2370.codfw.wmnet with reason: REIMAGE
[23:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2374.codfw.wmnet with reason: REIMAGE
[23:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:13] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2370.codfw.wmnet with reason: REIMAGE
[23:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:08] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] ` Of...
[23:57:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2372.codfw.wmnet with reason: REIMAGE
[23:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:55] jouncebot now
[23:58:55] No deployments scheduled for the next 0 hour(s) and 1 minute(s)
[23:59:02] jouncebot next
[23:59:02] In 0 hour(s) and 0 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210122T0000)