[00:02:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:05:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:15:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:15:59] (PS1) Scott French: mw-debug: Pilot new drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1249437 (https://phabricator.wikimedia.org/T364245) [00:16:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:21:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:27:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:27:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:29:44] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host 
contint1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:32:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:33:36] (CR) Scott French: [V:+2] "Built and tested locally against envoy testbed." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249428 (https://phabricator.wikimedia.org/T364245) (owner: Scott French) [00:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:37:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:38:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:39:02] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
https://gerrit.wikimedia.org/r/1249460 [00:39:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1249460 (owner: TrainBranchBot) [00:39:10] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:42:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:42:23] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host contint1003.wikimedia.org with OS trixie [00:43:02] ops-eqiad, SRE, collaboration-services, Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11690585 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host contin... [00:43:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:47:05] The db connection for Phabricator seems to be flapping quite a bit in the last 1-2 hours. 
T419494 [00:47:05] T419494: Phabricator database connection flapping - https://phabricator.wikimedia.org/T419494 [00:47:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:47:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:49:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:51:39] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1249460 (owner: TrainBranchBot) [00:58:55] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:03:51] ops-eqiad, SRE, collaboration-services, Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11690643 (VRiley-WMF) @Dzahn It seems like it's getting stuck. Does this need a specific raid setup? 
[01:03:55] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:07:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:08:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1249475 [01:08:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1249475 (owner: TrainBranchBot) [01:12:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:17:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:18:39] (CR) Ottomata: [C:+1] "ty! 
Javier or I can deploy this tomorrow" [mediawiki-config] - https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) (owner: AKhatun) [01:19:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:23:13] (CR) Zabe: [C:+2] "retry" [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1249460 (owner: TrainBranchBot) [01:26:23] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1249475 (owner: TrainBranchBot) [01:27:57] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1249460 (owner: TrainBranchBot) [01:29:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:30:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:31:07] vriley@cumin1003 reimage (PID 2422614) is awaiting input [01:34:02] (CR) RLazarus: [C:+1] mw-debug: Pilot new 
drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1249437 (https://phabricator.wikimedia.org/T364245) (owner: Scott French) [01:34:29] (CR) RLazarus: [C:+1] envoy: Decouple graceful drain from drain strategy [docker-images/production-images] - https://gerrit.wikimedia.org/r/1249428 (https://phabricator.wikimedia.org/T364245) (owner: Scott French) [01:36:46] SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 2 others: Deleted files can remain on swift due to race conditions - https://phabricator.wikimedia.org/T168002#11690676 (BPirkle) [01:37:01] !log [WDQS] T410573 repooled wdqs1011.eqiad.wmnet - erroneously depooled since `2025-11-19` by failed `sre.wdqs.reboot` cookbook [01:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:05] T410573: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573 [01:40:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:42:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:47:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:49:33] FIRING: KubernetesCalicoDown: dse-k8s-worker1028.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1028.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:50:08] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for EMcFarland - https://phabricator.wikimedia.org/T419145#11690692 (andrea.denisse) Open→In progress a:andrea.denisse [01:51:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:58:05] (CR) Scott French: "Thanks, Blake!" 
[cookbooks] - https://gerrit.wikimedia.org/r/1249322 (https://phabricator.wikimedia.org/T418133) (owner: Blake) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T0200) [02:00:51] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:01:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:03:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:04:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:04:51] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:05:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [02:05:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:07:29] (PS1) Voidwalker: ircecho: force irc.client to strip message tags [puppet] - https://gerrit.wikimedia.org/r/1249506 (https://phabricator.wikimedia.org/T419190) [02:08:48] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.19 [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1249511 (https://phabricator.wikimedia.org/T413810) [02:08:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.19 [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1249511 (https://phabricator.wikimedia.org/T413810) (owner: TrainBranchBot) [02:08:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:08:51] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:08:55] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:01] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 10s) [02:10:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:10:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [02:13:42] (CR) CI reject: [V:-1] Branch commit for wmf/1.46.0-wmf.19 [core] (wmf/1.46.0-wmf.19) - https://gerrit.wikimedia.org/r/1249511 (https://phabricator.wikimedia.org/T413810) (owner: TrainBranchBot) [02:13:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:13:51] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:14:15] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:15:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:15:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:17:15] (PS1) Scott French: aptrepo: add pcre2 updates for component/php83-icu72 [puppet] - https://gerrit.wikimedia.org/r/1249522 (https://phabricator.wikimedia.org/T419058) [02:18:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:19:51] 
PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:20:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:20:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:23:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:23:51] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:25:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:25:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:25:51] RECOVERY 
- PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:29:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:29:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:29:51] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:30:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:30:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:33:55] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:33:55] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:51] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:34:51] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:35:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:35:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:38:53] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:38:55] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:39:53] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:40:53] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:40:53] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:47:17] ops-magru, SRE, Infrastructure-Foundations, netops: cr2-magru <-> asw1-b3-magru link down March 2026 - 
https://phabricator.wikimedia.org/T418978#11690710 (Papaul) Looks like changing the module on the switch side fixed the issue. ` sw1-b3-magru> show interfaces et-0/0/50 descriptions Interface... [02:49:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:50:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T0300) [03:00:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:01:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 
- https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:06:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:09:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:14:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:17:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:19:15] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:22:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:27:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:37:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:42:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:43:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:38] 
FIRING: GnmiTargetDown: cr2-eqdfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [03:50:41] FIRING: SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:56:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T0400) [04:01:50] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.16 (duration: 01m 48s) [04:08:56] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [04:09:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo [04:34:14] FIRING: CertAlmostExpired: Certificate for service 
lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:36:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:40:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:45:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:47:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:49:15] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:54:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:59:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:03:55] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:13:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:18:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:48:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:49:33] FIRING: KubernetesCalicoDown: dse-k8s-worker1028.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1028.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:54:31] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11690822 (10cmooney) Showing as down right now both sides, lane 3 RX still poor on cr2-magru: ` cmooney@cr2-magru> show interfaces diagnostics optics et-0... 
[05:58:44] ops-esams, ops-magru, SRE, DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11690825 (wiki_willy) [05:58:50] (PS3) Ryan Kemper: profile::pyrra: rework wdqs availability SLOs [puppet] - https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) (owner: Elukey) [05:58:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T0600) [06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T0600). 
[06:00:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:03:55] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:08:55] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:11:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:14:15] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:14:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2009.codfw.wmnet with OS bookworm [06:15:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:17:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:24:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.01% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:27:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:30:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:50:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:58:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! 
UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T0700) [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:45] (CR) Ryan Kemper: profile::pyrra: rework wdqs availability SLOs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) (owner: Elukey) [07:02:58] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) (owner: Elukey) [07:03:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:07:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:17:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:19:15] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:19:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:27:10] (03CR) 10Ryan Kemper: [C:03+1] "ready for merge whenever" [puppet] - 10https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [07:27:37] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new, 07Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11690876 (10ecarg) Thanks a lot for these @elukey~ I’m not entirely sure what the MediaWiki-side... 
[07:34:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:37:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:38:37] (03CR) 10JavierMonton: [C:03+1] stream: mediawiki.page_edit_type_simple.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [07:38:57] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11690882 (10wiki_willy) I was reading the notification for DRMRS a bit more closely, and it looks like March 31 is the due date for... [07:41:30] !log filippo@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 19 hosts with reason: switch down tests [07:41:56] (03CR) 10Ryan Kemper: [C:03+1] "forgot to resolve a comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [07:42:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:43:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:45:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:46:38] FIRING: GnmiTargetDown: cr2-eqdfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [07:49:30] !log prep cloudsw reboot tests 'ceph osd set noout' - T417393 [07:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:34] T417393: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393 [07:50:41] FIRING: SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:55:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:05:04] (03PS4) 10KartikMistry: machinetranslation: Optimize model loading and memory footprints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058) [08:05:14] !log start disabling cloudcephosd interfaces - T417393 [08:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:34] T417393: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393 [08:10:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:15:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:16:28] (03CR) 10DCausse: [C:03+1] semanticsearch: Increase heap by 1gb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249382 (https://phabricator.wikimedia.org/T414623) (owner: 10Ebernhardson) [08:16:51] FIRING: 
SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:18:28] !log disabled interfaces for cloudcephosd1016 cloudcephosd1017 cloudcephosd1016 cloudcephosd1018 cloudcephosd1017 cloudcephosd1035 - T417393 [08:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:32] T417393: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393 [08:21:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:22:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:22:56] !log disabled interfaces for cloudcephosd1021 cloudcephosd1042 cloudcephosd1043 cloudcephosd1018 cloudcephosd1022 - T417393 [08:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:41] RESOLVED: SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:28:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:30:08] !log disabled interface for cloudcephmon1004 - T417393 [08:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:12] T417393: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393 [08:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:37:24] (03CR) 10Elukey: [C:03+2] profile::kafka::broker: replace Confluent 3.5 with 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [15:39:48] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:42:38] (03PS1) 10Andrew Bogott: codfw1dev cas/idp: use internal/private addresses for ldap access [puppet] - 
https://gerrit.wikimedia.org/r/1249986 [15:42:49] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1249986 (owner: Andrew Bogott) [15:43:41] ops-eqiad, SRE, DC-Ops, Recommendation-API, Patch-For-Review: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11692685 (elukey) @Jclark-ctr provisioning is done for both hosts, I filed a patch to upgrade preseed and I'll try to install trixie after that :) [15:43:52] (CR) Andrew Bogott: [C:+1] cr-cloud-hosts: Allow return traffic from LDAP directory [homer/public] - https://gerrit.wikimedia.org/r/1249985 (https://phabricator.wikimedia.org/T419558) (owner: Majavah) [15:45:15] (Merged) jenkins-bot: dse-k8s-eqiad: provision the airflow-fr-tech namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1249218 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol) [15:45:18] (Merged) jenkins-bot: dse-k8s-eqiad: add the airflow-fr-tech ns to the ceph tenant list [deployment-charts] - https://gerrit.wikimedia.org/r/1249219 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol) [15:46:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[15:46:47] (PS2) Blake: switchdc: update set-readonly comment [cookbooks] - https://gerrit.wikimedia.org/r/1249322 (https://phabricator.wikimedia.org/T418133) [15:46:51] (PS2) Andrew Bogott: codfw1dev cas/idp: use internal/private addresses for ldap access [puppet] - https://gerrit.wikimedia.org/r/1249986 (https://phabricator.wikimedia.org/T419558) [15:47:52] (CR) Blake: switchdc: update set-readonly comment (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1249322 (https://phabricator.wikimedia.org/T418133) (owner: Blake) [15:47:56] (CR) Andrew Bogott: [C:+2] toolforge etcdctl: remove get_cluster_health and associated rigging [software/spicerack] - https://gerrit.wikimedia.org/r/1248842 (owner: Andrew Bogott) [15:48:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:49:37] (CR) Brouberol: [C:+2] dse-k8s-eqiad: provision the airflow-fr-tech instance [deployment-charts] - https://gerrit.wikimedia.org/r/1249222 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol) [15:51:49] jouncebot: nowandnext [15:51:49] For the next 0 hour(s) and 8 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T1500) [15:51:49] In 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T1600) [15:52:13] FYI, we're going to be deploying a pending change to mw-cron [15:53:13] (CR) Ayounsi: [C:+1] cr-cloud-hosts: Allow return traffic from LDAP directory [homer/public] - https://gerrit.wikimedia.org/r/1249985 (https://phabricator.wikimedia.org/T419558) (owner: Majavah) [15:53:25] (CR) Majavah: [C:+2] cr-cloud-hosts: Allow return traffic from LDAP directory [homer/public] - https://gerrit.wikimedia.org/r/1249985 (https://phabricator.wikimedia.org/T419558) (owner: Majavah) [15:54:13] (Merged) jenkins-bot: toolforge etcdctl: remove 
get_cluster_health and associated rigging [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248842 (owner: 10Andrew Bogott) [15:54:47] (03Merged) 10jenkins-bot: cr-cloud-hosts: Allow return traffic from LDAP directory [homer/public] - 10https://gerrit.wikimedia.org/r/1249985 (https://phabricator.wikimedia.org/T419558) (owner: 10Majavah) [15:55:32] !log jynus@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [15:55:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-fr-tech: apply [15:55:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-fr-tech: apply [15:57:24] (03PS2) 10Elukey: installserver: update preseed config for ml-serve101[4,5] [puppet] - 10https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) [15:58:28] (03CR) 10Ottomata: "Let's merge https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-primary/-/merge_requests/36#e6b243bfcf4ce6bc9dcd53ac37fc2b99" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249959 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [15:58:32] (03CR) 10Majavah: [C:03+1] codfw1dev cas/idp: use internal/private addresses for ldap access [puppet] - 10https://gerrit.wikimedia.org/r/1249986 (https://phabricator.wikimedia.org/T419558) (owner: 10Andrew Bogott) [15:59:21] !log update cr firewall policy for codfw1dev ldap tree https://gerrit.wikimedia.org/r/1249985 [15:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:53] !log jynus@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [16:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:39] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev cas/idp: use internal/private addresses for ldap access [puppet] - 10https://gerrit.wikimedia.org/r/1249986 (https://phabricator.wikimedia.org/T419558) (owner: 10Andrew Bogott) [16:02:25] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11692959 (10Dzahn) As far as I can tell the size of disks has never mattered for the partman recipe. It just require... [16:08:55] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:50] (03PS1) 10Reedy: Revert "CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249993 [16:15:59] (03PS2) 10Reedy: Revert "CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249993 [16:16:03] (03CR) 10Reedy: [C:03+2] Revert "CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249993 (owner: 10Reedy) [16:16:28] (03CR) 10Herron: [C:03+1] alertmanager/o11y: add route to handle alerts with severity=task [puppet] - 10https://gerrit.wikimedia.org/r/1249349 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [16:17:12] (03Merged) 10jenkins-bot: Revert "CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249993 (owner: 10Reedy) [16:17:50] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1249993|Revert "CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled"]] [16:19:38] !log reedy@deploy2002 reedy: Backport for [[gerrit:1249993|Revert 
"CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:19:41] (03CR) 10Klausman: installserver: update preseed config for ml-serve101[4,5] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [16:20:28] (03CR) 10Klausman: installserver: update preseed config for ml-serve101[4,5] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [16:21:38] !log reedy@deploy2002 reedy: Continuing with sync [16:21:43] (03PS1) 10Majavah: P:openldap_clouddev: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1249995 [16:22:26] (03PS1) 10Effie Mouzeli: Add Chart.yaml metadata for ServiceOps Bug: T412693 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249996 (https://phabricator.wikimedia.org/T412693) [16:22:31] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8247/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249995 (owner: 10Majavah) [16:22:56] (03CR) 10Majavah: P:openldap_clouddev: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1249995 (owner: 10Majavah) [16:24:16] (03CR) 10Effie Mouzeli: "I will have a look if we need those tests and get back to you" [puppet] - 10https://gerrit.wikimedia.org/r/1248768 (owner: 10Muehlenhoff) [16:24:16] (03CR) 10Scott French: [V:03+2 C:03+2] envoy: Decouple graceful drain from drain strategy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249428 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [16:24:36] (03PS2) 10Effie Mouzeli: Add Chart.yaml metadata for ServiceOps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249996 (https://phabricator.wikimedia.org/T412693) 
[16:24:46] (03PS3) 10Effie Mouzeli: Add Chart.yaml metadata for ServiceOps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249996 (https://phabricator.wikimedia.org/T412693) [16:25:15] (03PS4) 10Effie Mouzeli: ipoid: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249249 (https://phabricator.wikimedia.org/T412693) [16:25:23] (03PS4) 10Effie Mouzeli: eventgate, eventstreams: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249241 (https://phabricator.wikimedia.org/T412693) [16:25:35] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249993|Revert "CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled"]] (duration: 07m 45s) [16:26:57] (03CR) 10Effie Mouzeli: [C:03+2] apertium: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249961 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [16:27:56] (03PS1) 10Andrew Bogott: Add .wikimediacloud.org cnames for codfw1dev ldap servers [dns] - 10https://gerrit.wikimedia.org/r/1249998 (https://phabricator.wikimedia.org/T419558) [16:28:26] (03PS2) 10Andrew Bogott: Add .wikimediacloud.org cnames for codfw1dev ldap servers [dns] - 10https://gerrit.wikimedia.org/r/1249998 (https://phabricator.wikimedia.org/T419558) [16:28:55] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:29:06] (03Merged) 10jenkins-bot: apertium: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249961 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [16:29:35] (03PS3) 10Andrew Bogott: Add .wikimediacloud.org cnames for codfw1dev ldap servers [dns] - 
10https://gerrit.wikimedia.org/r/1249998 (https://phabricator.wikimedia.org/T419558) [16:30:10] (03PS2) 10Clément Goubert: trafficserver: Support fractional routing for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) [16:30:14] (03PS9) 10Lerickson: Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) [16:30:49] (03CR) 10Elukey: installserver: update preseed config for ml-serve101[4,5] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [16:33:04] (03PS1) 10Majavah: hieradata: acme_chief: Add new codfw1dev LDAP server names [puppet] - 10https://gerrit.wikimedia.org/r/1249999 (https://phabricator.wikimedia.org/T419558) [16:33:07] (03PS1) 10Majavah: hieradata: idp-clouddev: Use new LDAP service names [puppet] - 10https://gerrit.wikimedia.org/r/1250000 (https://phabricator.wikimedia.org/T419558) [16:33:42] (03PS4) 10Andrew Bogott: Add .wikimediacloud.org cnames for codfw1dev ldap servers [dns] - 10https://gerrit.wikimedia.org/r/1249998 (https://phabricator.wikimedia.org/T419558) [16:33:55] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:35:50] (03CR) 10Gmodena: [C:03+1] "LGTM. Both Lindsay and Sebastian need access." 
[puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [16:36:54] (03PS1) 10Brouberol: airflow: delete failed task pods from kubernetes to avoid overloading the control plane [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250001 [16:36:58] (03CR) 10Majavah: [C:03+1] Add .wikimediacloud.org cnames for codfw1dev ldap servers [dns] - 10https://gerrit.wikimedia.org/r/1249998 (https://phabricator.wikimedia.org/T419558) (owner: 10Andrew Bogott) [16:37:43] (03CR) 10Bking: [C:03+1] airflow: delete failed task pods from kubernetes to avoid overloading the control plane [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250001 (owner: 10Brouberol) [16:38:12] (03CR) 10Andrew Bogott: [C:03+2] Add .wikimediacloud.org cnames for codfw1dev ldap servers [dns] - 10https://gerrit.wikimedia.org/r/1249998 (https://phabricator.wikimedia.org/T419558) (owner: 10Andrew Bogott) [16:38:52] !log andrew@dns1004 START - running authdns-update [16:39:05] (03CR) 10BPirkle: [C:04-2] "Thanks for the change. This was fixed via https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1235070. 
Can we abandon your change (so that i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1234927 (https://phabricator.wikimedia.org/T415877) (owner: 10Jgiannelos) [16:39:22] (03CR) 10Andrew Bogott: [C:03+1] hieradata: acme_chief: Add new codfw1dev LDAP server names [puppet] - 10https://gerrit.wikimedia.org/r/1249999 (https://phabricator.wikimedia.org/T419558) (owner: 10Majavah) [16:39:38] (03CR) 10Andrew Bogott: [C:03+1] hieradata: idp-clouddev: Use new LDAP service names [puppet] - 10https://gerrit.wikimedia.org/r/1250000 (https://phabricator.wikimedia.org/T419558) (owner: 10Majavah) [16:39:40] (03CR) 10Majavah: [C:03+2] hieradata: acme_chief: Add new codfw1dev LDAP server names [puppet] - 10https://gerrit.wikimedia.org/r/1249999 (https://phabricator.wikimedia.org/T419558) (owner: 10Majavah) [16:40:00] (03PS2) 10Brouberol: airflow: delete failed task pods from kubernetes to avoid overloading the control plane [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250001 (https://phabricator.wikimedia.org/T419457) [16:40:15] !log andrew@dns1004 END - running authdns-update [16:41:17] (03CR) 10Andrew Bogott: [C:03+2] hieradata: idp-clouddev: Use new LDAP service names [puppet] - 10https://gerrit.wikimedia.org/r/1250000 (https://phabricator.wikimedia.org/T419558) (owner: 10Majavah) [16:42:58] (03PS1) 10Majavah: hieradata: Unset LDAP config from cloudweb2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1250004 [16:43:08] (03PS2) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249959 (https://phabricator.wikimedia.org/T419258) [16:43:55] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:44:23] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS 
(CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8248/co" [puppet] - 10https://gerrit.wikimedia.org/r/1250004 (owner: 10Majavah) [16:44:49] (03Abandoned) 10Majavah: hieradata: Unset LDAP config from cloudweb2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1250004 (owner: 10Majavah) [16:44:49] (03PS3) 10JavierMonton: stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249217 (https://phabricator.wikimedia.org/T419258) [16:45:05] (03CR) 10Joal: [C:03+1] airflow: delete failed task pods from kubernetes to avoid overloading the control plane [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250001 (https://phabricator.wikimedia.org/T419457) (owner: 10Brouberol) [16:45:40] (03PS3) 10Ayounsi: Add more depool strategies for rack depool cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1249958 (https://phabricator.wikimedia.org/T327300) [16:47:33] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11693404 (10RobH) > Support, > > You have swapped the optic on the router side, and the MPO patch cable. The link is still down, so we'd like you to swa... [16:48:49] (03PS3) 10Brouberol: airflow: delete failed task pods from kubernetes to avoid overloading the control plane [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250001 (https://phabricator.wikimedia.org/T419457) [16:53:06] (03CR) 10Brouberol: [C:03+2] airflow: delete failed task pods from kubernetes to avoid overloading the control plane [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250001 (https://phabricator.wikimedia.org/T419457) (owner: 10Brouberol) [16:54:48] (03PS1) 10C. 
Scott Ananian: Enables legacy processing in ParserOutputPostCacheTransform when cached [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250007 (https://phabricator.wikimedia.org/T372592) [16:57:57] (03CR) 10Vgutierrez: trafficserver: Support fractional routing for api.w.o (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [16:58:08] (03CR) 10Hashar: "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1250000 (https://phabricator.wikimedia.org/T419558) (owner: 10Majavah) [16:59:17] 06SRE, 10SRE-swift-storage: ms swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419577 (10MatthewVernon) 03NEW [16:59:35] (03CR) 10CI reject: [V:04-1] Enables legacy processing in ParserOutputPostCacheTransform when cached [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250007 (https://phabricator.wikimedia.org/T372592) (owner: 10C. Scott Ananian) [16:59:44] 06SRE, 10SRE-swift-storage: ms swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419577#11693500 (10MatthewVernon) [17:00:04] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T1700). [17:00:19] o/ I'll get started in a few minutes [17:01:28] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host contint1003.wikimedia.org with OS trixie [17:01:35] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11693523 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host contin... 
[17:05:21] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249959 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [17:05:44] (03CR) 10Ottomata: [C:03+1] stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249217 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [17:08:42] getting started with the infra window now [17:09:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [17:09:24] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [17:11:10] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [17:11:26] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [17:12:43] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [17:12:51] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [17:17:01] (03PS4) 10Effie Mouzeli: Add Chart.yaml metadata for ServiceOps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249996 (https://phabricator.wikimedia.org/T412693) [17:17:52] (03PS1) 10Herron: Revert "mwlog: copy archives to trixie hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1250011 [17:18:29] (03CR) 10Scott French: [C:03+2] mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249437 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:20:09] (03CR) 10Herron: [C:03+2] Revert "mwlog: copy archives to trixie hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1250011 (owner: 10Herron) [17:20:32] (03Merged) 10jenkins-bot: mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249437 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:21:56] !log swfrench@deploy2002 helmfile [codfw] START 
helmfile.d/services/mw-debug: apply [17:22:26] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:22:33] (03PS1) 10C. Scott Ananian: Parser: Raise minimum TTL from 30 min to 'next midnight' in miser mode [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250013 (https://phabricator.wikimedia.org/T416616) [17:22:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250013 (https://phabricator.wikimedia.org/T416616) (owner: 10C. Scott Ananian) [17:23:17] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1250014 [17:23:30] brennen: I guess T419497 is only a train blocker in so far as we can't merge any patches into OATHAuth [17:23:31] T419497: Insecure web-auth/webauthn-lib version blocks CI - https://phabricator.wikimedia.org/T419497 [17:23:39] vriley@cumin1003 reimage (PID 2719772) is awaiting input [17:23:56] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host contint1003.wikimedia.org with OS trixie [17:23:58] And as it's the same on .18 and .19... it doesn't really block the train [17:24:18] Also the train bot didn't make any release notes — did it run at all? [17:24:23] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11693656 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host contint100... 
[17:24:49] It blocks merging the branching patch [17:24:54] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1249511 [17:24:54] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host contint1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:25:05] oh [17:25:15] I presumed .19 was already branched [17:25:16] Ah, and ReleaseNotesBot is downstream of that merger. [17:25:22] Branched but not landed. [17:25:23] timing is great [17:25:24] Technically. [17:25:37] So yes, please fix Right Now™. ;-) [17:25:55] (03PS2) 10C. Scott Ananian: Enables legacy processing in ParserOutputPostCacheTransform when cached [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250007 (https://phabricator.wikimedia.org/T372592) [17:25:56] (03PS3) 10Elukey: installserver: update preseed config for ml-serve101[4,5] [puppet] - 10https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) [17:26:01] (03CR) 10Elukey: installserver: update preseed config for ml-serve101[4,5] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [17:26:04] (03CR) 10Reedy: [C:03+2] Upgrading web-auth/webauthn-lib (5.2.3 => 5.2.4) [vendor] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249922 (https://phabricator.wikimedia.org/T419497) (owner: 10Mszwarc) [17:26:13] If we land the wmf.19 patches will that work for the branch commits to land? [17:26:45] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:26:50] (03CR) 10Reedy: [C:03+2] "Not really necessary, but won't hurt either" [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249923 (https://phabricator.wikimedia.org/T419497) (owner: 10Mszwarc) [17:26:57] (03Abandoned) 10C. 
Scott Ananian: Parser: Raise minimum TTL from 30 min to 'next midnight' in miser mode [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1250013 (https://phabricator.wikimedia.org/T416616) (owner: 10C. Scott Ananian) [17:27:30] (03CR) 10Reedy: [C:03+2] Upgrading web-auth/webauthn-lib (5.2.3 => 5.2.4) [vendor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249934 (https://phabricator.wikimedia.org/T419497) (owner: 10Zabe) [17:27:48] (03Merged) 10jenkins-bot: Upgrading web-auth/webauthn-lib (5.2.3 => 5.2.4) [vendor] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249922 (https://phabricator.wikimedia.org/T419497) (owner: 10Mszwarc) [17:28:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250007 (https://phabricator.wikimedia.org/T372592) (owner: 10C. Scott Ananian) [17:28:16] (03Merged) 10jenkins-bot: Bump required web-auth/webauthn-lib to at least 5.2.4 [extensions/OATHAuth] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249923 (https://phabricator.wikimedia.org/T419497) (owner: 10Mszwarc) [17:28:28] (03CR) 10AOkoth: [C:03+1] mailman: move mailman-web behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1249310 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [17:28:41] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [17:28:47] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host contint1003.wikimedia.org with OS trixie [17:28:50] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [17:28:54] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11693692 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host contin... [17:29:39] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [17:29:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [17:29:49] James_F: I guess https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1249511/1/vendor is pointing to the wrong (old) commit now... [17:30:08] Get the branch bot to re-make it? [17:30:10] Reedy: Yeah, might need to manually bump. [17:30:23] Or re-trigger the wmf.19 branch process entirely, but that feels risky. [17:30:36] People might have merged code in the last 12 hours assuming all was well for a week's testing. [17:30:49] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [17:30:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS trixie [17:30:59] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11693716 (10taavi) Note that outbound mail from the lists server currently uses `lists.wikimedia.org` in the `HELO`/`EHLO` command, so... [17:31:06] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [17:32:19] (03PS1) 10C. 
Scott Ananian: Parser: Raise minimum TTL from 30 min to 'next midnight' in miser mode [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250015 (https://phabricator.wikimedia.org/T416616) [17:32:27] (03PS2) 10Reedy: Branch commit for wmf/1.46.0-wmf.19 [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249511 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [17:32:33] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [17:32:41] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [17:33:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250015 (https://phabricator.wikimedia.org/T416616) (owner: 10C. Scott Ananian) [17:34:09] James_F, Reedy: was just trying to think through this after realizing it failed [17:34:21] brennen: Yeah, it's an unusual situation. [17:34:23] (which seems like a problem unto itself, usually there's a notification) [17:34:39] I think (a) land the back-ports, (b) land the branch, (c) manually rev the /vendor submodule. [17:34:57] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1249511/1..2 [17:35:00] wait and see what CI says [17:35:12] Or that. [17:35:33] Getting ReleaseNotesBot to run should be doable. 
[17:37:48] (03PS3) 10Clément Goubert: trafficserver: Support fractional routing for api.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) [17:42:12] (03CR) 10CI reject: [V:04-1] Upgrading web-auth/webauthn-lib (5.2.3 => 5.2.4) [vendor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249934 (https://phabricator.wikimedia.org/T419497) (owner: 10Zabe) [17:42:45] (03CR) 10Reedy: Upgrading web-auth/webauthn-lib (5.2.3 => 5.2.4) [vendor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249934 (https://phabricator.wikimedia.org/T419497) (owner: 10Zabe) [17:42:48] (03CR) 10Reedy: [C:03+2] Upgrading web-auth/webauthn-lib (5.2.3 => 5.2.4) [vendor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249934 (https://phabricator.wikimedia.org/T419497) (owner: 10Zabe) [17:43:15] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:43:40] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:45:18] (03Merged) 10jenkins-bot: Upgrading web-auth/webauthn-lib (5.2.3 => 5.2.4) [vendor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249934 (https://phabricator.wikimedia.org/T419497) (owner: 10Zabe) [17:46:50] (03CR) 10Vgutierrez: trafficserver: Support fractional routing for api.w.o (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [17:49:05] (03CR) 10Reedy: [C:03+2] Branch commit for wmf/1.46.0-wmf.19 [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249511 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [17:51:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238733 (https://phabricator.wikimedia.org/T397402) (owner: 10Jforrester) [17:51:24] 
(03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238734 (https://phabricator.wikimedia.org/T397403) (owner: 10Jforrester) [17:52:23] (03CR) 10Jforrester: [C:04-2] "Let's not have this auto-land unattended." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [17:52:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249393 (https://phabricator.wikimedia.org/T419476) (owner: 10Jforrester) [17:52:55] (03PS2) 10Jforrester: build: Upgrade mediawiki-codesniffer from 49.0.0 to 50.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249394 [17:53:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249394 (owner: 10Jforrester) [17:53:10] (03PS2) 10Jforrester: build: Upgrade symfony/yaml from 7.4.0 to 7.4.6 and alpha-sort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249395 [17:53:11] vriley@cumin1003 reimage (PID 2723006) is awaiting input [17:53:12] (03PS1) 10Scott French: Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250019 (https://phabricator.wikimedia.org/T364245) [17:53:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249395 (owner: 10Jforrester) [17:54:12] (03CR) 10Clément Goubert: trafficserver: Support 
fractional routing for api.w.o (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1245389 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [17:54:15] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.19 [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249511 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [17:54:23] !log hashar@deploy2002 Started deploy [integration/docroot@f544f49]: Catch up with composer/npm dev dependencies. Noop for production [17:54:35] !log hashar@deploy2002 Finished deploy [integration/docroot@f544f49]: Catch up with composer/npm dev dependencies. Noop for production (duration: 00m 11s) [17:54:39] brennen: ^ Does something need manually poking to make scap deploy it etc? [17:55:49] (03CR) 10Scott French: [C:03+2] Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250019 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:55:55] Reedy: at this point i _think_ i could just move it to testwikis, but thinking about that for a second. [17:56:05] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7015.magru.wmnet with reason: host reimage [17:56:23] It will still need staging etc... Presumably manually? Unless the script(s) have a continue? 
[17:56:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EMcFarland - https://phabricator.wikimedia.org/T419145#11693912 (10andrea.denisse) [17:56:53] (03CR) 10BCornwall: [C:03+1] prometheus/ulsfo: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1249915 (https://phabricator.wikimedia.org/T419430) (owner: 10Tiziano Fogli) [17:57:23] (03CR) 10BCornwall: [C:03+1] wmnet: add wikikube-ctrl2006 to etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1249423 (https://phabricator.wikimedia.org/T406596) (owner: 10Jasmine) [17:58:02] (03Merged) 10jenkins-bot: Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250019 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:59:00] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:59:19] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:59:32] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7015.magru.wmnet with reason: host reimage [17:59:39] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:00:00] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:00:05] brennen and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T1800). [18:01:16] Reedy: hmm, i've updated /srv/mediawiki-staging manually. i think i can go to testwikis from here. i'm not sure about other odds and ends (release notes?). [18:01:46] oh, it's staged on there... [18:01:57] the merged patches do need deploying I guess [18:02:40] Yes. 
[18:02:46] brennen: Security patches aren't there either [18:03:50] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for EMcFarland - https://phabricator.wikimedia.org/T419145#11694028 (10andrea.denisse) [18:04:51] 18:04:41 Checking whether requested changes are valid for backport... [18:04:51] Change '1249923', project: 'mediawiki/extensions/OATHAuth', branch: 'wmf/1.46.0-wmf.19' is not deployable to production [18:04:52] heh [18:06:08] Reedy: I'm around. What are we trying to do? [18:06:11] i'm wondering if we just need to start over from scratch here. [18:06:45] removing php-1.46.0-wmf.19 and getting scap to stage it again would probably be easier [18:06:53] Yeah. :-( [18:07:22] which is what... scap prep? [18:07:48] scap prep 1.46.0-wmf.19 [18:10:13] looks like that handled it quite nicely :) [18:11:29] brennen: Ignoring the release notes... It's probably safe to move the testwikis over... [18:11:37] and looks like patches are present... yeah, ok, i'll do that. 
[18:12:08] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250025 (https://phabricator.wikimedia.org/T413810) [18:12:10] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250025 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [18:12:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [18:13:08] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250025 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [18:13:24] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp7003.magru.wmnet [reason: trixie reimaging] [18:13:31] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp7003.magru.wmnet [reason: trixie reimaging] [18:13:39] !log brennen@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.19 refs T413810 [18:13:42] T413810: 1.46.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T413810 [18:14:15] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:14:33] after which i think i can backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1249937 and roll to group0. 
[18:16:03] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7003.magru.wmnet with OS trixie [18:16:59] (03PS2) 10Bking: WIP: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) [18:17:05] dancy: Is there a command to get scap to specifically post the deploy notes to mediawiki.org too? [18:17:50] It's a separate jenkins-releases job, isn't it? [18:17:53] https://www.mediawiki.org/w/index.php?title=MediaWiki_1.46/wmf.19/Changelog&action=history [18:17:57] (03CR) 10CI reject: [V:04-1] WIP: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [18:18:00] looks like it's been done anyway [18:18:01] so nvm [18:18:05] Aha. [18:18:11] oh good because I didn't know the answer [18:18:14] :D [18:18:15] <3 [18:18:27] I'm guessing the merging of the branch commit did something magic somewhere in the background? [18:18:41] ah yes [18:18:41] Post-merge build succeeded. [18:18:41] https://integration.wikimedia.org/ci/job/train-deploy-notes/4489/console : SUCCESS in 31s (non-voting) [18:18:45] excellent [18:18:45] ah ha. [18:18:49] Magic! [18:18:50] back in my day... [18:19:02] thank you all for the assistance during a particularly mush-brained tuesday. [18:19:55] It's of course missing the new patches. [18:20:10] Oh, no, it has them. Magic. [18:21:35] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11694127 (10RobH) The order is placed and I'm currently scheduling the Unisys/Dell engineer to go onsite sometime between Friday-Wednesday of this/next week. Host is hard down, so no traffic interven... 
[18:21:52] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp7004.magru.wmnet [reason: trixie reimaging] [18:23:03] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7004.magru.wmnet with OS trixie [18:24:37] (03PS3) 10Bking: WIP: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) [18:25:36] (03CR) 10CI reject: [V:04-1] WIP: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [18:26:52] (03CR) 10Klausman: [C:03+1] installserver: update preseed config for ml-serve101[4,5] [puppet] - 10https://gerrit.wikimedia.org/r/1249984 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [18:27:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7015.magru.wmnet with OS trixie [18:31:14] (03CR) 10Jforrester: "This should now be good to deploy to production." [extensions/Translate] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249937 (https://phabricator.wikimedia.org/T419294) (owner: 10Abijeet Patro) [18:33:17] (03CR) 10JHathaway: sre.hosts.provision: allow no-pxe settings for NIC on Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [18:38:23] (03CR) 10Brennen Bearnes: [C:03+2] Re-add correct namespace for translatable pages [extensions/Translate] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249937 (https://phabricator.wikimedia.org/T419294) (owner: 10Abijeet Patro) [18:40:30] (03CR) 10Brennen Bearnes: [C:03+2] "Getting this started through CI while the train finishes up getting to testwikis." 
[extensions/Translate] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249937 (https://phabricator.wikimedia.org/T419294) (owner: 10Abijeet Patro) [18:41:54] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7015.* [18:44:30] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [18:44:32] (03CR) 10Jforrester: [C:03+1] "This looks reasonable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048) (owner: 10Pppery) [18:47:43] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [18:49:13] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7004.magru.wmnet with reason: host reimage [18:50:41] (03Merged) 10jenkins-bot: Re-add correct namespace for translatable pages [extensions/Translate] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1249937 (https://phabricator.wikimedia.org/T419294) (owner: 10Abijeet Patro) [18:52:13] !log brennen@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.19 refs T413810 (duration: 38m 34s) [18:52:16] T413810: 1.46.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T413810 [18:52:59] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: host reimage [18:54:59] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1249937|Re-add correct namespace for translatable pages (T419294)]] [18:55:31] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7013.magru.wmnet with OS trixie [18:58:58] !log brennen@deploy2002 abi, brennen: Backport for [[gerrit:1249937|Re-add correct namespace for translatable pages (T419294)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:59:48] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:01:22] !log brennen@deploy2002 abi, brennen: Continuing with sync [19:01:41] (03PS1) 10Jforrester: OrchestratorRequest: Switch evaluations to v2 endpoint [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250051 (https://phabricator.wikimedia.org/T413727) [19:01:46] (03CR) 10Jforrester: [C:04-2] OrchestratorRequest: Switch evaluations to v2 endpoint [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250051 (https://phabricator.wikimedia.org/T413727) (owner: 10Jforrester) [19:05:18] (03CR) 10Aude: [C:03+1] Enable personal main menu to all users in minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [19:06:38] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host contint1003.wikimedia.org with OS trixie [19:06:47] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11694295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host contint100... 
[19:07:29] !log brennen@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249937|Re-add correct namespace for translatable pages (T419294)]] (duration: 12m 30s) [19:09:05] !log 1.46.0-wmf.19 train status: blockers believed resolved, rolling to group0 [19:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:31] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250052 (https://phabricator.wikimedia.org/T413810) [19:09:34] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250052 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [19:10:21] (03PS4) 10Bking: WIP: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) [19:10:28] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250052 (https://phabricator.wikimedia.org/T413810) (owner: 10TrainBranchBot) [19:14:10] (03PS2) 10Bking: WIP: Add new active-active discovery service for dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) [19:16:04] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7003.magru.wmnet with OS trixie [19:16:17] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.19 refs T413810 [19:16:21] T413810: 1.46.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T413810 [19:17:53] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp7003.magru.wmnet [reason: trixie reimaging] [19:18:13] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7013.magru.wmnet with reason: host reimage [19:19:10] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp7005.magru.wmnet [reason: trixie 
reimaging] [19:19:42] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7004.magru.wmnet with OS trixie [19:19:44] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7005.magru.wmnet with OS trixie [19:24:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7013.magru.wmnet with reason: host reimage [19:24:32] (03PS3) 10VolkerE: Enable personal main menu to all users in Minerva Neue skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [19:24:37] (03CR) 10VolkerE: [C:03+1] Enable personal main menu to all users in Minerva Neue skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [19:25:18] (03CR) 10VolkerE: [C:03+1] "https://www.mediawiki.org/wiki/Skin:Minerva_Neue also needs update." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [19:25:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11694392 (10VRiley-WMF) [19:29:49] (03PS5) 10Bking: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) [19:35:24] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11694408 (10RobH) They swapped the optic GT3AAG00314 out of the switch for optic GT3AAG00316 and now the link shows up: router: ` et-0/0/1 up... 
[19:39:42] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp7004.magru.wmnet [reason: trixie reimaging] [19:40:02] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp7006.magru.wmnet [reason: trixie reimaging] [19:40:26] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7006.magru.wmnet with OS trixie [19:42:11] (03PS3) 10Bking: Add new active-active discovery service for dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) [19:42:59] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7005.magru.wmnet with reason: host reimage [19:49:04] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7005.magru.wmnet with reason: host reimage [19:49:14] (03PS1) 10Bking: Add new active/active discovery records for dse-k8s opensearch test ns [dns] - 10https://gerrit.wikimedia.org/r/1250063 (https://phabricator.wikimedia.org/T417698) [19:49:38] (03CR) 10CI reject: [V:04-1] Add new active/active discovery records for dse-k8s opensearch test ns [dns] - 10https://gerrit.wikimedia.org/r/1250063 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [19:50:09] (03PS6) 10Bking: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) [19:50:24] (03PS1) 10Andrea Denisse: admin: Grant access to analytics-privatedata-users to emc-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1250022 (https://phabricator.wikimedia.org/T419145) [19:50:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7013.magru.wmnet with OS trixie [19:51:45] (03CR) 10Andrea Denisse: [C:03+2] admin: Grant access to analytics-privatedata-users to emc-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1250022 (https://phabricator.wikimedia.org/T419145) (owner: 10Andrea Denisse) [19:59:19] I can do the deploys 
today. [19:59:36] im here for the window [19:59:56] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for EMcFarland - https://phabricator.wikimedia.org/T419145#11694450 (10andrea.denisse) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T2000). [20:00:05] danisztls, cscott, James_F, and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] Cool. danisztls isn't here yet. [20:00:20] (03PS1) 10Mszwarc: Send2FAWarningNotifications: Support reading users from file [extensions/WikimediaMessages] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250066 (https://phabricator.wikimedia.org/T419111) [20:00:41] bwang: I'll do yours alongside cscott's, if that's OK? [20:00:50] o/ [20:00:53] Works for me [20:01:26] I can also spiderpig it myself if you want to be lazy... but I'm also happy to be the lazy one [20:01:34] (03PS1) 10Andrea Denisse: admin: Add krb: present for emc-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1250065 (https://phabricator.wikimedia.org/T419145) [20:01:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [20:01:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250007 (https://phabricator.wikimedia.org/T372592) (owner: 10C. 
Scott Ananian) [20:01:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250015 (https://phabricator.wikimedia.org/T416616) (owner: 10C. Scott Ananian) [20:01:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250066 (https://phabricator.wikimedia.org/T419111) (owner: 10Mszwarc) [20:01:55] Yep works w me [20:01:55] Nah, I've got way too many little ones so I'll just drive. [20:02:20] * cscott sits back in the passenger seat [20:02:32] o/ [20:03:02] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7006.magru.wmnet with reason: host reimage [20:03:07] Hey danisztls, I'm doing a group now but then will do yours. [20:03:11] Beautiful weather here today in the Boston area. 70 of the small temperature units, and some pleasant number of the large ones for those who observe [20:03:20] (03Merged) 10jenkins-bot: Enable personal main menu to all users in Minerva Neue skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240012 (https://phabricator.wikimedia.org/T413912) (owner: 10Bernard Wang) [20:03:37] (21ºC for those as can measure properly.) [20:03:48] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7013.* [20:03:55] James_F: ok, thanks! [20:04:03] Still snow on the ground, though. Water stores a lot of energy. [20:04:37] (my job as passenger is to make small talk, right?) [20:04:52] Boston's temperature scale runs from "hot out, drinking iced coffee" all the way down to "very cold out, still drinking iced coffee" [20:04:59] which doesn't convert super well [20:05:19] "hot out, drinking iced coffee while navigating around piles of leftover snow" [20:06:29] We get all the seasons here. 
They're all sponsored by Dunkin Donuts, though. [20:06:30] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11694501 (10RobH) Sent an email to investigate the return/repair of GT3AAG00314 & GT3AAG00321 [20:06:57] I believe that they're just "Dunkin'" nowadays. [20:07:39] We also give all our directions in terms of landmarks and names that haven't been current for a couple of decades [20:07:46] four seasons also known as "The Red Line Is On Fire," "The Green Line Is On Fire," "The Red Line Is On Fire," and "The Red Line Is On Fire And The AC Doesn't Work" [20:07:58] "turn right where the Dunkin donuts used to be" [20:08:06] man, I miss living there [20:08:25] * James_F gives CI a Hard Stare™ to see if it moves faster. [20:08:35] No you'll jinx it man [20:08:45] The secret is not to let it know you're watching [20:09:10] * perryprog stares respectfully at the CI from a distance [20:09:13] I made the mistake of joking about selenium yesterday and one of my patches took half a dozen attempts to get through [20:09:40] The CI wave function collapse is fun to witness. [20:10:01] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7006.magru.wmnet with reason: host reimage [20:10:02] Not selenium's fault, though. The castor-presave-bullshit is in the same union apparently and struck in solidarity [20:10:04] Sadly, I think I may have to hit CI. mediawiki-node20 for your first patch is stuck. [20:10:22] (03CR) 10CI reject: [V:04-1] Enables legacy processing in ParserOutputPostCacheTransform when cached [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250007 (https://phabricator.wikimedia.org/T372592) (owner: 10C. 
Scott Ananian) [20:10:54] (03PS1) 10Bking: Add new active/active discovery records for dse-k8s opensearch prod ns [dns] - 10https://gerrit.wikimedia.org/r/1250068 (https://phabricator.wikimedia.org/T417698) [20:11:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250007 (https://phabricator.wikimedia.org/T372592) (owner: 10C. Scott Ananian) [20:11:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250015 (https://phabricator.wikimedia.org/T416616) (owner: 10C. Scott Ananian) [20:11:02] So it goes [20:13:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#11694520 (10andrea.denisse) >>! In T390734#11687686, @Ben.buchenau wrote: > Hello guys - follow-up request regarding Kerebos authentication: Ca... [20:15:16] Sorry for the slowness, all. [20:16:01] (03PS4) 10Bking: Add new active/active discovery records for dse-k8s opensearch test ns [dns] - 10https://gerrit.wikimedia.org/r/1250063 (https://phabricator.wikimedia.org/T417698) [20:16:23] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7005.magru.wmnet with OS trixie [20:19:15] 10ops-magru, 06SRE, 06DC-Ops: RMA (2) QSFP-100GBASE-SR4 - https://phabricator.wikimedia.org/T419598 (10RobH) 03NEW [20:23:36] (03Merged) 10jenkins-bot: Parser: Raise minimum TTL from 30 min to 'next midnight' in miser mode [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250015 (https://phabricator.wikimedia.org/T416616) (owner: 10C. 
Scott Ananian) [20:23:43] (03Merged) 10jenkins-bot: Enables legacy processing in ParserOutputPostCacheTransform when cached [extensions/DiscussionTools] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1250007 (https://phabricator.wikimedia.org/T372592) (owner: 10C. Scott Ananian) [20:23:46] (03PS1) 10Xcollazo: Disable rsync access for two dead dump mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1250070 (https://phabricator.wikimedia.org/T415193) [20:24:16] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp7005.magru.wmnet [reason: trixie reimaging] [20:24:33] Finally. [20:24:46] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet [reason: trixie reimaging] [20:25:40] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS trixie [20:25:44] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1240012|Enable personal main menu to all users in Minerva Neue skin (T413912)]], [[gerrit:1250007|Enables legacy processing in ParserOutputPostCacheTransform when cached (T372592)]], [[gerrit:1250015|Parser: Raise minimum TTL from 30 min to 'next midnight' in miser mode (T416616 T416540 T419439)]] [20:25:56] T413912: Deployment: Promote advanced user menu to all users - https://phabricator.wikimedia.org/T413912 [20:25:56] T372592: Find a way to replace the use of ParserOutput::addJsConfigVars() in DiscussionTools - https://phabricator.wikimedia.org/T372592 [20:25:57] T416616: Create new cache-friendly lua/parser function for "is today before X date" and "is today after X date" - https://phabricator.wikimedia.org/T416616 [20:25:58] T416540: Mean MediaWiki backend latency increased by 60% between October and December 2025 - https://phabricator.wikimedia.org/T416540 [20:25:58] T419439: Clean up Cache Expiry computation - https://phabricator.wikimedia.org/T419439 [20:27:51] !log jforrester@deploy2002 jforrester, cscott, bwang: Backport for 
[[gerrit:1240012|Enable personal main menu to all users in Minerva Neue skin (T413912)]], [[gerrit:1250007|Enables legacy processing in ParserOutputPostCacheTransform when cached (T372592)]], [[gerrit:1250015|Parser: Raise minimum TTL from 30 min to 'next midnight' in miser mode (T416616 T416540 T419439)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:28:08] Okie-dokie. [20:28:15] cscott: Can you verify? [20:28:23] bwang: Can you verify yours? [20:28:25] ok [20:29:08] (03PS3) 10Scott French: mw-(api-int|web): Pilot drain configuration in canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250067 (https://phabricator.wikimedia.org/T364245) [20:30:33] Ok it looks good [20:31:54] Thanks. [20:31:59] cscott: Am I green to continue? [20:32:35] sorry james, hang on one second more [20:32:39] Sure. [20:33:40] good for me [20:33:45] Ack. [20:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:34:41] James_F: ok, green to continue [20:34:46] !log jforrester@deploy2002 jforrester, cscott, bwang: Continuing with sync [20:34:47] i couldn't get it to let any smoke out, at least [20:34:54] Fingers crossed. [20:35:06] It's not like we're fixing a major production performance concern or any… oh. [20:35:44] (03CR) 10Gergő Tisza: "Uh yeah, sorry. Completely forgot to remove the +2 after it got stuck in CI." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:35:45] it will likely be a couple of days or a week before we see the performance impact, judging from previous experience [20:35:53] so nothing looking different is good so far [20:36:00] tgr_: No worries, just didn't want to accidentally break the world without you. :-) [20:36:04] cscott: Ack. [20:36:12] !log cdobbins@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cdobbins@cumin2002" [20:36:39] (03CR) 10RLazarus: [C:03+1] mw-(api-int|web): Pilot drain configuration in canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250067 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [20:36:42] thanks James_F! I'll deploy it at the end of the window if there's still time [20:38:01] Maybe. Everything left is config changes, which should ideally be faster. [20:38:07] (Famous last words.) 
[20:38:43] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240012|Enable personal main menu to all users in Minerva Neue skin (T413912)]], [[gerrit:1250007|Enables legacy processing in ParserOutputPostCacheTransform when cached (T372592)]], [[gerrit:1250015|Parser: Raise minimum TTL from 30 min to 'next midnight' in miser mode (T416616 T416540 T419439)]] (duration: 12m 58s) [20:38:52] T413912: Deployment: Promote advanced user menu to all users - https://phabricator.wikimedia.org/T413912 [20:38:53] T372592: Find a way to replace the use of ParserOutput::addJsConfigVars() in DiscussionTools - https://phabricator.wikimedia.org/T372592 [20:38:53] T416616: Create new cache-friendly lua/parser function for "is today before X date" and "is today after X date" - https://phabricator.wikimedia.org/T416616 [20:38:53] T416540: Mean MediaWiki backend latency increased by 60% between October and December 2025 - https://phabricator.wikimedia.org/T416540 [20:38:54] T419439: Clean up Cache Expiry computation - https://phabricator.wikimedia.org/T419439 [20:39:04] OK, next up, danisztls plus a bunch of mine. 
[20:39:17] cdobbins@cumin2002 reimage (PID 1254217) is awaiting input [20:39:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249983 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [20:39:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238733 (https://phabricator.wikimedia.org/T397402) (owner: 10Jforrester) [20:39:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238734 (https://phabricator.wikimedia.org/T397403) (owner: 10Jforrester) [20:39:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249393 (https://phabricator.wikimedia.org/T419476) (owner: 10Jforrester) [20:39:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249394 (owner: 10Jforrester) [20:39:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249395 (owner: 10Jforrester) [20:40:10] James_F: thanks! 
[20:41:26] (03Merged) 10jenkins-bot: Deploy participant recruitment survey on ptwiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249983 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [20:41:44] (03Merged) 10jenkins-bot: wikifunctions: Drop temporary WikifunctionsEnableHTMLOutput flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238733 (https://phabricator.wikimedia.org/T397402) (owner: 10Jforrester) [20:41:48] (03Merged) 10jenkins-bot: wikifunctions: Drop temporary WikifunctionsEnableWikidataInputTypes flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238734 (https://phabricator.wikimedia.org/T397403) (owner: 10Jforrester) [20:42:07] (03Merged) 10jenkins-bot: build: Upgrade mediawiki-phan-config from 0.18.0 to 0.20.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249393 (https://phabricator.wikimedia.org/T419476) (owner: 10Jforrester) [20:42:11] (03Merged) 10jenkins-bot: build: Upgrade mediawiki-codesniffer from 49.0.0 to 50.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249394 (owner: 10Jforrester) [20:42:13] (03Merged) 10jenkins-bot: build: Upgrade symfony/yaml from 7.4.0 to 7.4.6 and alpha-sort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249395 (owner: 10Jforrester) [20:42:41] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cdobbins@cumin2002" [20:42:42] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7006.magru.wmnet with OS trixie [20:43:12] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.2.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250073 (https://phabricator.wikimedia.org/T408233) [20:43:19] (03PS5) 10Jforrester: Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) 
[20:43:26] (03CR) 10Jforrester: [C:03+1] Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:43:26] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1249983|Deploy participant recruitment survey on ptwiki and trwiki (T419275)]], [[gerrit:1238733|wikifunctions: Drop temporary WikifunctionsEnableHTMLOutput flag (T397402)]], [[gerrit:1238734|wikifunctions: Drop temporary WikifunctionsEnableWikidataInputTypes flag (T397403)]], [[gerrit:1249393|build: Upgrade mediawiki-phan-config from 0.18.0 to 0.20 [20:43:27] .0 (T419476)]], [[gerrit:1249394|build: Upgrade mediawiki-codesniffer from 49.0.0 to 50.0.0]], [[gerrit:1249395|build: Upgrade symfony/yaml from 7.4.0 to 7.4.6 and alpha-sort]] [20:43:34] T419275: Deploy QuickSurvey for research participant registration drive on trwiki & ptwiki - https://phabricator.wikimedia.org/T419275 [20:43:34] T397402: If we enable Wikifunctions to output HTML tables, styling, and links, we will demonstrate through a Function that displays a conjugation table its capability for generating net new knowledge on Wiktionaries beyond simple conversions. 
- https://phabricator.wikimedia.org/T397402 [20:43:34] T397403: Add support for Wikidata items and Wikidata lexemes as function inputs - https://phabricator.wikimedia.org/T397403 [20:43:35] T419476: Bogus PhanPluginDuplicateArrayKey error in mediawiki-config - https://phabricator.wikimedia.org/T419476 [20:43:38] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7006.magru.wmnet with OS trixie [20:43:55] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:44:13] James_F: My change doesn't need to be tested. [20:44:25] danisztls: Ack, will go straight ahead after my tests. [20:44:34] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.2.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250074 (https://phabricator.wikimedia.org/T408233) [20:45:30] !log jforrester@deploy2002 dani, jforrester: Backport for [[gerrit:1249983|Deploy participant recruitment survey on ptwiki and trwiki (T419275)]], [[gerrit:1238733|wikifunctions: Drop temporary WikifunctionsEnableHTMLOutput flag (T397402)]], [[gerrit:1238734|wikifunctions: Drop temporary WikifunctionsEnableWikidataInputTypes flag (T397403)]], [[gerrit:1249393|build: Upgrade mediawiki-phan-config from 0.18.0 to 0.20.0 (T41 [20:45:30] 9476)]], [[gerrit:1249394|build: Upgrade mediawiki-codesniffer from 49.0.0 to 50.0.0]], [[gerrit:1249395|build: Upgrade symfony/yaml from 7.4.0 to 7.4.6 and alpha-sort]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:46:00] !log jforrester@deploy2002 dani, jforrester: Continuing with sync [20:48:21] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7007.magru.wmnet with reason: host reimage [20:49:33] jouncebot: nowandnext [20:49:33] For the next 0 hour(s) and 10 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T2000) [20:49:33] In 0 hour(s) and 10 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T2100) [20:49:51] Dreamy_Jazz: Do you have an emergency? tgr_ had something to deploy once I'm done. [20:49:55] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249983|Deploy participant recruitment survey on ptwiki and trwiki (T419275)]], [[gerrit:1238733|wikifunctions: Drop temporary WikifunctionsEnableHTMLOutput flag (T397402)]], [[gerrit:1238734|wikifunctions: Drop temporary WikifunctionsEnableWikidataInputTypes flag (T397403)]], [[gerrit:1249393|build: Upgrade mediawiki-phan-config from 0.18.0 to 0.2 [20:49:56] 0.0 (T419476)]], [[gerrit:1249394|build: Upgrade mediawiki-codesniffer from 49.0.0 to 50.0.0]], [[gerrit:1249395|build: Upgrade symfony/yaml from 7.4.0 to 7.4.6 and alpha-sort]] (duration: 06m 29s) [20:49:59] No emergency, can wait [20:50:03] T419275: Deploy QuickSurvey for research participant registration drive on trwiki & ptwiki - https://phabricator.wikimedia.org/T419275 [20:50:03] T397402: If we enable Wikifunctions to output HTML tables, styling, and links, we will demonstrate through a Function that displays a conjugation table its capability for generating net new knowledge on Wiktionaries beyond simple conversions. - https://phabricator.wikimedia.org/T397402 [20:50:04] T397403: Add support for Wikidata items and Wikidata lexemes as function inputs - https://phabricator.wikimedia.org/T397403 [20:50:04] Ack. 
[20:50:04] T419476: Bogus PhanPluginDuplicateArrayKey error in mediawiki-config - https://phabricator.wikimedia.org/T419476 [20:50:08] Over to tgr_ . [20:50:15] thx [20:51:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:51:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:53:21] (03CR) 10CI reject: [V:04-1] Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:54:11] Sigh. [20:54:22] Why is it only broken when tgr_ presses the buttons? [20:54:24] it's a different error! that counts as progress. [20:54:30] (03CR) 10Jforrester: [C:03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:54:31] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7007.magru.wmnet with reason: host reimage [20:54:36] It really doesn't. :-( [20:55:30] though actually I'm not sure what the error is [20:55:32] zabe: Now is not a great time to slam CI with merge requests. [20:55:54] It looks like CI lost track of its close-down jobs and so assumed an error happened. [20:55:58] But in all three at once? [20:56:32] (03CR) 10CI reject: [V:04-1] Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:56:41] And this time https://integration.wikimedia.org/ci/job/operations-mw-config-php83-composer-lint/1539/console failed again. 
[20:56:43] Eurgh. [20:56:44] T419488? [20:56:44] T419488: PostBuild changing the status of successful builds to failure for no apparent reason - https://phabricator.wikimedia.org/T419488 [20:57:14] A_smart_kitten: Yes, though that task is very definitely a dupe of previous conversations somewhere. [20:57:17] oh [20:57:46] It'd be nice if LibUp didn't run during deployment windows. :-( [20:57:53] (03CR) 10Jforrester: [C:03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:57:59] sorry [20:58:07] zabe: Not your fault. :-( [20:58:43] (03Merged) 10jenkins-bot: Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:58:48] Finally. [20:58:53] tgr_: Sorry again. Over to you. [20:59:08] thx for the shove [20:59:31] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7012.magru.wmnet with OS trixie [21:00:00] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1235552|Migrate EmailAuth, step 2 (T404334)]] [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260310T2100) [21:02:33] !log tgr@deploy2002 tgr: Backport for [[gerrit:1235552|Migrate EmailAuth, step 2 (T404334)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:09:37] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7006.magru.wmnet with reason: host reimage [21:11:33] ugh. Why doesn't createLocalAccount.php work on closed wikis? 
[21:12:29] I guess UltimateAuthority doesn't affect global permissions [21:13:51] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7006.magru.wmnet with reason: host reimage [21:21:42] !log tgr@deploy2002 tgr: Continuing with sync [21:21:55] that was mildly painful [21:22:31] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7007.magru.wmnet with OS trixie [21:24:08] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [reason: trixie reimaging] [21:25:34] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1235552|Migrate EmailAuth, step 2 (T404334)]] (duration: 25m 34s) [21:25:54] Dreamy_Jazz: all yours [21:26:01] Thanks! [21:28:13] Deploying private code... [21:30:14] Running scap, will say when done [21:42:01] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7006.magru.wmnet with OS trixie [21:42:19] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp7006.magru.wmnet [reason: trixie reimaging] [21:48:00] I'm done [21:48:32] !log Evening UTC backport window done [21:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:55] (03PS1) 10Scott French: envoy: Restore graceful default and support -max-wait-duration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1250090 (https://phabricator.wikimedia.org/T364245) [21:50:55] (03CR) 10Scott French: [V:03+2] "Built and tested locally against envoy testbed." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1250090 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [21:51:11] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7012.magru.wmnet with OS trixie [21:51:16] (03PS1) 10Pppery: Enwikinews: Only enable flaggedRevs in article namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250095 (https://phabricator.wikimedia.org/T418066) [21:55:07] (03CR) 10Scott French: [C:03+1] "Thanks, Blake!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1249322 (https://phabricator.wikimedia.org/T418133) (owner: 10Blake) [21:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:02:51] (03CR) 10RLazarus: [C:03+1] envoy: Restore graceful default and support -max-wait-duration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1250090 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [22:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:14:15] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:22:42] (03PS2) 10Clare Ming: Test Kitchen UI: Deploy v1.2.4 release to production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1250074 (https://phabricator.wikimedia.org/T408233) [22:24:06] (03PS2) 10Clare Ming: Test Kitchen UI: Deploy v1.2.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250073 (https://phabricator.wikimedia.org/T408233) [22:24:59] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.2.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250073 (https://phabricator.wikimedia.org/T408233) (owner: 10Clare Ming) [22:25:03] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.2.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250074 (https://phabricator.wikimedia.org/T408233) (owner: 10Clare Ming) [22:27:02] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250073 (https://phabricator.wikimedia.org/T408233) (owner: 10Clare Ming) [22:27:05] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250074 (https://phabricator.wikimedia.org/T408233) (owner: 10Clare Ming) [22:31:34] (03PS16) 10Andrew Bogott: toolforge etcdctl: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) [22:38:47] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [22:39:35] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [22:48:50] (03CR) 10Scott French: [V:03+2 C:03+2] "Thanks, Reuven!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1250090 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [22:48:53] (03PS1) 10Jforrester: [WIP] Expose new wikifunctions.v0 REST API module on Wikifunctions.org only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 [22:50:00] (03CR) 10CI reject: [V:04-1] [WIP] Expose new wikifunctions.v0 REST API module on Wikifunctions.org only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (owner: 10Jforrester) [22:51:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11695340 (10Jhancock.wm) 05Open→03Resolved [22:52:08] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11695342 (10Jhancock.wm) @MatthewVernon finished [22:58:18] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.2.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250108 (https://phabricator.wikimedia.org/T408186) [22:59:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2095.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:59:32] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.2.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250108 (https://phabricator.wikimedia.org/T408186) (owner: 10Clare Ming) [22:59:48] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:59:52] (03PS1) 10Clare Ming: Test Kitchen UI: Deploy v1.2.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250109 (https://phabricator.wikimedia.org/T408186) [23:00:50] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.2.5 release to 
production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250109 (https://phabricator.wikimedia.org/T408186) (owner: 10Clare Ming) [23:01:27] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.5 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250108 (https://phabricator.wikimedia.org/T408186) (owner: 10Clare Ming) [23:02:26] jhancock@cumin1003 provision (PID 2758066) is awaiting input [23:02:46] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.5 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250109 (https://phabricator.wikimedia.org/T408186) (owner: 10Clare Ming) [23:03:14] (03PS2) 10Jforrester: [WIP] Expose new wikifunctions.v0 REST API module on Wikifunctions.org only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) [23:03:30] (03CR) 10CI reject: [V:04-1] [WIP] Expose new wikifunctions.v0 REST API module on Wikifunctions.org only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) (owner: 10Jforrester) [23:04:43] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) (owner: 10Jforrester) [23:05:04] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [23:05:39] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [23:11:02] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2096.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:12:41] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host contint1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:13:48] (03PS3) 10Jforrester: [WIP] Expose new wikifunctions.v0 REST API module on Wikifunctions.org only 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) [23:13:48] (03PS1) 10Jforrester: Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 [23:13:48] (03PS1) 10Jforrester: Move GrowthExperiments REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250114 [23:14:10] jhancock@cumin1003 provision (PID 2758554) is awaiting input [23:14:23] (03CR) 10Jforrester: "Note: Needs checking on deployment that this relative path actually works!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) (owner: 10Jforrester) [23:14:24] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11695444 (10wiki_willy) Hi @ssingh - I was thinking along the lines of seeing if we would be able to calculate SERT ourselves inste... [23:16:16] jhancock@cumin1003 provision (PID 2758066) is awaiting input [23:22:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:22:39] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host contint1003.wikimedia.org with OS trixie [23:22:46] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11695464 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host contin... 
[23:24:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2096.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:26:15] (03CR) 10Andrew Bogott: toolforge etcdctl: update handling of 'member list' output (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [23:26:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2095.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:31:03] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2095.codfw.wmnet with OS bullseye [23:31:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11695521 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2095.codfw.wmnet with OS bullseye [23:31:33] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2096.codfw.wmnet with OS bullseye [23:31:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11695525 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2096.codfw.wmnet with OS bullseye [23:35:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [23:40:07] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on contint1003.wikimedia.org with reason: host reimage [23:42:48] (03PS1) 10Zabe: Stop setting 
$wgImageLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250117 (https://phabricator.wikimedia.org/T299953) [23:44:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint1003.wikimedia.org with reason: host reimage [23:49:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2096.codfw.wmnet with reason: host reimage [23:53:08] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2096.codfw.wmnet with reason: host reimage [23:58:16] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2095.codfw.wmnet with reason: host reimage