[00:00:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T418465)', diff saved to https://phabricator.wikimedia.org/P89785 and previous config saved to /var/cache/conftool/dbconfig/20260304-000052-marostegui.json [00:00:57] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [00:01:52] OK, I think I'm turning into a pumpkin. rzl: Let's chat tomorrow about what we can try next? [00:03:11] yeah sounds good - I'm out of time before too much longer, but I'll see what I can find in the meantime and we can continue then [00:13:46] (PS2) RLazarus: mw-experimental: Increase tracing sampling from 1% to 100% [deployment-charts] - https://gerrit.wikimedia.org/r/1247706 [00:14:13] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:16:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P89786 and previous config saved to /var/cache/conftool/dbconfig/20260304-001559-marostegui.json [00:26:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:28:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:30:14] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1245473 (https://phabricator.wikimedia.org/T418612) (owner: Cwhite) [00:31:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P89787 and previous config saved to /var/cache/conftool/dbconfig/20260304-003107-marostegui.json [00:33:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:36:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:39:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1247718 [00:39:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1247718 (owner: TrainBranchBot) [00:46:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db1262 (T418465)', diff saved to https://phabricator.wikimedia.org/P89788 and previous config saved to /var/cache/conftool/dbconfig/20260304-004615-marostegui.json [00:46:19] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [00:46:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1263.eqiad.wmnet with reason: Maintenance [00:46:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T418465)', diff saved to https://phabricator.wikimedia.org/P89789 and previous config saved to /var/cache/conftool/dbconfig/20260304-004638-marostegui.json [00:52:37] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1247718 (owner: TrainBranchBot) [00:59:13] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:03:06] (CR) Scott French: [C:+1] "Whoops, I totally missed that when reviewing the referenced patch. Thanks!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1247706 (owner: RLazarus) [01:09:22] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1247721 [01:09:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1247721 (owner: TrainBranchBot) [01:11:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T418465)', diff saved to https://phabricator.wikimedia.org/P89790 and previous config saved to /var/cache/conftool/dbconfig/20260304-011134-marostegui.json [01:11:39] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [01:14:44] (PS1) BryanDavis: toolforge: Drop legacy redirects for quentinv57-tools [puppet] - https://gerrit.wikimedia.org/r/1247722 (https://phabricator.wikimedia.org/T210829) [01:22:58] !log zabe@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [01:23:59] !log zabe@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [01:26:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P89791 and previous config saved to /var/cache/conftool/dbconfig/20260304-012642-marostegui.json [01:27:33] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1247721 (owner: TrainBranchBot) [01:29:13] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:41:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to 
https://phabricator.wikimedia.org/P89792 and previous config saved to /var/cache/conftool/dbconfig/20260304-014150-marostegui.json [01:43:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:49:13] FIRING: SLOMetricAbsent: wdqs-scholarly-availability magru - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:56:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T418465)', diff saved to https://phabricator.wikimedia.org/P89793 and previous config saved to /var/cache/conftool/dbconfig/20260304-015657-marostegui.json [01:57:01] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [01:57:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:00:49] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:08:53] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 04s) [02:08:54] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:16:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:17:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:19:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:24:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:27:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:33:54] FIRING: 
[2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:54] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:54:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:19:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:20:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:20:51] FIRING: [2x] CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr2-magru:ae0 (External: IX.BR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent [03:21:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [03:24:57] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:25:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:25:51] RESOLVED: [2x] CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr2-magru:ae0 (External: IX.BR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent [03:26:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [03:34:51] RESOLVED: SwitchCoreInterfaceDown: Switch core 
interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:35:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:40:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:42:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:44:01] SRE, Infrastructure-Foundations, netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11671668 (Papaul) I update the diagram [03:46:19] (PS1) Cathal Mooney: cmooney: add temporary ssh key for network device access [homer/public] - https://gerrit.wikimedia.org/r/1247735 [03:47:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - 
asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:48:49] SRE, Infrastructure-Foundations, netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11671673 (cmooney) Looks great, nice work! [03:53:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:01:58] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:01:58] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:03:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:03:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:03:54] FIRING: [7x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:05:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:08:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:10:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:10:58] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:10:58] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:13:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:13:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:13:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:18:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:18:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:20:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:39:39] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11671699 (Papaul) @ayounsi prior of deleting the sandbox1-ulsfo range 198.35.26.240/28 I will have to delete the interfaces et-0/0/1.1221 on both routers. D... 
[04:40:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:42:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:54:13] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971 (Papaul) NEW [04:57:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:59:13] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [04:59:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:01:16] SRE, Infrastructure-Foundations, netops, Prod-Kubernetes, ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11671717 (Papaul) @jcrespo @Marostegui @MatthewVernon can you please let us know if backup1007, dbprov1004 and ms-be1093 need depool befo... [05:04:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:07:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:11:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:12:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:21:51] RESOLVED: SwitchCoreInterfaceDown: Switch core 
interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:23:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:29:13] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:42:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1160.eqiad.wmnet with reason: Maintenance [05:43:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:43:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2179.codfw.wmnet with reason: Maintenance [05:46:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:49:13] FIRING: SLOMetricAbsent: wdqs-scholarly-availability magru - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:16:12] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim [06:17:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:19:16] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim [06:24:30] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11671766 (Marostegui) [06:29:23] (PS1) Marostegui: installserver: Remove db1291-db1298 [puppet] - https://gerrit.wikimedia.org/r/1247865 (https://phabricator.wikimedia.org/T407942) [06:31:38] (CR) Marostegui: [C:+2] installserver: Remove db1291-db1298 [puppet] - https://gerrit.wikimedia.org/r/1247865 (https://phabricator.wikimedia.org/T407942) (owner: Marostegui) [06:36:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:36:58] (PS1) Marostegui: 
installserver: Remove db1291-db1298 [puppet] - https://gerrit.wikimedia.org/r/1247866 (https://phabricator.wikimedia.org/T407942) [06:38:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:41:37] (PS2) Marostegui: installserver: Remove db1261-db1264 [puppet] - https://gerrit.wikimedia.org/r/1247866 (https://phabricator.wikimedia.org/T407942) [06:43:48] (CR) Marostegui: [C:+2] installserver: Remove db1261-db1264 [puppet] - https://gerrit.wikimedia.org/r/1247866 (https://phabricator.wikimedia.org/T407942) (owner: Marostegui) [06:45:39] (PS1) Marostegui: site.pp: Remove db1291-db1298 [puppet] - https://gerrit.wikimedia.org/r/1247867 (https://phabricator.wikimedia.org/T407942) [06:46:29] (CR) Marostegui: [C:+2] site.pp: Remove db1291-db1298 [puppet] - https://gerrit.wikimedia.org/r/1247867 (https://phabricator.wikimedia.org/T407942) (owner: Marostegui) [06:48:05] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11671786 (Marostegui) a:Marostegui→None All patches are ready [06:49:33] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11671788 (Marostegui) [06:53:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:55:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:56:44] (PS1) Marostegui: installserver: Add pc2021-pc2024 [puppet] - https://gerrit.wikimedia.org/r/1247868 (https://phabricator.wikimedia.org/T418907) [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T0700) [07:00:08] (CR) Marostegui: [C:+2] installserver: Add pc2021-pc2024 [puppet] - https://gerrit.wikimedia.org/r/1247868 (https://phabricator.wikimedia.org/T418907) (owner: Marostegui) [07:00:19] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11671805 (Marostegui) [07:00:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:03:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:03:53] (PS1) 
Marostegui: site.pp: Add pc202[1-4] [puppet] - https://gerrit.wikimedia.org/r/1247869 (https://phabricator.wikimedia.org/T418907) [07:07:32] (CR) Marostegui: [C:+2] site.pp: Add pc202[1-4] [puppet] - https://gerrit.wikimedia.org/r/1247869 (https://phabricator.wikimedia.org/T418907) (owner: Marostegui) [07:08:33] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11671812 (Marostegui) a:Marostegui→None Patches ready [07:09:03] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11671814 (Marostegui) [07:17:55] (CR) Ayounsi: [C:+2] cmooney: add temporary ssh key for network device access [homer/public] - https://gerrit.wikimedia.org/r/1247735 (owner: Cathal Mooney) [07:19:19] (Merged) jenkins-bot: cmooney: add temporary ssh key for network device access [homer/public] - https://gerrit.wikimedia.org/r/1247735 (owner: Cathal Mooney) [07:22:24] (CR) Arnaudb: "thanks for the hotfix!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1247625 (owner: 10Vgutierrez) [07:38:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:40:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:43:08] !log installing libbpf updates from Bookworm point release [07:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:30] !log disabling IBGP session between ssw1-d1-eqiad and ssw1-d8-eqiad to remove backup paths T411054 [07:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:34] T411054: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054 [07:56:32] (03PS1) 10Muehlenhoff: Switch pki-root1002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1247906 (https://phabricator.wikimedia.org/T416664) [07:57:43] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11671902 (10Jelto) [07:58:51] (03PS1) 10Marostegui: installserver: Install pc102[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1247907 (https://phabricator.wikimedia.org/T418908) [07:59:39] (03PS8) 10Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 
(https://phabricator.wikimedia.org/T242453) [08:00:04] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T0800). [08:00:05] nya_1F616EMO: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[34] - https://phabricator.wikimedia.org/T418903#11671906 (10MoritzMuehlenhoff) @RobH Why did you create a racking task only for two servers, are these shipped out in batches and we only get two initially? The order... [08:00:12] o/ [08:00:32] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 6 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [08:01:58] (03CR) 10Marostegui: [C:03+2] installserver: Install pc102[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1247907 (https://phabricator.wikimedia.org/T418908) (owner: 10Marostegui) [08:02:44] (03CR) 10Jelto: [C:03+1] "lgtm, although I'd make that a related change to I6f364dd61262c3f495888cabf5be8da0b38977ac" [puppet] - 10https://gerrit.wikimedia.org/r/1247528 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [08:02:53] Anyone here for the morning window? 
[08:03:54] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:05:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[56] - https://phabricator.wikimedia.org/T418903#11671910 (10MoritzMuehlenhoff) [08:05:12] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [08:05:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[56] - https://phabricator.wikimedia.org/T418903#11671913 (10MoritzMuehlenhoff) Also, I changed the names: ganeti1053/1053 were already added last year in https://phabricator.wikimedia.org/T401691. [08:05:56] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [08:06:34] urbanecm: (repinging as I'm not sure if pings are case-sensitive) [08:06:51] (03PS1) 10Muehlenhoff: Add ganeti1055/1056/1057/1058 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1247908 (https://phabricator.wikimedia.org/T418903) [08:07:34] (03CR) 10Muehlenhoff: [C:03+2] rsyslog: Remove obsolete and misleading comments [puppet] - 10https://gerrit.wikimedia.org/r/1247613 (owner: 10Muehlenhoff) [08:08:41] (03CR) 10Muehlenhoff: [C:03+2] openstack: Remove two buster checks [puppet] - 10https://gerrit.wikimedia.org/r/1243788 (owner: 10Muehlenhoff) [08:10:00] (03CR) 10Slyngshede: [C:03+1] hiera: set haproxy version to 3.0 on codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247516 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [08:10:08] (03PS3) 10Muehlenhoff: Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 [08:10:24] (03CR) 
10Slyngshede: [C:03+2] P:idm disallow signups from select domains [puppet] - 10https://gerrit.wikimedia.org/r/1247584 (https://phabricator.wikimedia.org/T418201) (owner: 10Slyngshede) [08:10:28] It seems like I am out of luck. Rescheduling to today afternoon. [08:10:40] (03CR) 10CI reject: [V:04-1] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [08:10:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [08:11:06] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp5032.* [08:11:40] (03CR) 10Fabfur: [C:03+2] hiera: set haproxy version to 3.0 on codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247516 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [08:13:06] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:13:11] PROBLEM - Host db1232 #page is DOWN: PING CRITICAL - Packet loss = 100% [08:13:11] PROBLEM - Host db1223 #page is DOWN: PING CRITICAL - Packet loss = 100% [08:13:14] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [08:13:17] RECOVERY - Host db1223 #page is UP: PING OK - Packet loss = 0%, RTA = 8.71 ms [08:13:18] what?? [08:13:18] RECOVERY - Host db1232 #page is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [08:13:24] !ack [08:13:24] no value provided for parameter incident and no default available [08:13:24] All incidents are already acked. [08:13:32] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [08:13:36] instant recovery? 
[08:13:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:13:57] yeah, looks so [08:14:01] XioNoX: any network issue? [08:14:18] I think this was me... sry.. I made a change to fix the VRRP alert and think I briefly made the interfaces bounce [08:14:24] they are both in d3 [08:14:30] topranks: ah gotcha, thanks [08:15:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:17:12] (03PS1) 10Marostegui: site.pp: Add pc102[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1247911 (https://phabricator.wikimedia.org/T418908) [08:17:58] (03CR) 10JMeybohm: install_server: use UEFI for new control plane nodes wikikube-ctrl200[4-5] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [08:18:10] (03CR) 10Elukey: [C:03+1] Switch pki-root1002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1247906 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [08:18:38] (03CR) 10JMeybohm: wmnet: add linked-artifacts CNAME record for k8s ingress (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1247172 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [08:18:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:18:54] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru 
(195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:19:10] (03PS1) 10Muehlenhoff: docker base image build: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1247912 [08:19:26] (03CR) 10Marostegui: [C:03+2] site.pp: Add pc102[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1247911 (https://phabricator.wikimedia.org/T418908) (owner: 10Marostegui) [08:20:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:20:57] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:20:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11671957 (10Marostegui) a:05Marostegui→03None Patches are ready [08:21:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11671959 (10Marostegui) [08:21:32] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - 3.0 upgrade () [08:21:45] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - 3.0 upgrade () [08:21:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - 
asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:22:28] (03PS1) 10Muehlenhoff: statsite: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247913 [08:23:09] (03CR) 10CI reject: [V:04-1] statsite: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247913 (owner: 10Muehlenhoff) [08:23:52] (03PS1) 10Muehlenhoff: profile::java Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247915 [08:24:53] (03PS1) 10Brouberol: Revert "growhbook: allow WMDE engineers to self-enroll" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247916 [08:25:44] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11671968 (10jcrespo) @papaul for backup1007, dbprov1004, while they are a production host with important content, a small network interrupti... 
[08:26:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:26:09] (03CR) 10CI reject: [V:04-1] profile::java Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247915 (owner: 10Muehlenhoff) [08:26:14] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11671969 (10jcrespo) [08:27:03] (03PS2) 10Muehlenhoff: Switch pki-root1002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1247906 (https://phabricator.wikimedia.org/T416664) [08:27:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:28:54] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:29:24] elukey@cumin1003 provision (PID 3200208) is awaiting input [08:30:46] (03PS1) 10Muehlenhoff: Remove support for old Elastic releases [puppet] - 10https://gerrit.wikimedia.org/r/1247917 (https://phabricator.wikimedia.org/T388607) [08:31:54] (03CR) 10Muehlenhoff: [C:03+2] Switch pki-root1002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1247906 
(https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [08:31:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:32:17] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:32:27] (03CR) 10Jelto: [C:03+2] "key has been verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1247522 (https://phabricator.wikimedia.org/T418415) (owner: 10Jelto) [08:32:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11671984 (10Marostegui) [08:33:54] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:34:42] (03PS3) 10Jelto: admin: add dtotten to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1247522 (https://phabricator.wikimedia.org/T418415) [08:35:02] (03PS9) 10Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) [08:35:11] (03PS1) 10Marostegui: mariadb: db225[0-3].yaml [puppet] - 10https://gerrit.wikimedia.org/r/1247919 (https://phabricator.wikimedia.org/T418911) [08:35:43] (03CR) 10JMeybohm: Add a ValidatingAdmissionPolicy for use with analytics workloads (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [08:35:57] (03CR) 10Marostegui: [C:03+2] mariadb: db225[0-3].yaml [puppet] - 10https://gerrit.wikimedia.org/r/1247919 
(https://phabricator.wikimedia.org/T418911) (owner: 10Marostegui) [08:36:41] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on backup1007.eqiad.wmnet,dbprov1004.eqiad.wmnet with reason: network maintenance [08:36:52] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11671989 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cd8c8777-0916-4a5b-b6f5-55f2535990f4) set by jynus@cumin1003 fo... [08:37:05] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978 (10cmooney) 03NEW p:05Triage→03High [08:37:08] (03CR) 10Jelto: [C:03+2] admin: add dtotten to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1247522 (https://phabricator.wikimedia.org/T418415) (owner: 10Jelto) [08:37:35] (03CR) 10JMeybohm: [C:04-1] Apply the new VAP to several namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [08:38:15] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11672002 (10jcrespo) [08:40:46] 06SRE, 06Traffic, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11672025 (10Diskdance) FYI, the ECH standard has been stabilized as RFC9848: https://www.rfc-editor.org/info/rfc9848. 
[08:40:50] (03PS4) 10Muehlenhoff: Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 [08:41:04] (03PS1) 10Gehel: wdqs: Remove non-operational endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1247920 (https://phabricator.wikimedia.org/T417571) [08:41:06] (03PS1) 10Gehel: wdqs: Add https://catalog.digital-scriptorium.org/query to allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1247921 (https://phabricator.wikimedia.org/T416910) [08:41:28] (03CR) 10JMeybohm: [C:03+2] loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [08:41:29] (03PS3) 10Elukey: sre.hosts.provision: add a better message when no NICS are found in BIOS [cookbooks] - 10https://gerrit.wikimedia.org/r/1242408 [08:41:29] (03PS1) 10Elukey: Fix setuptools version and some code violations [cookbooks] - 10https://gerrit.wikimedia.org/r/1247922 [08:41:42] (03CR) 10JMeybohm: [C:03+2] k8s.pool-depool-node: Don't ask for confirmation on check [cookbooks] - 10https://gerrit.wikimedia.org/r/1247624 (https://phabricator.wikimedia.org/T410537) (owner: 10JMeybohm) [08:42:22] (03PS2) 10Elukey: Fix setuptools version and some code violations [cookbooks] - 10https://gerrit.wikimedia.org/r/1247922 [08:45:45] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11672031 (10Jelto) 05Open→03Resolved a:03Jelto Thank you for the key. You should have access now. I also created a kerberos principal becaus... 
[08:46:32] (03Merged) 10jenkins-bot: loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [08:46:57] 06SRE, 06Traffic, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11672036 (10Diskdance) [08:46:59] (03PS1) 10Muehlenhoff: Remove obsolete spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1247923 [08:47:27] (03Merged) 10jenkins-bot: k8s.pool-depool-node: Don't ask for confirmation on check [cookbooks] - 10https://gerrit.wikimedia.org/r/1247624 (https://phabricator.wikimedia.org/T410537) (owner: 10JMeybohm) [08:47:40] (03PS2) 10JMeybohm: k8s.pool-depool-cookbook: Handle calicoctl exiting with error [cookbooks] - 10https://gerrit.wikimedia.org/r/1247628 (https://phabricator.wikimedia.org/T418259) [08:47:54] 06SRE, 06Traffic, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11672040 (10Diskdance) [08:47:54] (03CR) 10Elukey: [C:03+2] Fix setuptools version and some code violations [cookbooks] - 10https://gerrit.wikimedia.org/r/1247922 (owner: 10Elukey) [08:48:02] (03CR) 10JMeybohm: [C:03+2] k8s.pool-depool-cookbook: Handle calicoctl exiting with error (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1247628 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [08:48:03] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: add a better message when no NICS are found in BIOS [cookbooks] - 10https://gerrit.wikimedia.org/r/1242408 (owner: 10Elukey) [08:48:53] elukey@cumin1003 provision (PID 3943624) is awaiting input [08:49:16] !log disabling IBGP session between ssw1-d1-eqiad and ssw1-d8-eqiad to remove backup paths try #2 T411054 [08:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:19] T411054: 
Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054 [08:50:13] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [08:51:33] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [08:51:35] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 6 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [08:52:07] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts pki-root1002.eqiad.wmnet [08:53:24] (03Merged) 10jenkins-bot: k8s.pool-depool-cookbook: Handle calicoctl exiting with error [cookbooks] - 10https://gerrit.wikimedia.org/r/1247628 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [08:53:49] Above VRRP alert is ok - both CRs think they are masters as Spines<->Spine BGP is disabled, however this does not cause a problem with that particular VXLAN config. I silenced the alert. 
[08:53:54] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:54:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1097.eqiad.wmnet [08:56:08] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1096.eqiad.wmnet [08:57:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [08:58:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [08:59:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [09:00:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1097.eqiad.wmnet [09:00:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [09:02:04] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1096.eqiad.wmnet [09:02:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [09:03:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [09:03:24] !log switching off Blazegraph on wdqs2009 (legacy full graph endpoint is end of life) - T411410 / T415073 [09:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:29] (03PS1) 10Mszwarc: Require 2FA from CentralNotice admins and WMF Trust & Safety [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247925 (https://phabricator.wikimedia.org/T418580) [09:03:29] T411410: Decommission WDQS full graph endpoint (wdqs2009) - https://phabricator.wikimedia.org/T411410 [09:03:29] T415073: Cleanup after decommission of the WDQS full graph endpoint - 
https://phabricator.wikimedia.org/T415073 [09:04:15] (03CR) 10David Caro: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1247722 (https://phabricator.wikimedia.org/T210829) (owner: 10BryanDavis) [09:05:38] (03PS1) 10Gehel: wdqs: remove query-legacy-full.wikidata.org - end of life [dns] - 10https://gerrit.wikimedia.org/r/1247926 (https://phabricator.wikimedia.org/T415073) [09:05:50] (03CR) 10David Caro: [C:03+2] toolforge: Drop legacy redirects for quentinv57-tools [puppet] - 10https://gerrit.wikimedia.org/r/1247722 (https://phabricator.wikimedia.org/T210829) (owner: 10BryanDavis) [09:08:05] (03CR) 10Ayounsi: [C:03+1] Add a new role for routed jumbo Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1247636 (https://phabricator.wikimedia.org/T410314) (owner: 10Muehlenhoff) [09:08:19] (03PS1) 10Marostegui: installserver: Install db225[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/1247928 (https://phabricator.wikimedia.org/T418911) [09:11:02] (03CR) 10Marostegui: [C:03+2] installserver: Install db225[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/1247928 (https://phabricator.wikimedia.org/T418911) (owner: 10Marostegui) [09:15:34] (03PS14) 10Arnaudb: gerrit: sync httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) [09:15:34] (03CR) 10Arnaudb: "thanks for that, this was also highlighted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240197/comments/fe48bbdf_c561942f I've" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [09:16:19] (03CR) 10Brouberol: [C:03+1] wdqs: Remove non-operational endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1247920 (https://phabricator.wikimedia.org/T417571) (owner: 10Gehel) [09:16:28] (03CR) 10Brouberol: [C:03+1] wdqs: Add https://catalog.digital-scriptorium.org/query to allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1247921 (https://phabricator.wikimedia.org/T416910) 
(owner: 10Gehel) [09:16:46] (03PS1) 10MVernon: swift: add 2 new storage nodes ms-be109{6,7} [puppet] - 10https://gerrit.wikimedia.org/r/1247932 (https://phabricator.wikimedia.org/T413089) [09:17:11] (03PS1) 10Gehel: wdqs: cleanup code related to query-legacy-full.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1247933 (https://phabricator.wikimedia.org/T415073) [09:19:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11672118 (10MatthewVernon) 05Open→03Resolved Yes, they look good now, thank you! [09:19:21] (03PS1) 10Marostegui: site.pp: Add db225[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/1247934 (https://phabricator.wikimedia.org/T418979) [09:20:19] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11672136 (10cmooney) @ayounsi thanks for following up on this. I've done some testing to see if there may be a better way to force a tunnel teardown/re-establishment today. The reason cl... 
[09:20:34] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts pki-root1002.eqiad.wmnet [09:20:37] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts pki-root1002.eqiad.wmnet [09:20:45] (03PS2) 10Marostegui: site.pp: Add db225[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/1247934 (https://phabricator.wikimedia.org/T418911) [09:20:49] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts pki-root1002.eqiad.wmnet [09:20:58] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts pki-root1002.eqiad.wmnet [09:21:42] (03CR) 10Marostegui: [C:03+2] site.pp: Add db225[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/1247934 (https://phabricator.wikimedia.org/T418911) (owner: 10Marostegui) [09:21:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet [09:22:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11672149 (10Marostegui) a:05Marostegui→03None Patches ready [09:23:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11672151 (10Marostegui) [09:24:13] (03PS1) 10Muehlenhoff: profile::ci::package_builder: Stop using component/ci [puppet] - 10https://gerrit.wikimedia.org/r/1247935 [09:24:59] (03CR) 10Brouberol: [C:03+1] wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [09:26:14] (03CR) 10Vgutierrez: [C:03+1] cache:text: add gerrit-spare to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247528 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [09:26:35] (03CR) 
10Vgutierrez: [C:03+1] cache:text: add gerrit-replica to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247026 (https://phabricator.wikimedia.org/T418108) (owner: 10Jelto) [09:27:02] (03CR) 10Gehel: [C:03+2] wdqs: Remove non-operational endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1247920 (https://phabricator.wikimedia.org/T417571) (owner: 10Gehel) [09:27:11] (03CR) 10Gehel: [C:03+2] wdqs: Add https://catalog.digital-scriptorium.org/query to allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1247921 (https://phabricator.wikimedia.org/T416910) (owner: 10Gehel) [09:28:02] (03CR) 10Jelto: [C:03+2] cache:text: add gerrit-replica to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247026 (https://phabricator.wikimedia.org/T418108) (owner: 10Jelto) [09:28:10] (03CR) 10Hashar: [C:03+1] "Moritz and I exchanged about it. At the time I built a custom version of the package (T212774) and then went with monkey patching." [puppet] - 10https://gerrit.wikimedia.org/r/1247935 (owner: 10Muehlenhoff) [09:28:57] (03PS1) 10Brouberol: turnilo: fix typo in configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247936 (https://phabricator.wikimedia.org/T416113) [09:29:05] (03PS1) 10MVernon: preseed: all apus-be nodes are using boss cards [puppet] - 10https://gerrit.wikimedia.org/r/1247937 (https://phabricator.wikimedia.org/T418901) [09:29:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247925 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [09:29:48] elukey@cumin1003 provision (PID 3943624) is awaiting input [09:30:04] (03Merged) 10jenkins-bot: Require 2FA from CentralNotice admins and WMF Trust & Safety [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247925 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [09:30:11] (03PS2) 10Brouberol: turnilo: fix typo in configuration [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1247936 (https://phabricator.wikimedia.org/T416113) [09:30:37] (03PS3) 10Brouberol: turnilo: fix typo in configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247936 (https://phabricator.wikimedia.org/T416113) [09:30:57] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:30:57] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1247925|Require 2FA from CentralNotice admins and WMF Trust & Safety (T418580 T417880)]] [09:31:04] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580 [09:31:04] T417880: Set OATH2FARequiredGroupRemovalPages value for Wikimedia cluster - https://phabricator.wikimedia.org/T417880 [09:31:47] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp - 3.0 upgrade () [09:32:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet [09:32:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts pki-root1002.eqiad.wmnet [09:33:05] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1247925|Require 2FA from CentralNotice admins and WMF Trust & Safety (T418580 T417880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:34:28] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11672217 (10MatthewVernon) [09:35:23] !log mszwarc@deploy2002 mszwarc: Continuing with sync [09:36:04] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp - 3.0 upgrade () [09:36:07] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11672223 (10MatthewVernon) Is this maintenance happening at 15:00 UTC today? @Papaul ms-be1093 needs no action taking, but it'd be worth co... [09:38:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [09:39:01] (03CR) 10Muehlenhoff: [C:03+2] profile::ci::package_builder: Stop using component/ci [puppet] - 10https://gerrit.wikimedia.org/r/1247935 (owner: 10Muehlenhoff) [09:39:20] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247925|Require 2FA from CentralNotice admins and WMF Trust & Safety (T418580 T417880)]] (duration: 08m 23s) [09:39:25] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580 [09:39:26] T417880: Set OATH2FARequiredGroupRemovalPages value for Wikimedia cluster - https://phabricator.wikimedia.org/T417880 [09:41:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247912 (owner: 10Muehlenhoff) [09:44:42] (03CR) 10Federico Ceratto: [C:03+1] "The two hostnames match across the yaml files, the commit description and the related task." 
[puppet] - 10https://gerrit.wikimedia.org/r/1247932 (https://phabricator.wikimedia.org/T413089) (owner: 10MVernon) [09:45:10] (03PS2) 10Muehlenhoff: statsite: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247913 [09:45:52] (03PS3) 10Muehlenhoff: Add a new role for routed jumbo Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1247636 (https://phabricator.wikimedia.org/T410314) [09:45:55] (03Abandoned) 10Arnaudb: cache:text: add gerrit-spare to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247528 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [09:47:09] (03PS1) 10Arnaudb: cache:text: add gerrit-replica to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247940 (https://phabricator.wikimedia.org/T418108) [09:47:27] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti1055/1056/1057/1058 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1247908 (https://phabricator.wikimedia.org/T418903) (owner: 10Muehlenhoff) [09:48:05] (03CR) 10Jelto: [C:03+1] "lgtm thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1247940 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [09:48:26] (03CR) 10Arnaudb: [C:03+2] cache:text: add gerrit-replica to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247940 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [09:49:13] FIRING: SLOMetricAbsent: wdqs-scholarly-availability magru - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:50:02] (03PS1) 10Ladsgroup: WebPHandler: Allow the original being served on the web [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1247941 (https://phabricator.wikimedia.org/T414805) [09:50:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1200.eqiad.wmnet [09:51:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker 
nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11672279 (10ops-monitoring-bot) Host an-worker1200.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [09:51:31] (03PS1) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [09:51:55] (03CR) 10Joal: [C:03+1] turnilo: fix typo in configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247936 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [09:52:13] (03PS2) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [09:53:50] (03CR) 10Brouberol: [C:03+2] turnilo: fix typo in configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247936 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [09:55:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-worker23[57-74] - https://phabricator.wikimedia.org/T418925#11672291 (10JMeybohm) [09:56:44] jouncebot: nowandnext [09:56:44] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [09:56:44] In 1 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1100) [09:57:01] (03PS1) 10Ladsgroup: WebPHandler: Allow the original being served on the web [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247944 (https://phabricator.wikimedia.org/T414805) [09:57:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [09:57:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [09:57:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/1247941 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [09:57:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247944 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [09:57:43] (03CR) 10Arnaudb: [C:03+1] aux: add wmf-navigator namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247615 (owner: 10AOkoth) [10:01:11] RECOVERY - Host an-worker1199 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [10:01:13] PROBLEM - SSH on an-worker1199 is CRITICAL: connect to address 10.64.143.5 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:02:51] (03CR) 10Jelto: [V:03+1 C:03+1] "this should work now:" [dns] - 10https://gerrit.wikimedia.org/r/1247531 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [10:03:39] (03CR) 10Muehlenhoff: [C:03+2] Add a new role for routed jumbo Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1247636 (https://phabricator.wikimedia.org/T410314) (owner: 10Muehlenhoff) [10:04:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1200.eqiad.wmnet [10:04:06] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1201.eqiad.wmnet [10:04:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11672332 (10ops-monitoring-bot) Host an-worker1201.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[10:04:43] (03PS1) 10Gehel: wdqs: remove query-legay-full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) [10:05:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [10:05:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247913 (owner: 10Muehlenhoff) [10:06:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [10:06:34] (03CR) 10Gehel: "This might need a manual destroy of the deployment before merging this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [10:08:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [10:09:08] (03Merged) 10jenkins-bot: WebPHandler: Allow the original being served on the web [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1247941 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [10:09:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [10:09:38] (03CR) 10CI reject: [V:04-1] WebPHandler: Allow the original being served on the web [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247944 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [10:10:16] (03CR) 10Ladsgroup: [C:03+2] "again" [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247944 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [10:10:46] (03CR) 10Joal: [V:03+2 C:03+2] "Merging" [alerts] - 10https://gerrit.wikimedia.org/r/1247600 (https://phabricator.wikimedia.org/T418152) (owner: 10Joal) [10:14:45] PROBLEM - Host wikikube-worker1069 is DOWN: PING CRITICAL - Packet loss = 25%, RTA = 2958.14 ms [10:14:49] (03PS7) 10Daniel Kinzler: rest gateway: expose headers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) [10:15:04] (03CR) 10Fabfur: [C:03+2] hiera: set haproxy version to 3.0 on drmrs cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247517 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [10:15:05] (03CR) 10Daniel Kinzler: rest gateway: expose headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) (owner: 10Daniel Kinzler) [10:15:27] (03PS8) 10Daniel Kinzler: rest gateway: expose headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) [10:15:35] RECOVERY - Host wikikube-worker1069 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [10:16:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1201.eqiad.wmnet [10:16:05] (03Merged) 10jenkins-bot: WebPHandler: Allow the original being served on the web [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247944 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [10:16:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1202.eqiad.wmnet [10:16:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11672383 (10ops-monitoring-bot) Host an-worker1202.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[10:16:59] (03PS2) 10Muehlenhoff: aptrepo: Remove buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1247618 [10:17:18] (03CR) 10Daniel Kinzler: rest gateway: expose headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) (owner: 10Daniel Kinzler) [10:17:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:17:55] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1247941|WebPHandler: Allow the original being served on the web (T414805 T418745 T418346)]], [[gerrit:1247944|WebPHandler: Allow the original being served on the web (T414805 T418745 T418346)]] [10:18:02] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [10:18:03] T418745: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745 [10:18:03] T418346: 429 too many requests when trying to view .webp image in MediaViewer in Firefox - https://phabricator.wikimedia.org/T418346 [10:18:33] (03PS3) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [10:18:59] (03PS1) 10Muehlenhoff: varnish: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247953 [10:19:33] (03CR) 10Jelto: [C:03+1] cache:text: add gerrit-replica to alternate_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1247940 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [10:19:49] (03PS4) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 
(https://phabricator.wikimedia.org/T417035) [10:19:59] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1247941|WebPHandler: Allow the original being served on the web (T414805 T418745 T418346)]], [[gerrit:1247944|WebPHandler: Allow the original being served on the web (T414805 T418745 T418346)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:20:19] (03CR) 10Arnaudb: [C:03+2] cache:text: add gerrit-replica to alternate_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1247940 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [10:20:39] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:20:48] (03CR) 10Arnaudb: [C:03+2] gerrit: move gerrit-spare behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1247531 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [10:20:58] !log arnaudb@dns1004 START - running authdns-update [10:22:08] !log arnaudb@dns1004 END - running authdns-update [10:23:15] (03PS5) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [10:24:13] (03PS6) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [10:24:37] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247941|WebPHandler: Allow the original being served on the web (T414805 T418745 T418346)]], [[gerrit:1247944|WebPHandler: Allow the original being served on the web (T414805 T418745 T418346)]] (duration: 06m 42s) [10:24:43] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [10:24:44] T418745: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745 [10:24:44] T418346: 429 too many requests 
when trying to view .webp image in MediaViewer in Firefox - https://phabricator.wikimedia.org/T418346 [10:25:04] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - 3.0 upgrade () [10:25:06] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - 3.0 upgrade () [10:25:13] (03PS7) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [10:25:16] !log start upgrading haproxy to 3.0 on A:cp-drmrs (T417253) [10:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:19] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [10:25:55] (03CR) 10MVernon: [C:03+2] swift: add 2 new storage nodes ms-be109{6,7} [puppet] - 10https://gerrit.wikimedia.org/r/1247932 (https://phabricator.wikimedia.org/T413089) (owner: 10MVernon) [10:26:25] (03PS8) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [10:27:48] (03PS9) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [10:28:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1202.eqiad.wmnet [10:28:19] PROBLEM - Confd vcl based reload on cp6013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [10:28:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1203.eqiad.wmnet [10:28:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11672447 (10ops-monitoring-bot) Host an-worker1203.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:29:40] (03CR) 10Vgutierrez: [C:03+1] varnish: add trusted_req and rl_class fields to x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [10:30:58] (03CR) 10Fabfur: [C:03+2] varnish: add trusted_req and rl_class fields to x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [10:32:19] RECOVERY - Confd vcl based reload on cp6013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [10:32:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [10:32:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [10:39:37] (03CR) 10Federico Ceratto: [C:03+1] "I see all apus-be* being configured for preseeding as described." [puppet] - 10https://gerrit.wikimedia.org/r/1247937 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [10:40:19] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [10:40:24] (03PS1) 10Brouberol: turnilo: enable egress to the mw-api-int service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247959 (https://phabricator.wikimedia.org/T416113) [10:41:40] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, just a nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [10:42:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1203.eqiad.wmnet [10:42:37] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1204.eqiad.wmnet [10:42:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11672513 (10ops-monitoring-bot) Host an-worker1204.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:43:00] (03PS9) 10Daniel Kinzler: rest gateway: expose x-wmf-ratelimit-class in response [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) [10:44:19] RECOVERY - Confd vcl based reload on cp6015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [10:44:25] (03CR) 10MVernon: [C:03+2] preseed: all apus-be nodes are using boss cards [puppet] - 10https://gerrit.wikimedia.org/r/1247937 (https://phabricator.wikimedia.org/T418901) (owner: 10MVernon) [10:44:35] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247962 (https://phabricator.wikimedia.org/T418467) [10:45:21] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nice" [puppet] - 10https://gerrit.wikimedia.org/r/1247618 (owner: 10Muehlenhoff) [10:45:45] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11672519 (10MatthewVernon) a:05MatthewVernon→03None {{done}} [10:46:06] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11672521 (10MatthewVernon) a:05MatthewVernon→03None {{done}} [10:46:14] (03CR) 10Joal: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247959 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [10:47:19] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [10:47:33] (03PS1) 10David Caro: legacy_redirector: remove some disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/1247963 (https://phabricator.wikimedia.org/T418829) [10:48:35] (03CR) 10Brouberol: [C:03+2] turnilo: enable egress to the mw-api-int service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247959 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [10:48:55] (03Abandoned) 10Arnaudb: gerrit: move gerrit-replica behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1247530 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [10:49:20] (03PS1) 10Arnaudb: gerrit: move gerrit-replica behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1247964 (https://phabricator.wikimedia.org/T418108) [10:51:19] RECOVERY - Confd vcl based reload on cp6015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [10:52:05] (03CR) 10Jelto: [C:03+1] "lgtm 💯" [dns] - 10https://gerrit.wikimedia.org/r/1247964 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [10:52:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:53:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [10:53:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [10:54:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1204.eqiad.wmnet [10:54:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1205.eqiad.wmnet [10:54:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - 
https://phabricator.wikimedia.org/T415002#11672571 (10ops-monitoring-bot) Host an-worker1205.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:55:13] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11672572 (10Clement_Goubert) [10:55:20] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2332-2356].codfw.wmnet [10:55:32] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2332-2356].codfw.wmnet [10:56:34] (03PS1) 10Dreamy Jazz: SI: Update instrumentation schema [extensions/CheckUser] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247968 (https://phabricator.wikimedia.org/T418293) [10:56:41] (03PS1) 10Clément Goubert: site.pp: Add rdb201[34] [puppet] - 10https://gerrit.wikimedia.org/r/1247969 (https://phabricator.wikimedia.org/T418922) [10:57:03] jouncebot: nowandnext [10:57:03] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [10:57:03] In 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1100) [10:57:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247968 (https://phabricator.wikimedia.org/T418293) (owner: 10Dreamy Jazz) [10:58:58] blake@cumin1003 roll-reimage-nodes (PID 4042970) is awaiting input [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1100) [11:03:09] !log blake@cumin1003 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[2332-2356].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [11:03:15] !log blake@cumin1003 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes 
(exit_code=1) rolling reimage on P{wikikube-worker[2332-2356].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [11:06:32] (03PS2) 10Clément Goubert: Add new rdb201[34] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247969 (https://phabricator.wikimedia.org/T418922) [11:06:43] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2332-2356].codfw.wmnet [11:06:50] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2332-2356].codfw.wmnet [11:06:59] (03CR) 10David Caro: "There's some non-disabled ones, fixing" [puppet] - 10https://gerrit.wikimedia.org/r/1247963 (https://phabricator.wikimedia.org/T418829) (owner: 10David Caro) [11:07:10] !log blake@cumin1003 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[2332-2356].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [11:07:16] !log blake@cumin1003 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[2332-2356].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [11:07:25] (03CR) 10CI reject: [V:04-1] SI: Update instrumentation schema [extensions/CheckUser] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247968 (https://phabricator.wikimedia.org/T418293) (owner: 10Dreamy Jazz) [11:07:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:07:59] PROBLEM - Host an-worker1199 is DOWN: PING CRITICAL - Packet loss = 100% [11:08:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993 (10MoritzMuehlenhoff) 03NEW [11:08:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 
04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247566 (https://phabricator.wikimedia.org/T416748) (owner: 10Michael Große) [11:08:45] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp - 3.0 upgrade () [11:09:13] RECOVERY - SSH on an-worker1199 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:09:15] RECOVERY - Host an-worker1199 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [11:10:58] (03CR) 10Dreamy Jazz: [V:03+2 C:03+2] "This will cause instrumentation to be lost and the failing tests are T418982" [extensions/CheckUser] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247968 (https://phabricator.wikimedia.org/T418293) (owner: 10Dreamy Jazz) [11:11:49] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247518 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [11:12:07] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1247968|SI: Update instrumentation schema (T418293)]] [11:12:11] T418293: Suggested Investigations: Create a filter to filter out cases with no users blocked - https://phabricator.wikimedia.org/T418293 [11:13:46] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp - 3.0 upgrade () [11:14:45] PROBLEM - Host an-worker1205 is DOWN: PING CRITICAL - Packet loss = 100% [11:15:01] (03PS1) 10Clément Goubert: Add new wikikube-ctrl100[45] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247971 (https://phabricator.wikimedia.org/T418919) [11:16:48] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11672695 (10MoritzMuehlenhoff) @Clement_Goubert The 
current rdb* hosts are on Bullseye, and you listed Bookworm as the designated OS, if we move to a new OS, let's dire... [11:17:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247618 (owner: 10Muehlenhoff) [11:17:32] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1247968|SI: Update instrumentation schema (T418293)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:17:36] T418293: Suggested Investigations: Create a filter to filter out cases with no users blocked - https://phabricator.wikimedia.org/T418293 [11:17:45] (03PS2) 10Clément Goubert: Add new wikikube-ctrl100[56] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247971 (https://phabricator.wikimedia.org/T418919) [11:18:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11672700 (10Clement_Goubert) [11:18:57] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:19:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11672703 (10Clement_Goubert) >>! In T418919#11670190, @RobH wrote: > @Clement_Goubert, > > I've assumed the racking details, please double check them for acc... 
[11:21:02] (03CR) 10Fabfur: [C:03+2] hiera: set haproxy version to 3.0 on eqiad cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247518 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [11:22:10] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [11:22:28] !log start upgrading haproxy to 3.0 on A:cp-eqiad (T417253) [11:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:31] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [11:23:57] FIRING: [2x] CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:28:30] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247968|SI: Update instrumentation schema (T418293)]] (duration: 16m 22s) [11:28:33] T418293: Suggested Investigations: Create a filter to filter out cases with no users blocked - https://phabricator.wikimedia.org/T418293 [11:30:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11672738 (10JMeybohm) [11:30:34] (03CR) 10JMeybohm: [C:03+1] Add new wikikube-ctrl100[56] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247971 (https://phabricator.wikimedia.org/T418919) (owner: 10Clément Goubert) [11:33:45] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11672762 (10Ladsgroup) I just added any requests to non-standard thumbs that the referrer is not us to the global rate limit. So far... 
[11:34:13] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 3.0 upgrade () [11:34:20] (03CR) 10Blake: [C:03+1] Add new rdb201[34] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247969 (https://phabricator.wikimedia.org/T418922) (owner: 10Clément Goubert) [11:34:23] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - 3.0 upgrade () [11:35:55] (03CR) 10Clément Goubert: [C:03+2] Add new wikikube-ctrl100[56] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247971 (https://phabricator.wikimedia.org/T418919) (owner: 10Clément Goubert) [11:36:19] !log jmm@cumin2002 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [11:36:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f8-eqiad [11:37:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install wikikube-ctrl100[56] - https://phabricator.wikimedia.org/T418919#11672766 (10Clement_Goubert) a:05Clement_Goubert→03None All yours. [11:37:25] RECOVERY - Host an-worker1205 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [11:38:57] FIRING: [2x] CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:39:48] (03CR) 10Clément Goubert: [C:03+2] Add new rdb201[34] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247969 (https://phabricator.wikimedia.org/T418922) (owner: 10Clément Goubert) [11:39:59] PROBLEM - SSH on an-worker1205 is CRITICAL: connect to address 10.64.163.5 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:41:29] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11672787 (10Clement_Goubert) >>! 
In T418922#11672695, @MoritzMuehlenhoff wrote: > @Clement_Goubert The current rdb* hosts are on Bullseye, and you listed Bookworm as th... [11:41:49] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11672788 (10Clement_Goubert) a:05Clement_Goubert→03None [11:42:40] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11672794 (10Clement_Goubert) Updated racking details, changed OS to Trixie, puppet patches merged. All yours. [11:43:00] (03PS1) 10Muehlenhoff: Add separate puppetserver hooks for the private repo [puppet] - 10https://gerrit.wikimedia.org/r/1247976 (https://phabricator.wikimedia.org/T365798) [11:43:10] 06SRE, 10Bitu, 06Infrastructure-Foundations: wikimedia-l was signed up for a developer account - https://phabricator.wikimedia.org/T418201#11672795 (10SLyngshede-WMF) 05In progress→03Resolved [11:43:38] (03CR) 10CI reject: [V:04-1] Add separate puppetserver hooks for the private repo [puppet] - 10https://gerrit.wikimedia.org/r/1247976 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [11:44:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11672803 (10Clement_Goubert) [11:46:27] (03PS1) 10Clément Goubert: Add new rdb101[56] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247977 (https://phabricator.wikimedia.org/T418916) [11:46:31] (03PS2) 10Muehlenhoff: Add separate puppetserver hooks for the private repo [puppet] - 10https://gerrit.wikimedia.org/r/1247976 (https://phabricator.wikimedia.org/T365798) [11:46:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11672806 (10Clement_Goubert) Updated task description with racking details and OS. 
Waiting for review on the puppet patch. [11:48:49] PROBLEM - Host an-worker1205 is DOWN: PING CRITICAL - Packet loss = 100% [11:50:59] RECOVERY - SSH on an-worker1205 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:51:01] RECOVERY - Host an-worker1205 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [11:51:03] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11672807 (10MoritzMuehlenhoff) [11:52:23] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11672808 (10MoritzMuehlenhoff) [11:53:41] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1205 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 UGood : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:53:43] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1205 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 UGood : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T419000 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:53:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1205.eqiad.wmnet [11:53:52] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000 (10ops-monitoring-bot) 03NEW [11:53:54] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1206.eqiad.wmnet [11:54:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the 
disks - https://phabricator.wikimedia.org/T415002#11672819 (10ops-monitoring-bot) Host an-worker1206.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [11:55:41] (03PS1) 10MVernon: admin: remove old software-only ssh key for raymond-ndibe [puppet] - 10https://gerrit.wikimedia.org/r/1247979 (https://phabricator.wikimedia.org/T417594) [11:57:27] 06SRE, 10Bitu, 06Infrastructure-Foundations: wikimedia-l was signed up for a developer account - https://phabricator.wikimedia.org/T418201#11672822 (10RhinosF1) Thank you :) [11:57:29] 06SRE, 10Bitu, 06Infrastructure-Foundations: wikimedia-l was signed up for a developer account - https://phabricator.wikimedia.org/T418201#11672823 (10RhinosF1) Thank you :) [11:57:32] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1247979 (https://phabricator.wikimedia.org/T417594) (owner: 10MVernon) [11:57:51] (03CR) 10MVernon: [C:03+2] admin: remove old software-only ssh key for raymond-ndibe [puppet] - 10https://gerrit.wikimedia.org/r/1247979 (https://phabricator.wikimedia.org/T417594) (owner: 10MVernon) [11:59:27] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting update of Raymond Ndibe's SSH key to Yubikey-backed key - https://phabricator.wikimedia.org/T417594#11672825 (10MatthewVernon) Hi @Raymond_Ndibe - I've removed your old key now (so it'll be removed from production systems in the next 20 minutes... [12:00:04] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1200) [12:03:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1206.eqiad.wmnet [12:03:43] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1207.eqiad.wmnet [12:03:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11672839 (10ayounsi) [12:04:01] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11672840 (10ayounsi) [12:04:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11672841 (10ops-monitoring-bot) Host an-worker1207.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:04:07] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11672842 (10ayounsi) [12:04:10] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11672843 (10ayounsi) [12:04:21] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11672845 (10ayounsi) [12:04:24] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11672844 (10ayounsi) [12:06:04] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243140 (owner: 10PipelineBot) [12:06:12] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243731 (owner: 10PipelineBot) [12:06:25] (03CR) 10Mvolz: 
[C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244666 (owner: 10PipelineBot) [12:08:26] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp - 3.0 upgrade () [12:10:38] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 3.0 upgrade () [12:12:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11672854 (10ayounsi) Note that this is partly blocked on {T418971} As we're planning on using 198.35.26.96/27 for `public1-virtual-ulsfo` to keep our public IP allocations standardized... [12:19:13] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:24:09] PROBLEM - Host an-worker1207 is DOWN: PING CRITICAL - Packet loss = 100% [12:25:33] (03PS9) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [12:25:33] (03PS11) 10Btullis: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [12:26:34] (03CR) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [12:27:44] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-worker23[57-74] implementation tracking - https://phabricator.wikimedia.org/T418927#11672892 (10Raine) 05Open→03Stalled p:05Triage→03Medium Un-stall when racking task 
done. [12:29:28] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb201[34] implementation tracking - https://phabricator.wikimedia.org/T418924#11672898 (10Raine) 05Open→03Stalled p:05Triage→03Medium Un-stall when racking task done. [12:29:52] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:30:03] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-ctrl100[45] implementation tracking - https://phabricator.wikimedia.org/T418920#11672906 (10Raine) 05Open→03Stalled p:05Triage→03Medium Un-stall when racking task done. [12:31:00] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#11672912 (10Raine) 05Open→03Stalled p:05Triage→03Medium Un-stall when racking task done. [12:31:25] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247519 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [12:33:35] (03CR) 10Kamila Součková: [C:03+1] "LGTM given https://phabricator.wikimedia.org/T417781#11663198 ." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) (owner: 10Daniel Kinzler) [12:33:36] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:34:13] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:34:19] It looks like charts aren't updating again - [12:34:24] Is there an issue with helm? 
[12:36:28] oh nvm [12:36:35] wasn't submitted, ha [12:39:08] (03CR) 10A-pizzata: [C:03+1] stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247962 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [12:43:25] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:43:38] (03PS1) 10Clément Goubert: Add new wikikube-worker23[57-74] [puppet] - 10https://gerrit.wikimedia.org/r/1247989 (https://phabricator.wikimedia.org/T418916) [12:43:44] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:44:18] (03CR) 10Blake: [C:03+1] Add new wikikube-worker23[57-74] [puppet] - 10https://gerrit.wikimedia.org/r/1247989 (https://phabricator.wikimedia.org/T418916) (owner: 10Clément Goubert) [12:44:27] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:44:59] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:45:32] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:46:01] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247519 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [12:46:02] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:46:33] (03CR) 10Majavah: [C:03+1] Enable Bird 2.18 for cloudservices/eqiad1 and cloudlb/eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1242431 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [12:47:57] (03PS3) 10Muehlenhoff: Add separate puppetserver hooks for the private repo [puppet] - 10https://gerrit.wikimedia.org/r/1247976 (https://phabricator.wikimedia.org/T365798) [12:48:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247953 (owner: 10Muehlenhoff) [12:49:39] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244601 (owner: 
10PipelineBot) [12:49:44] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244602 (owner: 10PipelineBot) [12:52:21] RECOVERY - Host an-worker1207 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [12:52:51] (03CR) 10Fabfur: [C:03+2] hiera: set haproxy version to 3.0 on esams cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247519 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [12:54:13] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:54:48] (03PS1) 10Mszwarc: Drop 'centralnoticeadmin' from $wgOATHRequiredForGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247990 (https://phabricator.wikimedia.org/T418580) [12:54:59] PROBLEM - SSH on an-worker1207 is CRITICAL: connect to address 10.64.165.5 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:55:12] (03CR) 10Majavah: [C:04-1] "Can we have these return 410 Gone to show a more accurate error page? 
(see some examples at the bottom of the www.toolserver.org file)" [puppet] - 10https://gerrit.wikimedia.org/r/1247963 (https://phabricator.wikimedia.org/T418829) (owner: 10David Caro) [12:55:40] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11672998 (10Jclark-ctr) a:03Jclark-ctr [13:00:26] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 3.0 upgrade () [13:00:30] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247962 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [13:00:42] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp - 3.0 upgrade () [13:02:15] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:26] (03CR) 10Arnaudb: [C:03+2] gerrit: move gerrit-replica behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1247964 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [13:02:33] !log arnaudb@dns1005 START - running authdns-update [13:02:56] (03CR) 10AOkoth: [C:03+2] aux: add wmf-navigator namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247615 (owner: 10AOkoth) [13:03:33] (03PS1) 10Muehlenhoff: Add library hint for mbedtls [puppet] - 10https://gerrit.wikimedia.org/r/1247991 [13:03:42] (03PS1) 10Fabfur: hiera: cleanup per-dc hiera files for haproxy30 [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) [13:03:48] !log 
arnaudb@dns1005 END - running authdns-update [13:05:23] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [13:05:41] (03CR) 10Muehlenhoff: [C:03+2] Enable Bird 2.18 for cloudservices/eqiad1 and cloudlb/eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1242431 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [13:06:28] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1207.eqiad.wmnet [13:06:31] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1208.eqiad.wmnet [13:06:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673024 (10ops-monitoring-bot) Host an-worker1208.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:08:56] (03CR) 10Elukey: [C:03+1] "After a chat with Balthazar on slack there doesn't seem to be a way to do this without the initcontainer trick for the moment, so let's do" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [13:10:18] (03CR) 10Mforns: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1247962 (https://phabricator.wikimedia.org/T418467) (owner: 10JavierMonton) [13:11:02] (03CR) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211650 (owner: 10Majavah) [13:11:18] (03PS1) 10Ayounsi: ulsfo: add new LVS service IP range [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) [13:13:10] (03CR) 10Mszwarc: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247990 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [13:14:41] (03PS2) 10Fabfur: hiera: cleanup per-dc hiera files for haproxy30 [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) [13:15:09] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:16:05] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:16:42] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) (owner: 10Arnaudb) [13:16:45] (03CR) 10Vgutierrez: [C:04-1] "haproxy version is currently set on `hieradata/common/profile/cache/haproxy.yaml`" [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [13:16:52] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:17:08] !log aokoth@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [13:17:10] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1247534 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [13:17:39] !log aokoth@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[13:18:53] (03PS1) 10Elukey: admin_ng: reduce logs emitted by knative components on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247995 (https://phabricator.wikimedia.org/T416384) [13:19:23] (03PS2) 10Clément Goubert: Add new wikikube-worker23[57-74] [puppet] - 10https://gerrit.wikimedia.org/r/1247989 (https://phabricator.wikimedia.org/T418925) [13:20:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1208.eqiad.wmnet [13:20:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1209.eqiad.wmnet [13:20:25] FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673071 (10ops-monitoring-bot) Host an-worker1209.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[13:21:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install wikikube-worker23[57-74] - https://phabricator.wikimedia.org/T418925#11673072 (10Clement_Goubert) [13:22:33] (03CR) 10Clément Goubert: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1247989 (https://phabricator.wikimedia.org/T418925) (owner: 10Clément Goubert) [13:26:04] (03PS6) 10Arnaudb: gerrit: alerting downtime update [cookbooks] - 10https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) [13:27:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11673132 (10Jclark-ctr) [13:28:48] (03PS7) 10Arnaudb: gerrit: alerting downtime update [cookbooks] - 10https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) [13:28:54] RESOLVED: SLOMetricAbsent: wdqs-scholarly-availability magru - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:30:46] (03PS1) 10Federico Ceratto: mariadb: set cluster value to "mysql" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) [13:30:55] (03CR) 10Majavah: [C:03+2] cr-cloud: Move allow-public below deny-to-private-subnets [homer/public] - 10https://gerrit.wikimedia.org/r/970275 (owner: 10Majavah) [13:33:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1209.eqiad.wmnet [13:33:09] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1210.eqiad.wmnet [13:33:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673160 (10ops-monitoring-bot) Host an-worker1210.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[13:33:41] (03CR) 10Marostegui: "Can you give a bit more context? Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [13:34:12] (03PS1) 10Arnaudb: Revert "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1247997 [13:34:46] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-02-25-131752 to 2026-02-28-010106 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247998 (https://phabricator.wikimedia.org/T417024) [13:34:49] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: ms-fe1013 reports a backplane error - https://phabricator.wikimedia.org/T419010 (10elukey) 03NEW [13:34:49] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-02-25-124326 to 2026-03-04-123739 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247999 (https://phabricator.wikimedia.org/T413728) [13:34:57] (03PS2) 10Federico Ceratto: mariadb: set cluster value to "mysql" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) [13:35:54] (03CR) 10Marostegui: "Is this happening with all hosts? 
(the current problem)" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [13:36:14] (03CR) 10Federico Ceratto: "heh I was just typing the desc 😄 - I'll try to run the puppet compiler to see what will change" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [13:36:28] (03CR) 10Jelto: [C:03+1] "revert looks good to me, although I'd be surprised if that had an impact on production zuul/ci" [dns] - 10https://gerrit.wikimedia.org/r/1247997 (owner: 10Arnaudb) [13:36:36] (03CR) 10Federico Ceratto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [13:37:17] (03PS1) 10Gergő Tisza: Revert "Enable JWT session cookie for bot passwords (all wikis) (attempt #2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248000 (https://phabricator.wikimedia.org/T415007) [13:37:34] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp - 3.0 upgrade () [13:37:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248000 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [13:37:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[13:37:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [13:37:58] (03PS2) 10Elukey: admin_ng: reduce logs emitted by knative components on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247995 (https://phabricator.wikimedia.org/T416384) [13:38:24] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1247997 (owner: 10Arnaudb) [13:38:29] (03CR) 10Arnaudb: [V:03+2 C:03+2] Revert "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1247997 (owner: 10Arnaudb) [13:39:01] !log arnaudb@dns1004 START - running authdns-update [13:40:14] !log arnaudb@dns1004 END - running authdns-update [13:40:23] (03PS1) 10Muehlenhoff: Add repository sync definition for nodejs 24 [puppet] - 10https://gerrit.wikimedia.org/r/1248001 (https://phabricator.wikimedia.org/T418440) [13:40:41] PROBLEM - Host lswtest-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:41:18] (03PS1) 10Arnaudb: Revert^2 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248002 (https://phabricator.wikimedia.org/T418108) [13:41:53] PROBLEM - Host lswtest-d8-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:43:58] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp - 3.0 upgrade () [13:44:37] RECOVERY - Host lswtest-d8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [13:45:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1210.eqiad.wmnet [13:45:14] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for 
host an-worker1211.eqiad.wmnet [13:45:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673237 (10ops-monitoring-bot) Host an-worker1211.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:46:55] RECOVERY - Host lswtest-d8-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [13:46:55] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247995 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:49:00] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - 3.0 upgrade () [13:49:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11673247 (10Jclark-ctr) Service request WO21310177 [13:50:12] (03PS1) 10Majavah: site: Use nftables insetup role for cloudgw2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1248004 (https://phabricator.wikimedia.org/T418765) [13:50:22] (03CR) 10Elukey: "Staging tests look good:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247995 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:51:17] PROBLEM - zuul_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 1 process with regex args bin/zuul-server https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:51:26] (03PS3) 10Fabfur: hiera: cleanup per-dc hiera files for haproxy30 [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) [13:52:09] PROBLEM - zuul_gearman_service on contint1002 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:52:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Decommission 
an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers - https://phabricator.wikimedia.org/T414948#11673264 (10Jclark-ctr) [13:52:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: ms-fe1013 reports a backplane error - https://phabricator.wikimedia.org/T419010#11673265 (10MatthewVernon) [13:53:03] (03PS1) 10Tiziano Fogli: thanos/rec_rules: add prometheus_ingested_metrics rec rules group [puppet] - 10https://gerrit.wikimedia.org/r/1248006 (https://phabricator.wikimedia.org/T415317) [13:54:09] RECOVERY - zuul_gearman_service on contint1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:54:17] RECOVERY - zuul_service_running on contint1002 is OK: PROCS OK: 2 processes with regex args bin/zuul-server https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:54:57] (03CR) 10Majavah: [C:03+2] "..." [homer/public] - 10https://gerrit.wikimedia.org/r/970275 (owner: 10Majavah) [13:55:25] (03CR) 10Federico Ceratto: "No, but 122 "db.*" hosts are flagged misc." [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [13:56:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers - https://phabricator.wikimedia.org/T414948#11673293 (10Jclark-ctr) 05Open→03Resolved [13:56:17] (03Merged) 10jenkins-bot: cr-cloud: Move allow-public below deny-to-private-subnets [homer/public] - 10https://gerrit.wikimedia.org/r/970275 (owner: 10Majavah) [13:56:18] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1248004 (https://phabricator.wikimedia.org/T418765) (owner: 10Majavah) [13:56:43] (03CR) 10AikoChou: [C:03+1] "Thanks for tackling this!!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1247995 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [13:57:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1211.eqiad.wmnet [13:57:10] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1212.eqiad.wmnet [13:57:12] (03PS2) 10Tiziano Fogli: thanos/rec_rules: add prometheus_ingested_metrics rec rules group [puppet] - 10https://gerrit.wikimedia.org/r/1248006 (https://phabricator.wikimedia.org/T415317) [13:57:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet [13:57:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673294 (10ops-monitoring-bot) Host an-worker1212.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:57:35] (03PS1) 10Gergő Tisza: Fix $wgJwtSessionCookieIssuer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248007 (https://phabricator.wikimedia.org/T415007) [13:57:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11673295 (10ops-monitoring-bot) Draining ganeti4005.ulsfo.wmnet of running VMs [13:58:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248007 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [13:58:08] (03CR) 10Marostegui: "So how can not all of them be affected? 
My point is, this will fix those 122?, but why aren't the others not affected by the current issue" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [13:58:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet [13:58:48] (03CR) 10Eevans: wmnet: add linked-artifacts CNAME record for k8s ingress (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1247172 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:58:51] (03PS3) 10Tiziano Fogli: thanos/rec_rules: add prometheus_ingested_metrics rec rules group [puppet] - 10https://gerrit.wikimedia.org/r/1248006 (https://phabricator.wikimedia.org/T415317) [13:59:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet [13:59:13] (03PS10) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [13:59:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11673299 (10ops-monitoring-bot) Draining ganeti4005.ulsfo.wmnet of running VMs [13:59:54] (03PS2) 10Eevans: wmnet: add linked-artifacts CNAME record for k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1247172 (https://phabricator.wikimedia.org/T414112) [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1400). Please do the needful. [14:00:05] nya_1F616EMO, Sergi0, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:15] o/ [14:00:18] (03PS2) 10Arnaudb: Revert^2 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248002 (https://phabricator.wikimedia.org/T418108) [14:00:27] o/ [14:00:38] (03CR) 10Eevans: wmnet: add linked-artifacts CNAME record for k8s ingress (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1247172 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:00:44] (03PS11) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [14:01:35] (03PS1) 10Dreamy Jazz: Define $wgWikimediaMessagesHasLiquidThreadsLogs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248008 (https://phabricator.wikimedia.org/T417425) [14:02:03] (03PS4) 10Tiziano Fogli: thanos/rec_rules: add prometheus_ingested_metrics rec rules group [puppet] - 10https://gerrit.wikimedia.org/r/1248006 (https://phabricator.wikimedia.org/T415317) [14:02:12] (03CR) 10Jelto: "lgtm, but lets coordinate with @hashar@free.fr" [dns] - 10https://gerrit.wikimedia.org/r/1248002 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [14:02:21] @nya_1F616EMO are you around? 
[14:03:40] (03CR) 10Fabfur: "this should be fixed" [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [14:03:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [14:04:24] (03CR) 10Jelto: [C:03+1] Revert^2 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248002 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [14:04:25] I'm gonna start self-deploying my change [14:05:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248008 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:05:23] jouncebot: refresh [14:05:24] I refreshed my knowledge about deployments. [14:05:33] jouncebot: nowandnext [14:05:33] For the next 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1400) [14:05:33] In 0 hour(s) and 54 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1500) [14:05:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247566 (https://phabricator.wikimedia.org/T416748) (owner: 10Michael Große) [14:06:04] (03CR) 10Elukey: [C:03+2] admin_ng: reduce logs emitted by knative components on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247995 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [14:06:13] (03CR) 10Btullis: Apply the new VAP to several namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) 
[14:06:30] (03Merged) 10jenkins-bot: Enable new HTML confirmation emails for all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247566 (https://phabricator.wikimedia.org/T416748) (owner: 10Michael Große) [14:07:04] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1247566|Enable new HTML confirmation emails for all (T416748)]] [14:07:08] T416748: Release improved verification email to all wikis (WE1.1.22 FY2025-26) - https://phabricator.wikimedia.org/T416748 [14:07:51] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:08:04] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:08:33] (03CR) 10Klausman: [C:03+1] admin_ng: reduce logs emitted by knative components on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247995 (https://phabricator.wikimedia.org/T416384) (owner: 10Elukey) [14:08:37] (03PS1) 10Dreamy Jazz: Hooks: Fix liquidthreads log type definition bugs [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248009 (https://phabricator.wikimedia.org/T417425) [14:08:39] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:08:49] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:09:10] !log sgimeno@deploy2002 migr, sgimeno: Backport for [[gerrit:1247566|Enable new HTML confirmation emails for all (T416748)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:09:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1212.eqiad.wmnet [14:09:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1213.eqiad.wmnet [14:09:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248009 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:09:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673365 (10ops-monitoring-bot) Host an-worker1213.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:10:25] FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:49] !log sgimeno@deploy2002 migr, sgimeno: Continuing with sync [14:12:21] (03PS8) 10Arnaudb: gerrit: alerting downtime update [cookbooks] - 10https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) [14:13:19] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:13:38] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:14:50] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247566|Enable new HTML confirmation emails for all (T416748)]] (duration: 07m 46s) [14:14:54] T416748: Release improved verification email to all wikis (WE1.1.22 FY2025-26) - https://phabricator.wikimedia.org/T416748 [14:15:16] @tgr_ all yours [14:16:29] thx [14:17:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248000 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [14:17:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [14:18:35] (03Merged) 10jenkins-bot: Revert "Enable JWT session cookie for bot passwords (all wikis) (attempt #2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248000 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [14:19:04] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1248000|Revert "Enable JWT session cookie for bot passwords (all wikis) (attempt #2)" (T415007 T418999)]] [14:19:09] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [14:19:09] T418999: Remove trailing slash in issuer for bot password JWT cookies - https://phabricator.wikimedia.org/T418999 [14:19:45] (03CR) 10Arnaudb: [C:03+2] gerrit: alerting downtime update [cookbooks] - 10https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) (owner: 10Arnaudb) [14:21:18] !log tgr@deploy2002 tgr: Backport for [[gerrit:1248000|Revert "Enable JWT session cookie for bot passwords (all wikis) (attempt #2)" (T415007 T418999)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:21:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1213.eqiad.wmnet [14:21:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1214.eqiad.wmnet [14:21:54] (03PS1) 10Zabe: CategoryViewer: Fall back to empty string in case of missing nextpage [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248011 (https://phabricator.wikimedia.org/T418934) [14:22:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673448 (10ops-monitoring-bot) Host an-worker1214.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:22:32] !log tgr@deploy2002 tgr: Continuing with sync [14:23:21] (03CR) 10CI reject: [V:04-1] Hooks: Fix liquidthreads log type definition bugs [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248009 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:23:30] (03PS1) 10Gergő Tisza: Enable JWT session cookie for bot passwords (all wikis) (attempt #3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248012 (https://phabricator.wikimedia.org/T415007) [14:23:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248012 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [14:24:48] (03CR) 10Dreamy Jazz: "recheck" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248009 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:25:46] (03Merged) 10jenkins-bot: gerrit: alerting downtime update [cookbooks] - 
10https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) (owner: 10Arnaudb) [14:26:12] (03CR) 10Dreamy Jazz: [C:03+2] Hooks: Fix liquidthreads log type definition bugs [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248009 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:26:24] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248000|Revert "Enable JWT session cookie for bot passwords (all wikis) (attempt #2)" (T415007 T418999)]] (duration: 07m 19s) [14:26:25] I'm going to +2 my backport early to get it moving through CI [14:26:28] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [14:26:29] T418999: Remove trailing slash in issuer for bot password JWT cookies - https://phabricator.wikimedia.org/T418999 [14:26:35] (03CR) 10Arnaudb: [C:03+2] Revert^2 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248002 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [14:26:38] (Oh it seems the previous one just finished anyway :D ) [14:26:46] !log arnaudb@dns1004 START - running authdns-update [14:27:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [14:27:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248009 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:27:59] !log arnaudb@dns1004 END - running authdns-update [14:27:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248008 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:28:05] !ack [14:28:06] no value provided for parameter incident and no default available [14:28:06] All incidents are already acked. [14:28:34] (03PS1) 10Arnaudb: Revert^3 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248013 [14:28:55] (03Merged) 10jenkins-bot: Define $wgWikimediaMessagesHasLiquidThreadsLogs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248008 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:29:25] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp - 3.0 upgrade () [14:30:11] (03PS2) 10Arnaudb: Revert^3 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248013 [14:30:30] !log btullis@puppetserver1001 conftool action : get/pooled; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-worker1024.eqiad.wmnet [14:31:01] !log btullis@puppetserver1001 conftool action : set/weight=1; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-worker1024.eqiad.wmnet [14:31:08] !log btullis@puppetserver1001 conftool action : set/weight=1; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-worker1025.eqiad.wmnet [14:31:20] !log btullis@puppetserver1001 conftool action : set/weight=1; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-worker1028.eqiad.wmnet [14:31:25] (03PS3) 10Arnaudb: Revert^3 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248013 [14:31:29] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-worker1024.eqiad.wmnet [14:31:34] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: 
service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-worker1025.eqiad.wmnet [14:31:41] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-worker1028.eqiad.wmnet [14:32:05] !log btullis@puppetserver1001 conftool action : get/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=eqiad,name=dse-k8s-worker1028.eqiad.wmnet [14:32:36] (03PS4) 10Arnaudb: Revert^3 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248013 [14:32:47] (03Abandoned) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.9 [puppet] - 10https://gerrit.wikimedia.org/r/1244678 (https://phabricator.wikimedia.org/T418483) (owner: 10AOkoth) [14:33:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1214.eqiad.wmnet [14:33:28] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1215.eqiad.wmnet [14:33:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673501 (10ops-monitoring-bot) Host an-worker1215.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[14:34:20] (03PS5) 10Arnaudb: Revert^3 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248013 [14:35:42] (03Abandoned) 10Arnaudb: Revert^3 "gerrit: move gerrit-replica behind CDN" [dns] - 10https://gerrit.wikimedia.org/r/1248013 (owner: 10Arnaudb) [14:36:49] (03Merged) 10jenkins-bot: Hooks: Fix liquidthreads log type definition bugs [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248009 (https://phabricator.wikimedia.org/T417425) (owner: 10Dreamy Jazz) [14:37:23] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1248009|Hooks: Fix liquidthreads log type definition bugs (T417425 T419006)]], [[gerrit:1248008|Define $wgWikimediaMessagesHasLiquidThreadsLogs (T417425)]] [14:37:23] (03PS1) 10Aqu: dse-k8s-services Blunderbuss: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248015 (https://phabricator.wikimedia.org/T415874) [14:37:28] T417425: Threaded discussion log appears in all wikis - https://phabricator.wikimedia.org/T417425 [14:37:29] T419006: LiquidThreads: WikimediaMessages extension loaded check is using wrong name causing millions of translation-problem errors - https://phabricator.wikimedia.org/T419006 [14:39:26] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1248009|Hooks: Fix liquidthreads log type definition bugs (T417425 T419006)]], [[gerrit:1248008|Define $wgWikimediaMessagesHasLiquidThreadsLogs (T417425)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:39:34] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11673532 (10Papaul) @MatthewVernon thank you. 
Yes it will be at 15:00 UTC [14:40:23] (03CR) 10Btullis: [C:03+2] dse-k8s-services Blunderbuss: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248015 (https://phabricator.wikimedia.org/T415874) (owner: 10Aqu) [14:40:37] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:40:56] (03PS2) 10David Caro: legacy_redirector: remove some disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/1247963 (https://phabricator.wikimedia.org/T418829) [14:41:13] (03CR) 10David Caro: "Is this what you had in mind?" [puppet] - 10https://gerrit.wikimedia.org/r/1247963 (https://phabricator.wikimedia.org/T418829) (owner: 10David Caro) [14:42:13] !log ayounsi@cumin1003 conftool action : set/pooled=no; selector: name=cirrussearch1120.eqiad.wmnet [14:42:27] !log ayounsi@cumin1003 conftool action : set/pooled=no; selector: name=cirrussearch1121.eqiad.wmnet [14:42:28] (03Merged) 10jenkins-bot: dse-k8s-services Blunderbuss: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248015 (https://phabricator.wikimedia.org/T415874) (owner: 10Aqu) [14:42:35] !log ayounsi@cumin1003 conftool action : set/pooled=no; selector: name=cirrussearch1122.eqiad.wmnet [14:44:04] !log updating CR firewall policy with https://gerrit.wikimedia.org/r/c/operations/homer/public/+/970275 [14:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:29] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp1115.eqiad.wmnet [reason: T418772 - BGP maintenance] [14:44:32] T418772: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772 [14:44:34] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248009|Hooks: Fix liquidthreads log type definition bugs (T417425 T419006)]], [[gerrit:1248008|Define $wgWikimediaMessagesHasLiquidThreadsLogs (T417425)]] (duration: 07m 11s) [14:44:35] Damnit, how can I miss two windows... 
[14:44:38] T417425: Threaded discussion log appears in all wikis - https://phabricator.wikimedia.org/T417425 [14:44:39] T419006: LiquidThreads: WikimediaMessages extension loaded check is using wrong name causing millions of translation-problem errors - https://phabricator.wikimedia.org/T419006 [14:44:55] nya_1F616EMO: still in time [14:44:58] nya_1F616EMO: Window isn't over yet if you can be around [14:45:04] Oh okay, I can do it now [14:45:11] Sorry about that [14:45:28] Np, I've only just finished my deploy so no one had to hang around for the rest of the window :D [14:45:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1215.eqiad.wmnet [14:45:42] WikimediaDebug and browser ready [14:45:44] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1216.eqiad.wmnet [14:45:56] (03PS5) 10Jelto: gerrit: limit access to http/https/ssh in firewall [puppet] - 10https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [14:45:56] (03CR) 10Jelto: [V:03+1 C:03+1] "I'm removing WIP status boldly, see comment" [puppet] - 10https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [14:46:03] I can do the deploy of this unless anyone ese wants to? [14:46:05] *else [14:46:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673559 (10ops-monitoring-bot) Host an-worker1216.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:46:17] Reviewing the change now.... [14:48:36] Looks fine to me. 
Proceeding to deploy [14:48:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:49:35] (03Merged) 10jenkins-bot: zhwiki: Remove all rights from accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:50:06] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1244373|zhwiki: Remove all rights from accountcreator (T418089)]] [14:50:10] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [14:50:14] (03CR) 10Gergő Tisza: "Should we also remove [the right to remove people from certain groups](https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/3ff486" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:50:33] (03CR) 10Majavah: [C:03+1] "yes, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1247963 (https://phabricator.wikimedia.org/T418829) (owner: 10David Caro) [14:52:10] !log dreamyjazz@deploy2002 dreamyjazz, 1f616emo: Backport for [[gerrit:1244373|zhwiki: Remove all rights from accountcreator (T418089)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:52:29] It works by testing via k8s-mwdebug [14:52:52] !log jmm@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [14:53:17] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:53:42] (03CR) 10Dreamy Jazz: "It was already done by 892116ac28ec38f4c279476abe70ab48ad044eec. AFAICS I don't see the definition in the master branch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [14:54:28] !log dreamyjazz@deploy2002 dreamyjazz, 1f616emo: Continuing with sync [14:54:32] Thanks for testing [14:54:35] :-D [14:54:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:55:11] oh ouch [14:55:44] Changes seen in prod now, thanks Dreamy_Jazz [14:56:21] Thanks, it's still deploying but should be done shortly [14:56:28] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on dse-k8s-worker[1010-1011,1013,1018-1019].eqiad.wmnet with reason: Adding 10 Gbps NIC [14:57:45] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:57:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1216.eqiad.wmnet [14:57:52] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1217.eqiad.wmnet [14:57:54] !log ebernhardson@deploy2002 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:58:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673623 (10ops-monitoring-bot) Host an-worker1217.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:58:18] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244373|zhwiki: Remove all rights from accountcreator (T418089)]] (duration: 08m 12s) [14:58:21] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [14:58:34] (03CR) 10Vgutierrez: [C:04-1] "please get rid of the per host override for the new cp hosts in codfw as well" [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [14:58:53] !log Afternoon UTC backport window done [14:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1500) [15:00:56] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-02-25-124326 to 2026-03-04-123739 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247999 (https://phabricator.wikimedia.org/T413728) [15:01:02] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-02-25-131752 to 2026-02-28-010106 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247998 (https://phabricator.wikimedia.org/T417024) [15:01:21] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-02-25-131752 to 2026-02-28-010106 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247998 (https://phabricator.wikimedia.org/T417024) (owner: 10Jforrester) [15:02:17] RECOVERY - Druid historical on an-druid1007 is 
OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:03:34] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-02-25-131752 to 2026-02-28-010106 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247998 (https://phabricator.wikimedia.org/T417024) (owner: 10Jforrester) [15:03:39] (03PS1) 10Zabe: Stop writing to il_to on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248021 (https://phabricator.wikimedia.org/T415787) [15:04:26] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:05:16] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:05:19] !log upgrading cloudlb* to Bird 2.18 T413740 [15:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:21] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740 [15:05:26] (03PS1) 10Gehel: wdqs: remove duplciate endpoints in federation allow list [puppet] - 10https://gerrit.wikimedia.org/r/1248022 (https://phabricator.wikimedia.org/T417573) [15:05:44] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:05:51] (03PS2) 10Gehel: wdqs: remove duplicate endpoints in federation allow list [puppet] - 10https://gerrit.wikimedia.org/r/1248022 (https://phabricator.wikimedia.org/T417573) [15:07:00] (03CR) 10Mszwarc: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247990 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [15:08:04] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:08:23] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:08:26] (03PS1) 10Gehel: wdqs: sort federation allow list alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1248023 [15:09:07] !log 
jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:09:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1217.eqiad.wmnet [15:09:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1218.eqiad.wmnet [15:09:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673717 (10ops-monitoring-bot) Host an-worker1218.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [15:10:08] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-02-25-124326 to 2026-03-04-123739 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247999 (https://phabricator.wikimedia.org/T413728) (owner: 10Jforrester) [15:10:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [15:10:34] !log lsw1-d7-eqiad# tools network-instance default protocols bgp neighbor 10.64.128.17 reset-peer - T418772 [15:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:37] T418772: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772 [15:12:07] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-02-25-124326 to 2026-03-04-123739 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247999 (https://phabricator.wikimedia.org/T413728) (owner: 10Jforrester) [15:13:12] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:13:17] (03PS2) 10Gehel: wdqs: sort federation allow list alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1248023 [15:13:44] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:14:32] !log jforrester@deploy2002 helmfile [codfw] START 
helmfile.d/services/wikifunctions: apply [15:15:13] (03PS1) 10Jforrester: wikifunctions: Run check-wf-services for v2 in prod too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248025 (https://phabricator.wikimedia.org/T414589) [15:15:27] (03CR) 10Jforrester: [C:03+2] wikifunctions: Run check-wf-services for v2 in prod too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248025 (https://phabricator.wikimedia.org/T414589) (owner: 10Jforrester) [15:15:37] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:15:47] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11673739 (10elukey) @Jdforrester-WMF something odd that I found while looking at the WF dashoboard: [[ https://grafana... [15:15:49] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:16:13] !log ayounsi@cumin1003 conftool action : set/pooled=yes; selector: name=cirrussearch1122.eqiad.wmnet [15:16:24] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:16:27] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1115.eqiad.wmnet [reason: T418772 - BGP maintenance] [15:16:29] !log ayounsi@cumin1003 conftool action : set/pooled=yes; selector: name=cirrussearch1121.eqiad.wmnet [15:16:30] T418772: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772 [15:16:38] !log ayounsi@cumin1003 conftool action : set/pooled=yes; selector: name=cirrussearch1120.eqiad.wmnet [15:17:36] (03Merged) 10jenkins-bot: wikifunctions: Run check-wf-services for v2 in prod too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248025 (https://phabricator.wikimedia.org/T414589) (owner: 10Jforrester) [15:18:49] (03PS1) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1248027 [15:19:15] !log cgoubert@cumin1003 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [15:21:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1218.eqiad.wmnet [15:21:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1219.eqiad.wmnet [15:22:03] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11673768 (10ssingh) Per T410411, we no longer need at least `pybal-high-traffic1-ulsfo.wikimedia.org` and `pybal-high-traffic2-ulsfo.wikimedia.org` in u... [15:22:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11673770 (10ops-monitoring-bot) Host an-worker1219.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [15:22:15] !log cgoubert@cumin1003 END (ERROR) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=97) rolling restart_daemons on A:swift-fe-eqiad [15:22:17] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [15:22:27] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [15:24:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11673773 (10ssingh) >>! In T418993#11672854, @ayounsi wrote: > Note that this is partly blocked on {T418971} > > As we're planning on using 198.35.26.96/27 for `public1-virtual-ulsfo`... 
[15:24:35] !log cgoubert@cumin1003 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe10[14-24].*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [15:24:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:26:06] (03CR) 10CI reject: [V:04-1] toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (owner: 10Andrew Bogott) [15:26:22] (03PS2) 10Jforrester: wikifunctions: Add required metadata fields to chart definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237904 (https://phabricator.wikimedia.org/T412693) [15:26:29] (03CR) 10Jforrester: [C:03+2] wikifunctions: Add required metadata fields to chart definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237904 (https://phabricator.wikimedia.org/T412693) (owner: 10Jforrester) [15:27:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ... 
[15:27:51] 445419311 80ms 10Gbps wave) {#2013}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F1 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [15:28:27] !incidents [15:28:28] 7711 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [15:28:28] 7712 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:28:28] 7710 (RESOLVED) Host db1232 (paged) [15:28:28] 7709 (RESOLVED) Host db1223 (paged) [15:28:28] 7705 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [15:28:29] 7704 (RESOLVED) ProbeDown sre (208.80.153.224 ip4 text-https:443 probes/service http_text-https_ip4 codfw) [15:28:42] (03Merged) 10jenkins-bot: wikifunctions: Add required metadata fields to chart definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237904 (https://phabricator.wikimedia.org/T412693) (owner: 10Jforrester) [15:29:23] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe10[14-24].*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [15:29:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1500) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1530) [15:31:23] !log aqu@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [15:32:51] !log aqu@deploy2002 
helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [15:33:30] (03PS2) 10Ssingh: wikimedia.org/wikipedia.org: bump TTL for NS and glue records [dns] - 10https://gerrit.wikimedia.org/r/1247626 [15:33:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1219.eqiad.wmnet [15:34:18] (03PS4) 10Fabfur: hiera: cleanup per-dc hiera files for haproxy30 [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) [15:34:55] 06SRE: Most photo thumbnails on officewiki "Contact list" do not display (HTTP 429) - https://phabricator.wikimedia.org/T419023 (10matmarex) 03NEW [15:35:08] (03CR) 10Fabfur: hiera: cleanup per-dc hiera files for haproxy30 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [15:35:11] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [15:36:02] (03CR) 10Ssingh: "rebase, no code change" [dns] - 10https://gerrit.wikimedia.org/r/1247626 (owner: 10Ssingh) [15:36:04] (03CR) 10Ssingh: [C:03+2] wikimedia.org/wikipedia.org: bump TTL for NS and glue records [dns] - 10https://gerrit.wikimedia.org/r/1247626 (owner: 10Ssingh) [15:36:15] !log sukhe@dns1004 START - running authdns-update [15:37:41] !log sukhe@dns1004 END - running authdns-update [15:38:34] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 3 others: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11673841 (10Ladsgroup) 05Open→03Resolved [15:39:02] 06SRE: Most photo thumbnails on officewiki "Contact list" do not display (HTTP 429) - https://phabricator.wikimedia.org/T419023#11673849 (10Ladsgroup) →14Duplicate dup:03T418323 [15:39:12] FIRING: CertAlmostExpired: Certificate for service 
lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:39:53] 06SRE: Most photo thumbnails on officewiki "Contact list" do not display (HTTP 429) - https://phabricator.wikimedia.org/T419023#11673859 (10Ladsgroup) To emphasize: This is standard size thumb so it's not due to {T414805} it's another rate limit (that I have no idea why) [15:43:28] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11673869 (10elukey) @Jdforrester-WMF @ecarg I compared some other graphs: https://w.wiki/JAR3 So afaics the overall t... [15:46:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[56] - https://phabricator.wikimedia.org/T418903#11673877 (10RobH) This should indeed be for 4 servers, anyhostnames you like I made some assumptions!
So this should be 4 hosts [15:46:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[56] - https://phabricator.wikimedia.org/T418903#11673880 (10RobH) [15:46:59] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11673879 (10ayounsi) cirrussearch repooled [15:49:03] (03CR) 10JHathaway: [C:03+1] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [15:52:44] (03CR) 10JHathaway: [C:03+1] Add separate puppetserver hooks for the private repo [puppet] - 10https://gerrit.wikimedia.org/r/1247976 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:57:28] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11673942 (10elukey) Better view https://w.wiki/JARx This is basically showing up a ratio using rate() and 1h interval... [15:57:45] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11673943 (10Jdforrester-WMF) It's possible that some bot was running a lot of requests against us and got blocked? Cer... 
[15:57:57] (03CR) 10RLazarus: [C:03+2] mw-experimental: Increase tracing sampling from 1% to 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247706 (owner: 10RLazarus) [15:59:36] !log push pfw policies - T418402 [15:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] Deploy window DC Switchover Live Test (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1600) [16:00:07] (03Merged) 10jenkins-bot: mw-experimental: Increase tracing sampling from 1% to 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247706 (owner: 10RLazarus) [16:00:22] (03PS1) 10Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248035 [16:00:32] (03CR) 10Vgutierrez: [C:03+1] hiera: cleanup per-dc hiera files for haproxy30 [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [16:00:44] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11673959 (10Jdforrester-WMF) This metric is on the MW<->orchestrator connection, which is normally dominated by non-ev... [16:01:03] (03CR) 10Fabfur: [C:03+2] hiera: cleanup per-dc hiera files for haproxy30 [puppet] - 10https://gerrit.wikimedia.org/r/1247992 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [16:01:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/4 (Transport: cr1-drmrs:xe-0/1/1 (Telxius, ... 
[16:01:51] CRT-008647 86ms 10G wave) {#5229}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F4 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:02:15] !ack [16:02:16] 7713 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Transport: cr1-drmrs:xe-0/1/1 (Telxius, CRT-008647 86ms 10G wave) {#5229} xe-3/2/4 gnmi eqiad) [16:06:24] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from eqiad to codfw [16:06:41] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from eqiad to codfw [16:07:56] hey folks, starting the switchover live test now [16:08:54] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:15] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm [16:10:36] !log remove ganeti4005 from ganeti/ulsfo cluster T418993 [16:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:39] T418993: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993 [16:10:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [16:10:54] as part of the live test, we're testing out the scap locking functionality, so scap will be locked for the duration [16:11:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11674019 
(10ops-monitoring-bot) Draining ganeti4006.ulsfo.wmnet of running VMs [16:11:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/2/4 (Transport: cr1-drmrs:xe-0/1/1 (Telxius, ... [16:11:51] CRT-008647 86ms 10G wave) {#5229}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqiad:9804&var-interface=xe-3%2F2%2F4 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:12:15] PROBLEM - ganeti-noded running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:12:42] !ack [16:12:43] no value provided for parameter incident and no default available [16:12:43] All incidents are already acked. [16:13:11] PROBLEM - ganeti-confd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:13:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [16:13:43] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-lock-scap for datacenter switchover from eqiad to codfw [16:13:45] !log root@deploy2002 Locking from deployment [ALL REPOSITORIES]: Datacenter switchover from eqiad to codfw - T418133 [16:13:47] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-lock-scap (exit_code=0) for datacenter switchover from eqiad to codfw [16:13:49] T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T418133 [16:13:50] FIRING: ProbeDown: Service ganeti4005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:10] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from eqiad to codfw [16:14:15] !log upgrading cloudservices* to Bird 2.18 T413740 [16:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:18] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740 [16:16:29] 06SRE, 10LDAP-Access-Requests: Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029 (10EBernhardson) 03NEW [16:17:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11674042 (10ayounsi) >>! In T418993#11673773, @ssingh wrote: >>>! In T418993#11672854, @ayounsi wrote: >> Note that this is partly blocked on {T418971} >> >> As we're planning on using... [16:17:55] (03PS1) 10Kamila Součková: benthos: add chart metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248039 (https://phabricator.wikimedia.org/T412693) [16:19:17] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:19:59] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from eqiad to codfw [16:20:16] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from eqiad to codfw [16:20:16] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw [16:20:31] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
aux-k8s-worker1007.eqiad.wmnet with reason: host reimage [16:22:20] (03CR) 10BCornwall: [C:03+1] wdqs: remove query-legacy-full.wikidata.org - end of life [dns] - 10https://gerrit.wikimedia.org/r/1247926 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [16:22:41] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from eqiad to codfw [16:22:42] !log blake@cumin1003 [DRY-RUN] MediaWiki read-only period starts at: 2026-03-04 16:22:41.755892 [16:23:03] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from eqiad to codfw [16:23:09] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from eqiad to codfw [16:23:53] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from eqiad to codfw [16:24:00] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from eqiad to codfw [16:24:14] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from eqiad to codfw [16:24:20] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from eqiad to codfw [16:24:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage [16:24:27] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from eqiad to codfw [16:24:33] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from eqiad to codfw [16:24:40] !log blake@cumin1003 [DRY-RUN] MediaWiki read-only period ends at: 2026-03-04 16:24:40.502004 [16:24:42] !log blake@cumin1003 END (PASS) - Cookbook 
sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from eqiad to codfw [16:25:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11674075 (10MoritzMuehlenhoff) [16:25:23] 06SRE, 06Traffic, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11674077 (10Naruse_shiroha) Nah, you linked to the wrong number. 9848 is for ECH configuration distribution, ECH itself is in 9849. https://datat... [16:25:29] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from eqiad to codfw [16:25:30] !log root@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: sync [16:25:53] !log root@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: sync [16:25:55] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from eqiad to codfw [16:26:03] 06SRE, 06Traffic, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11674078 (10Naruse_shiroha) [16:26:03] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from eqiad to codfw [16:26:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11674079 (10MoritzMuehlenhoff) [16:26:04] !log root@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:26:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11674082 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [16:26:08] !log root@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: 
apply [16:26:10] !log root@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [16:26:16] 06SRE, 06Traffic, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11674081 (10Naruse_shiroha) [16:26:18] !log root@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [16:26:20] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from eqiad to codfw [16:26:27] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from eqiad to codfw [16:26:42] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11674086 (10ayounsi) >>! In T408892#11671699, @Papaul wrote: > @ayounsi prior of deleting the sandbox1-ulsfo range 198.35.26.240/28 I will have to delete the... [16:26:59] (03PS1) 10CDobbins: Revert "cp2047: Disable performance tweaks" [puppet] - 10https://gerrit.wikimedia.org/r/1248041 [16:27:04] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from eqiad to codfw [16:27:12] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from eqiad to codfw [16:27:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [16:27:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [16:27:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11674090 (10ops-monitoring-bot) Draining ganeti4006.ulsfo.wmnet of running VMs [16:28:01] (03CR) 10Kamila Součková: [C:03+1] shellbox: Pick up newly rebuilt images [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1248035 (owner: 10Scott French) [16:28:39] 06SRE, 10LDAP-Access-Requests: Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11674093 (10Gehel) `ops` gives root access across all hosts, so that might be a bit too much to go through validation. We might want to introduce a new group (`dse-k8s-admins`?) with more restricted... [16:28:50] RESOLVED: ProbeDown: Service ganeti4005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:53] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11674094 (10Gehel) [16:33:54] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:13] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:36:49] (03CR) 10BCornwall: [C:03+1] Revert "cp2047: Disable performance tweaks" [puppet] - 10https://gerrit.wikimedia.org/r/1248041 (owner: 10CDobbins) [16:39:06] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from eqiad to codfw [16:39:21] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-unlock-scap for datacenter switchover from 
eqiad to codfw [16:39:22] (03CR) 10Federico Ceratto: "I'm using the puppet compiler from the CR to find out and I also found hieradata/regex.yaml - the regex on the `db` hostnames is too" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [16:39:22] !log root@deploy2002 Forcefully removing global lock: Datacenter switchover from eqiad to codfw - T418133 [16:39:23] !log root@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter switchover from eqiad to codfw - T418133 (duration: 25m 37s) [16:39:24] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-unlock-scap (exit_code=0) for datacenter switchover from eqiad to codfw [16:39:26] T418133: Northward Datacenter Switchover Live Test (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T418133 [16:39:50] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm [16:43:03] (03CR) 10CDobbins: [C:03+2] Revert "cp2047: Disable performance tweaks" [puppet] - 10https://gerrit.wikimedia.org/r/1248041 (owner: 10CDobbins) [16:44:56] the live test is complete and scap is unlocked - thanks! [16:45:17] Great job bjensen [16:45:23] \i/ [16:45:28] :D [16:45:29] \m/ [16:45:33] nice job as always folks [16:45:58] ggwp bjensen! [16:47:27] (03PS1) 10Daniel Kinzler: rest-gateway: adjust rate limits as discussed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248046 [16:47:52] (03CR) 10Marostegui: "Mmm now that I see it those regex aren't capturing db12* and db22* which we have already, so that needs fixing.
I guess that's the main is" [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [16:48:18] (03PS1) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [16:49:37] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: expose x-wmf-ratelimit-class in response [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) (owner: 10Daniel Kinzler) [16:51:21] (03PS2) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 [16:51:21] (03PS2) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [16:51:59] (03Merged) 10jenkins-bot: rest gateway: expose x-wmf-ratelimit-class in response [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) (owner: 10Daniel Kinzler) [16:53:59] (03PS3) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [16:54:02] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:54:13] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:54:27] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:55:02] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:55:55] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:56:56] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [16:57:30] (03CR) 10CI reject: [V:04-1] toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (owner: 10Andrew Bogott) [17:00:02] (03CR) 10Cwhite: [C:03+2] statsite: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247913 (owner: 10Muehlenhoff) [17:00:05] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [17:00:21] (03PS3) 10Muehlenhoff: statsite: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247913 [17:01:17] (03CR) 10Cwhite: [C:03+2] statsite: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247913 (owner: 10Muehlenhoff) [17:02:23] (03CR) 10Gmodena: [C:03+1] wdqs: sort federation allow list alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1248023 (owner: 10Gehel) [17:02:51] (03CR) 10Gmodena: [C:03+1] wdqs: remove duplicate endpoints in federation allow list [puppet] - 10https://gerrit.wikimedia.org/r/1248022 (https://phabricator.wikimedia.org/T417573) (owner: 10Gehel) [17:04:06] (03CR) 10Dzahn: [C:03+1] gerrit: limit access to http/https/ssh in firewall [puppet] - 10https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [17:06:06] (03PS6) 10Dzahn: gerrit: limit access to http/https/ssh in firewall [puppet] - 10https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) [17:06:12] (03CR) 10Dzahn: gerrit: limit access to http/https/ssh in firewall (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [17:06:44] (03PS3) 10Federico Ceratto: mariadb: fix regexp in hieradata/regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) [17:06:48] (03CR) 10Dzahn: [C:03+1] 
gerrit: limit access to http/https/ssh in firewall [puppet] - 10https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [17:09:33] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: adjust rate limits as discussed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248046 (owner: 10Daniel Kinzler) [17:09:49] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2047.codfw.wmnet with OS trixie [17:10:15] (03PS3) 10Dzahn: zuul::main: build full chain of trust for Java Netty TLS [puppet] - 10https://gerrit.wikimedia.org/r/1244969 (https://phabricator.wikimedia.org/T395938) [17:11:44] (03Merged) 10jenkins-bot: rest-gateway: adjust rate limits as discussed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248046 (owner: 10Daniel Kinzler) [17:13:01] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:14:04] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1244969/8195/zuul2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1244969 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:15:02] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:15:55] (03PS3) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 [17:15:55] (03PS4) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [17:18:09] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:18:37] (03PS4) 10Dzahn: zookeeper::server: allow Hiera to override $extra_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) [17:18:52] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:21:46] (03CR) 10CI reject: [V:04-1] toolforge etcdctl: update cert 
flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (owner: 10Andrew Bogott) [17:23:06] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [17:23:40] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:23:45] (03CR) 10Dzahn: [V:04-1] "parameter 'extra_java_opts' expects a String value, got Undef" [puppet] - 10https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:23:58] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:25:48] (03PS5) 10Dzahn: zookeeper::server: allow Hiera to override $extra_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) [17:27:56] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2047.codfw.wmnet with reason: host reimage [17:30:35] (03PS4) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 [17:30:35] (03PS5) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [17:33:03] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2047.codfw.wmnet with reason: host reimage [17:33:07] (03CR) 10Dzahn: [V:03+1] "noop - https://puppet-compiler.wmflabs.org/output/1244939/8198/" [puppet] - 10https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:34:05] (03CR) 10Dzahn: [C:03+2] zookeeper::server: allow Hiera to override $extra_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:34:41] (03PS2) 10Muehlenhoff: Add library hint for mbedtls [puppet] - 10https://gerrit.wikimedia.org/r/1247991 
[17:35:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [17:36:44] (03CR) 10CI reject: [V:04-1] toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (owner: 10Andrew Bogott) [17:37:05] (03CR) 10Dzahn: [C:03+2] "noop confirmed: an-conf1004, druid1011, flink-zk2001,.." [puppet] - 10https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:37:21] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [17:39:12] (03PS2) 10Dzahn: zuul::main: add debugging extra_java_opts: "-Djavax.net.debug=ssl,handshake" [puppet] - 10https://gerrit.wikimedia.org/r/1244944 (https://phabricator.wikimedia.org/T395938) [17:39:58] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] CategoryViewer: Fall back to empty string in case of missing nextpage [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248011 (https://phabricator.wikimedia.org/T418934) (owner: 10Zabe) [17:45:02] (03CR) 10Marostegui: [C:03+1] mariadb: fix regexp in hieradata/regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [17:47:33] (03CR) 10Dzahn: "hrmm... we would remove other extra_opts added for prometheus.. but I don't want to refactor everything - it's currently an anti-pattern -" [puppet] - 10https://gerrit.wikimedia.org/r/1244944 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:50:39] 06SRE, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11674567 (10BTullis) > Most recent example would be evaluating/mitigating over-aggressive default readahead settings in dse-k8s. For reference, I'm wo... 
[17:51:50] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11674578 (10taavi) [17:54:30] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2047.codfw.wmnet with OS trixie [17:58:09] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11674647 (10Dzahn) There is both, an LDAP group called "ops" (does many things, for example right to merge things in operations/puppet, many web-based... [17:59:59] (03PS5) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 [17:59:59] (03PS6) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [18:00:03] (03PS3) 10Jasmine: install_server: use UEFI for new control plane nodes wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861) [18:00:05] swfrench-wmf: gettimeofday() says it's time for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1800) [18:01:30] (03CR) 10Jasmine: install_server: use UEFI for new control plane nodes wikikube-ctrl200[4-5] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [18:01:44] o/ [18:02:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11674657 (10ssingh) >>! In T418993#11674042, @ayounsi wrote: >>>! In T418993#11673773, @ssingh wrote: >>>>! In T418993#11672854, @ayounsi wrote: >>> Note that this is partly blocked on... 
[18:04:32] (03CR) 10Scott French: [C:03+2] shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248035 (owner: 10Scott French) [18:06:01] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [18:06:20] (03CR) 10CI reject: [V:04-1] toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (owner: 10Andrew Bogott) [18:06:23] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11674673 (10taavi) [18:06:52] (03Merged) 10jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248035 (owner: 10Scott French) [18:09:39] (03PS1) 10Btullis: Remove the cpufrequtils class from the hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1248070 (https://phabricator.wikimedia.org/T415002) [18:10:40] FIRING: [2x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:11:02] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8204/co" [puppet] - 10https://gerrit.wikimedia.org/r/1248070 (https://phabricator.wikimedia.org/T415002) (owner: 10Btullis) [18:11:07] (03CR) 10Btullis: Remove the cpufrequtils class from the hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1248070 (https://phabricator.wikimedia.org/T415002) (owner: 10Btullis) [18:12:07] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [18:12:34] !log swfrench@deploy2002 helmfile [staging] DONE 
helmfile.d/services/shellbox: apply [18:13:06] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [18:13:19] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [18:13:51] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [18:14:04] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [18:14:36] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:15:09] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:15:39] (03PS1) 10Dzahn: zookeeper::server: if extra_java_opts are set, keep prometheus opts [puppet] - 10https://gerrit.wikimedia.org/r/1248072 (https://phabricator.wikimedia.org/T395938) [18:15:41] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [18:16:00] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [18:16:31] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [18:16:42] (03CR) 10Dzahn: [C:03+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1248072" [puppet] - 10https://gerrit.wikimedia.org/r/1244939 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:16:51] (03PS1) 10Muehlenhoff: Record LDAP access for nicholasperry [puppet] - 10https://gerrit.wikimedia.org/r/1248073 [18:16:57] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [18:20:01] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for nicholasperry [puppet] - 10https://gerrit.wikimedia.org/r/1248073 (owner: 10Muehlenhoff) [18:23:18] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop - https://puppet-compiler.wmflabs.org/output/1248072/8205/" [puppet] - 10https://gerrit.wikimedia.org/r/1248072 (https://phabricator.wikimedia.org/T395938) 
(owner: 10Dzahn) [18:24:31] (03PS3) 10Dzahn: zuul::main: add debugging extra_java_opts: "-Djavax.net.debug=ssl,handshake" [puppet] - 10https://gerrit.wikimedia.org/r/1244944 (https://phabricator.wikimedia.org/T395938) [18:26:17] (03CR) 10Dzahn: [C:03+1] gerrit: dns cache wipe update [cookbooks] - 10https://gerrit.wikimedia.org/r/1247534 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [18:28:32] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1244944/8206/" [puppet] - 10https://gerrit.wikimedia.org/r/1244944 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:31:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247651 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [18:31:23] (03PS6) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 [18:31:23] (03PS7) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [18:32:22] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:32:52] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:36:34] (03PS39) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [18:36:39] (03CR) 10CDobbins: prometheus: add pooled host check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:37:29] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:37:33] !log swfrench@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/shellbox-syntaxhighlight: apply [18:37:34] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [18:37:56] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [18:38:07] (03CR) 10CI reject: [V:04-1] toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (owner: 10Andrew Bogott) [18:38:34] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [18:39:05] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [18:39:28] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new, 07Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11674820 (10Jdforrester-WMF) [18:39:36] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [18:39:52] (03PS1) 10Mmartorana: Confirmemail: Log delay between email sent and confirmation [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248075 (https://phabricator.wikimedia.org/T415902) [18:40:08] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [18:40:24] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [18:40:56] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:40:59] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:41:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: ms-fe1013 reports a backplane error - https://phabricator.wikimedia.org/T419010#11674834 (10VRiley-WMF) a:03VRiley-WMF [18:41:30] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply 
[18:41:55] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [18:42:26] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [18:43:29] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [18:45:19] (03PS2) 10Dzahn: zookeeper: support TLS by loading Netty jars into class path [puppet] - 10https://gerrit.wikimedia.org/r/1244927 (https://phabricator.wikimedia.org/T395938) [18:46:34] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [18:47:16] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS trixie [18:47:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [18:47:49] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [18:48:25] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [18:48:57] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [18:49:11] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [18:49:42] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:49:58] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:50:30] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [18:50:41] (03CR) 10Dzahn: [V:04-1] "Error: Could not call 'find' on 'catalog': no parameter named 'ensure'" [puppet] - 10https://gerrit.wikimedia.org/r/1244927 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:50:54] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [18:51:25] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [18:52:36] (03CR) 10CI reject: [V:04-1] 
Confirmemail: Log delay between email sent and confirmation [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248075 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [18:52:39] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [18:53:15] (03PS3) 10Dzahn: zookeeper: support TLS by loading Netty jars into class path [puppet] - 10https://gerrit.wikimedia.org/r/1244927 (https://phabricator.wikimedia.org/T395938) [18:53:56] (03CR) 10Mmartorana: "recheck" [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248075 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [18:58:14] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp204[45678].* [19:00:04] jeena and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T1900). [19:01:42] (03CR) 10Joal: [C:03+1] Remove the cpufrequtils class from the hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1248070 (https://phabricator.wikimedia.org/T415002) (owner: 10Btullis) [19:02:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248075 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [19:02:38] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2043.codfw.wmnet with reason: host reimage [19:03:45] Holding the train due to a blocker [19:03:49] (03PS1) 10Clare Ming: Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1248080 (https://phabricator.wikimedia.org/T418614) [19:03:54] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1244927/8208/" [puppet] - 
10https://gerrit.wikimedia.org/r/1244927 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:04:30] (03PS1) 10Clare Ming: Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248081 (https://phabricator.wikimedia.org/T418614) [19:04:44] (03CR) 10Dzahn: [C:03+2] zookeeper: support TLS by loading Netty jars into class path [puppet] - 10https://gerrit.wikimedia.org/r/1244927 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:04:54] !log brett@puppetserver1001 conftool action : set/weight=100; selector: name=cp204[45678].* [19:04:57] (03Abandoned) 10Clare Ming: Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248081 (https://phabricator.wikimedia.org/T418614) (owner: 10Clare Ming) [19:06:07] (03Restored) 10Clare Ming: Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248081 (https://phabricator.wikimedia.org/T418614) (owner: 10Clare Ming) [19:06:09] !log brett@puppetserver1001 conftool action : set/weight=1; selector: name=cp204[45678].* [19:06:37] (03PS2) 10Clare Ming: Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248081 (https://phabricator.wikimedia.org/T418614) [19:07:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248081 (https://phabricator.wikimedia.org/T418614) (owner: 10Clare Ming) [19:07:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1248080 
(https://phabricator.wikimedia.org/T418614) (owner: 10Clare Ming) [19:08:54] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1012 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:09:07] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2043.codfw.wmnet with reason: host reimage [19:09:54] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:10:35] (03CR) 10Dzahn: "keep in mind this needs a deployment of admin_ng (ignore if you already did that)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247615 (owner: 10AOkoth) [19:12:37] I will be backporting a change now to unblock the train [19:13:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248011 (https://phabricator.wikimedia.org/T418934) (owner: 10Zabe) [19:22:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11674997 (10RobH) >>! In T418018#11648884, @Papaul wrote: > @RobH like @ayounsi mentioned today everything for row A/B should be QSFP-100G CWMD4 like in... 
[19:22:16] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [19:22:34] (touching mw-experimental only, no conflict with the scap backport) [19:22:41] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [19:22:56] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [19:23:20] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [19:25:01] (03PS1) 10Dzahn: create role skeleton for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1248082 (https://phabricator.wikimedia.org/T418521) [19:27:02] (03PS1) 10Dzahn: installserver: add contint[1-2]003 to preseed regex [puppet] - 10https://gerrit.wikimedia.org/r/1248083 (https://phabricator.wikimedia.org/T418521) [19:27:45] (03CR) 10Dzahn: [C:03+2] installserver: add contint[1-2]003 to preseed regex [puppet] - 10https://gerrit.wikimedia.org/r/1248083 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:30:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2043.codfw.wmnet with OS trixie [19:30:10] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11675020 (10EBernhardson) I was thinking of access as not solving the current issue, as we have a plan forward for that, but as more of addressing poss... 
[19:32:53] (03Merged) 10jenkins-bot: CategoryViewer: Fall back to empty string in case of missing nextpage [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248011 (https://phabricator.wikimedia.org/T418934) (owner: 10Zabe) [19:33:24] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1248011|CategoryViewer: Fall back to empty string in case of missing nextpage (T418934)]] [19:33:27] T418934: TypeError: MediaWiki\Category\CategoryViewer::pagingLinks(): Argument #2 ($last) must be of type string, null given, called in /srv/mediawiki/php-1.46.0-wmf.18/includes/Category/CategoryViewer.php on line 599 - https://phabricator.wikimedia.org/T418934 [19:34:18] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2043.* [19:34:30] !log brett@puppetserver1001 conftool action : set/weight=1; selector: name=cp2043.* [19:35:27] !log jhuneidi@deploy2002 zabe, jhuneidi: Backport for [[gerrit:1248011|CategoryViewer: Fall back to empty string in case of missing nextpage (T418934)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:36:05] (03PS1) 10Ebernhardson: Introduce a Semantic Search query route and builder [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248084 (https://phabricator.wikimedia.org/T413969) [19:36:48] (03PS1) 10Ebernhardson: Wire up semantic query building [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248085 (https://phabricator.wikimedia.org/T413969) [19:37:12] (03PS2) 10Ebernhardson: Wire up semantic query building [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248085 (https://phabricator.wikimedia.org/T413969) [19:37:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248084 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [19:38:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248085 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [19:38:22] zabe: are you available to do any testing for the backport? [19:38:30] yes sure [19:39:12] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:39:41] ok visiting https://www.mediawiki.org/w/index.php?from=D&title=Category:Languages_pages/nl reproduces the error [19:39:49] and with mwdebug enabled it looks good [19:39:56] jeena: lgtm [19:40:05] great, thank you so much! 
[19:40:14] !log jhuneidi@deploy2002 zabe, jhuneidi: Continuing with sync [19:40:27] (03PS7) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 [19:40:27] (03PS8) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [19:42:06] (03PS1) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248087 (https://phabricator.wikimedia.org/T396478) [19:43:11] !log cdobbins@cumin2002 conftool action : set/pooled=yes:weight=1; selector: name=cp2049.codfw.wmnet [19:43:12] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: ms-fe1013 reports a backplane error - https://phabricator.wikimedia.org/T419010#11675073 (10VRiley-WMF) We can take a look at this. Is it okay to take it down? [19:44:10] !log cdobbins@cumin2002 conftool action : set/pooled=yes:weight=1; selector: name=cp205[0-8].codfw.wmnet [19:44:12] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248011|CategoryViewer: Fall back to empty string in case of missing nextpage (T418934)]] (duration: 10m 47s) [19:44:18] T418934: TypeError: MediaWiki\Category\CategoryViewer::pagingLinks(): Argument #2 ($last) must be of type string, null given, called in /srv/mediawiki/php-1.46.0-wmf.18/includes/Category/CategoryViewer.php on line 599 - https://phabricator.wikimedia.org/T418934 [19:44:35] (03PS2) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248087 (https://phabricator.wikimedia.org/T396478) [19:46:57] (03CR) 10Ssingh: dse-k8s: Enable active/active for dse-k8s clusters (2/2) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1248087 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [19:48:11] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [19:48:49] Now rolling train to group1 [19:49:11] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248091 (https://phabricator.wikimedia.org/T413809) [19:49:13] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248091 (https://phabricator.wikimedia.org/T413809) (owner: 10TrainBranchBot) [19:50:24] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248091 (https://phabricator.wikimedia.org/T413809) (owner: 10TrainBranchBot) [19:56:16] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.18 refs T413809 [19:56:19] T413809: 1.46.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T413809 [19:57:20] (03PS3) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248087 (https://phabricator.wikimedia.org/T396478) [20:00:10] (03CR) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1248087 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [20:02:04] (03PS3) 10Scott French: envoy: Support using envoy-drain-tool [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) [20:02:04] (03CR) 10Scott French: "Adding Reuven, who is also reviewing the initial version of the tool in [0]. Thank you, Reuven!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [20:16:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad row A/B switch upgrade - https://phabricator.wikimedia.org/T418012#11675191 (10RobH) [20:19:27] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:26:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11675231 (10RobH) >>! In T418018#11674997, @RobH wrote: >>>! In T418018#11648884, @Papaul wrote: >> @RobH like @ayounsi mentioned today everything for r... [20:28:16] 10ops-codfw, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11675234 (10Scott_French) [20:34:13] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:40:19] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11675250 (10Scott_French) [20:49:26] (03PS1) 10Bartosz Dziewoński: filebackend: Remove outdated comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248095 [20:53:30] jeena: the train is already finished, right? I'd start early then, I have two patches that could use some time between them. 
[20:53:54] tgr_: That's right, you can go ahead now [20:54:13] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:56:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248007 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [20:57:02] (03Merged) 10jenkins-bot: Fix $wgJwtSessionCookieIssuer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248007 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [20:57:33] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1248007|Fix $wgJwtSessionCookieIssuer (T415007 T418999)]] [20:57:38] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [20:57:38] T418999: Remove trailing slash in issuer for bot password JWT cookies - https://phabricator.wikimedia.org/T418999 [20:59:38] !log tgr@deploy2002 tgr: Backport for [[gerrit:1248007|Fix $wgJwtSessionCookieIssuer (T415007 T418999)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T2100). [21:00:05] cwhite, tgr, cjming, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:32] o/ [21:00:45] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Swiftly running write operations on files can result in the MediaWiki DB getting out of sync with Swift, resulting in "A non-identical file already exists at
errors" on undelete - https://phabricator.wikimedia.org/T387340#11675344 (10Ppper... [21:00:52] I started a bit early, since I have two patches that need some testing time in between [21:01:01] mine might take awhile (or maybe not, sometimes extension merges are faster), but i can go last and wait as appropriate. [21:01:23] o/ [21:02:01] o/ [21:03:01] !log tgr@deploy2002 tgr: Continuing with sync [21:04:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:07:25] mine can go out together -- presumably we're all self-deploying? i can deploy if someone needs a deployer [21:07:28] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248007|Fix $wgJwtSessionCookieIssuer (T415007 T418999)]] (duration: 09m 55s) [21:07:33] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [21:07:33] T418999: Remove trailing slash in issuer for bot password JWT cookies - https://phabricator.wikimedia.org/T418999 [21:09:04] (03Abandoned) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248087 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [21:09:11] feel free to go next [21:09:25] or I can deploy the other patches [21:09:33] (03PS1) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248097 (https://phabricator.wikimedia.org/T396478) [21:09:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:10:24] (03CR) 10CI reject: [V:04-1] dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248097 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [21:10:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1245473 (https://phabricator.wikimedia.org/T418612) (owner: 10Cwhite) [21:11:56] (03Merged) 10jenkins-bot: logging: set poolcounter channel log level to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1245473 (https://phabricator.wikimedia.org/T418612) (owner: 10Cwhite) [21:12:28] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1245473|logging: set poolcounter channel log level to info (T418612)]] [21:12:32] T418612: Audit mwlog storage and retention - https://phabricator.wikimedia.org/T418612 [21:12:59] (03PS2) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248097 (https://phabricator.wikimedia.org/T396478) [21:14:30] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253#11675412 (10VRiley-WMF) [21:14:37] !log tgr@deploy2002 tgr, cwhite: Backport for [[gerrit:1245473|logging: set poolcounter channel log level to info (T418612)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:15:14] cwhite: doesn't need testing I assume? 
[21:15:41] (03PS8) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 [21:15:41] (03PS9) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [21:17:02] tgr_: don't see any trouble on mwdebug. good to proceed [21:17:36] !log tgr@deploy2002 tgr, cwhite: Continuing with sync [21:17:41] (03CR) 10Gergő Tisza: [C:03+2] Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248081 (https://phabricator.wikimedia.org/T418614) (owner: 10Clare Ming) [21:17:42] (03CR) 10Gergő Tisza: [C:03+2] Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1248080 (https://phabricator.wikimedia.org/T418614) (owner: 10Clare Ming) [21:18:47] tgr: thanks! [21:21:33] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1245473|logging: set poolcounter channel log level to info (T418612)]] (duration: 09m 04s) [21:21:37] T418612: Audit mwlog storage and retention - https://phabricator.wikimedia.org/T418612 [21:22:24] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [21:24:13] (03Merged) 10jenkins-bot: Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248081 (https://phabricator.wikimedia.org/T418614) (owner: 10Clare Ming) [21:24:15] (03Merged) 10jenkins-bot: Add synthetic AAA experiment [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1248080 (https://phabricator.wikimedia.org/T418614) (owner: 10Clare Ming) [21:26:52] (03PS10) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [21:27:11] !log tgr@deploy2002 Started scap 
sync-world: Backport for [[gerrit:1248081|Add synthetic AAA experiment (T418614)]], [[gerrit:1248080|Add synthetic AAA experiment (T418614)]] [21:27:15] T418614: Run a Synthetic A/A/A test (JS + PHP SDKs, Minerva only) - https://phabricator.wikimedia.org/T418614 [21:29:14] !log tgr@deploy2002 cjming, tgr: Backport for [[gerrit:1248081|Add synthetic AAA experiment (T418614)]], [[gerrit:1248080|Add synthetic AAA experiment (T418614)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:29:32] wow that was fast [21:30:28] !log bking@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1028.eqiad.wmnet [21:31:35] gtg [21:32:26] (03PS1) 10Scott French: Add new conf200[789] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1248108 (https://phabricator.wikimedia.org/T418914) [21:32:30] !log tgr@deploy2002 cjming, tgr: Continuing with sync [21:32:50] tysm! [21:33:39] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [21:35:10] !log bking@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1028.eqiad.wmnet [21:36:22] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248081|Add synthetic AAA experiment (T418614)]], [[gerrit:1248080|Add synthetic AAA experiment (T418614)]] (duration: 09m 11s) [21:36:26] T418614: Run a Synthetic A/A/A test (JS + PHP SDKs, Minerva only) - https://phabricator.wikimedia.org/T418614 [21:39:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248012 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [21:40:21] (03Merged) 10jenkins-bot: Enable JWT session cookie for bot passwords (all wikis) (attempt #3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248012 
(https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [21:40:48] (03CR) 10Ssingh: [C:03+1] "Looks good. We can go over it once again before the actual deploy." [dns] - 10https://gerrit.wikimedia.org/r/1248097 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [21:40:52] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1248012|Enable JWT session cookie for bot passwords (all wikis) (attempt #3) (T415007 T418999)]] [21:40:59] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [21:41:00] T418999: Remove trailing slash in issuer for bot password JWT cookies - https://phabricator.wikimedia.org/T418999 [21:41:46] (03CR) 10Gergő Tisza: [C:03+2] Introduce a Semantic Search query route and builder [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248084 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:41:47] (03CR) 10Gergő Tisza: [C:03+2] Wire up semantic query building [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248085 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:42:06] tgr_: thanks! [21:43:00] !log tgr@deploy2002 tgr: Backport for [[gerrit:1248012|Enable JWT session cookie for bot passwords (all wikis) (attempt #3) (T415007 T418999)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:44:05] !log tgr@deploy2002 tgr: Continuing with sync [21:46:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad row A/B switch upgrade - https://phabricator.wikimedia.org/T418012#11675615 (10RobH) [21:46:58] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on dse-k8s-worker1028.eqiad.wmnet with reason: broken networking [21:47:56] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248012|Enable JWT session cookie for bot passwords (all wikis) (attempt #3) (T415007 T418999)]] (duration: 07m 05s) [21:48:18] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007 [21:48:18] T418999: Remove trailing slash in issuer for bot password JWT cookies - https://phabricator.wikimedia.org/T418999 [21:48:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad row A/B switch upgrade - https://phabricator.wikimedia.org/T418012#11675636 (10RobH) [21:48:51] (03CR) 10Kamila Součková: [C:03+1] Add new conf200[789] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1248108 (https://phabricator.wikimedia.org/T418914) (owner: 10Scott French) [21:49:37] (03CR) 10Dzahn: [C:03+1] "not deploying it right now but if anyone feels like just doing it - you can do it" [puppet] - 10https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [21:51:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248084 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:51:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248085 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:51:58] (03Merged) 
10jenkins-bot: Introduce a Semantic Search query route and builder [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248084 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:54:07] (03Merged) 10jenkins-bot: Wire up semantic query building [extensions/CirrusSearch] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248085 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:54:39] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1248084|Introduce a Semantic Search query route and builder (T413969)]], [[gerrit:1248085|Wire up semantic query building (T413969)]] [21:54:42] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T2200) [22:10:14] (03PS1) 10Dzahn: ci::website: support 2 different websites, integration vs zuul-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1248118 (https://phabricator.wikimedia.org/T418521) [22:10:40] FIRING: [2x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:10:48] (03CR) 10CI reject: [V:04-1] ci::website: support 2 different websites, integration vs zuul-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1248118 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [22:13:10] (03PS4) 10Bking: dse-k8s-ingress: Enable active-active [puppet] - 10https://gerrit.wikimedia.org/r/1238832 (https://phabricator.wikimedia.org/T396478) [22:14:24] !log tgr@deploy2002 tgr, ebernhardson: Backport for [[gerrit:1248084|Introduce a Semantic Search query route and builder (T413969)]], [[gerrit:1248085|Wire up semantic query building (T413969)]] synced to the testservers 
(see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:14:28] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [22:14:44] checking [22:15:02] (03PS3) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248097 (https://phabricator.wikimedia.org/T396478) [22:16:10] !log tgr@deploy2002 tgr, ebernhardson: Continuing with sync [22:17:23] (03PS5) 10Bking: dse-k8s-ingress: Enable active-active [puppet] - 10https://gerrit.wikimedia.org/r/1238832 (https://phabricator.wikimedia.org/T396478) [22:18:03] (03CR) 10CI reject: [V:04-1] dse-k8s-ingress: Enable active-active [puppet] - 10https://gerrit.wikimedia.org/r/1238832 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [22:19:08] uh, what? pretty sure I didn't press anything [22:19:15] (03PS4) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (2/2) [dns] - 10https://gerrit.wikimedia.org/r/1248097 (https://phabricator.wikimedia.org/T396478) [22:19:23] tgr_: i pushed the button, i guess i should have said something [22:19:33] it was ready for checking so i verified and then continued [22:19:45] oh, didn't realize you can do that [22:19:59] i'm not sure if it's just because it's my patch, or if generally anyone can push the button [22:20:11] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11675775 (10Ladsgroup) >>! In T414805#11668230, @Ladsgroup wrote: > Top "file formats" for the non-standard sizes with enwiki as refe... [22:20:59] Anyone with SpiderPig rights *can* interact with the sessions as they're happening. [22:21:07] But that doesn't mean they *should*. :-) [22:21:24] seems accident-prone [22:21:27] oh well [22:21:33] It's more for "take over James's deploy as his laptop just fell off the Internet".
[22:21:43] fair enough, i'll be a bit more patient next time [22:21:51] I guess that makes sense [22:22:25] I'm not complaining, I just didn't understand what's happening [22:22:32] * James_F nods. [22:22:48] Maybe there should be an "are you sure?" prompt for non-self-started actions? [22:22:50] * James_F files. [22:25:10] T419084 [22:25:10] T419084: Ask users if they're sure they want to interact with a SpiderPig session's button when they didn't initiate it - https://phabricator.wikimedia.org/T419084 [22:33:07] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248084|Introduce a Semantic Search query route and builder (T413969)]], [[gerrit:1248085|Wire up semantic query building (T413969)]] (duration: 38m 28s) [22:33:10] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [22:35:59] !log UTC late deploys done [22:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:15] I wonder what was the difference between the TestKitchen backports that took 10 min and the CirrusSearch backports that took 40 [22:40:41] that is curious, no great ideas [22:41:48] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1248084 caused a full i18n rebuild which takes far longer to sync out [22:42:25] That'd do it, yes. [22:42:52] Maybe SpiderPig should look for any changes to .*/i18n/.* and triple-check if you're sure it needs backporting. 
:-) [22:43:05] lol [22:43:12] (03PS1) 10Zabe: NewFilesPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248123 (https://phabricator.wikimedia.org/T419062) [22:45:31] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-04-123739 to 2026-03-04-220825 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248124 (https://phabricator.wikimedia.org/T416756) [22:45:42] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-04-123739 to 2026-03-04-220825 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248124 (https://phabricator.wikimedia.org/T416756) (owner: 10Jforrester) [22:46:42] (03PS1) 10Zabe: NewFilesPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1248125 (https://phabricator.wikimedia.org/T419062) [22:47:45] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-04-123739 to 2026-03-04-220825 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248124 (https://phabricator.wikimedia.org/T416756) (owner: 10Jforrester) [22:50:57] (03PS1) 10Dzahn: ci::website/ci::httpd: move monitoring to website, not httpd [puppet] - 10https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521) [22:51:15] (03CR) 10Jforrester: "Please remember to follow the development policy about commit messages. 
This doesn't "rename" anything, it drops an entire extension from " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [22:51:36] (03CR) 10CI reject: [V:04-1] ci::website/ci::httpd: move monitoring to website, not httpd [puppet] - 10https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [22:54:35] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:55:01] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:55:20] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [22:55:55] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [22:55:57] (03PS2) 10Dzahn: ci::website/ci::httpd: move monitoring to website, not httpd [puppet] - 10https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521) [22:56:58] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [22:57:27] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [22:57:48] (03PS2) 10Dzahn: ci::website: support 2 different websites, integration vs zuul-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1248118 (https://phabricator.wikimedia.org/T418521) [22:58:40] (03PS11) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T2300) [23:04:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad A/B switch cabling documentation - https://phabricator.wikimedia.org/T418018#11675991 (10Papaul) Please see below for the spine to spine port information |Switch|Interface|Switch|Interface| 
|ssw1-a1-eqiad|ethernet-1/31|ssw1-f1-e... [23:04:49] (03CR) 10CI reject: [V:04-1] toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (owner: 10Andrew Bogott) [23:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:45:02] 10ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11676075 (10phaultfinder) [23:49:09] jouncebot: nowandnext [23:49:09] For the next 0 hour(s) and 10 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260304T2300) [23:49:09] In 7 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T0700) [23:49:09] In 7 hour(s) and 10 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T0700) [23:49:52] (03CR) 10Zabe: [C:03+2] NewFilesPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1248125 (https://phabricator.wikimedia.org/T419062) (owner: 10Zabe) [23:49:53] (03CR) 10Zabe: [C:03+2] NewFilesPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248123 (https://phabricator.wikimedia.org/T419062) (owner: 10Zabe) [23:50:06] (03PS1) 10Dzahn: zuul::main: add zuul client cert to full chain of trust [puppet] - 10https://gerrit.wikimedia.org/r/1248137 (https://phabricator.wikimedia.org/T395938) [23:59:11] (03CR) 10CI reject: [V:04-1] NewFilesPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/1248125 (https://phabricator.wikimedia.org/T419062) (owner: 10Zabe)