[00:01:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T418465)', diff saved to https://phabricator.wikimedia.org/P89419 and previous config saved to /var/cache/conftool/dbconfig/20260302-000143-marostegui.json [00:01:47] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [00:02:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1253.eqiad.wmnet with reason: Maintenance [00:02:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T418465)', diff saved to https://phabricator.wikimedia.org/P89420 and previous config saved to /var/cache/conftool/dbconfig/20260302-000208-marostegui.json [00:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:04:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T418465)', diff saved to https://phabricator.wikimedia.org/P89421 and previous config saved to /var/cache/conftool/dbconfig/20260302-000425-marostegui.json [00:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:11:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - 
asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:14:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [00:19:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P89422 and previous config saved to /var/cache/conftool/dbconfig/20260302-001933-marostegui.json [00:22:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:27:17] RESOLVED: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:32:46] FIRING: Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:34:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P89423 and previous config saved to /var/cache/conftool/dbconfig/20260302-003441-marostegui.json 
[00:38:58] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1246895 [00:38:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1246895 (owner: TrainBranchBot) [00:49:15] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1246895 (owner: TrainBranchBot) [00:49:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T418465)', diff saved to https://phabricator.wikimedia.org/P89424 and previous config saved to /var/cache/conftool/dbconfig/20260302-004950-marostegui.json [00:49:53] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [00:50:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [00:52:46] RESOLVED: Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [01:09:01] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1246897 [01:09:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1246897 (owner: TrainBranchBot) [01:17:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:25:47] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1246897 (owner: TrainBranchBot) [01:49:13] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:00:47] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:08:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:00] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1162 - https://phabricator.wikimedia.org/T418550#11661659 (Jclark-ctr) Open→Resolved [02:14:01] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 13s) [02:33:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown 
[02:40:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:50:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:51:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:16:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:19:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:20:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:24:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:25:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:29:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:34:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:38:23] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: 
asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:39:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [03:43:23] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:44:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [03:45:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:54:19] (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 1F616EMO) [03:55:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:57:51] FIRING: SwitchCoreInterfaceDown: 
Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:02:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:07:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:12:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:18:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:23:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:24:44] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:29:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:31:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:33:23] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:34:44] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown 
[04:41:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:44:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:47:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:49:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:50:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:54:44] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:54:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:55:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:59:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:59:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:08:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:13:23] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:13:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:14:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:16:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:18:23] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:18:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:19:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:21:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:28:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:33:23] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:33:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:34:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:39:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:39:44] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:42:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:49:13] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 25% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:58:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2220.codfw.wmnet with reason: Maintenance [05:58:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1181.eqiad.wmnet with reason: Maintenance [06:01:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:01:49] (PS1) Marostegui: Revert "db1162: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1246898 [06:02:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2152.codfw.wmnet with reason: Maintenance [06:02:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T418465)', diff saved to https://phabricator.wikimedia.org/P89425 and previous config saved to /var/cache/conftool/dbconfig/20260302-060245-marostegui.json [06:02:49] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [06:02:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1167.eqiad.wmnet with reason: Maintenance [06:03:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:03:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T418465)', diff saved to https://phabricator.wikimedia.org/P89426 and previous config saved to /var/cache/conftool/dbconfig/20260302-060317-marostegui.json [06:03:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2179 with weight 0 T418080', diff saved to https://phabricator.wikimedia.org/P89427 and previous config saved to /var/cache/conftool/dbconfig/20260302-060317-marostegui.json [06:03:23] T418080: Switchover s4 master (db2240 -> db2179) - https://phabricator.wikimedia.org/T418080 [06:03:46] (CR) Marostegui: [C:+2] mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1242168 (https://phabricator.wikimedia.org/T418080) (owner: Gerrit maintenance bot) [06:03:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 42 hosts with reason: Primary switchover s4 T418080 [06:04:02] (CR) Marostegui: [C:+2] Revert "db1162: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1246898 (owner: Marostegui) [06:05:24] (PS1) Marostegui: Revert "conftool: remove ats-be from cp20[43-58]" [puppet] - https://gerrit.wikimedia.org/r/1246899 [06:06:00] (CR) Marostegui: [C:+2] Revert "conftool: remove ats-be from cp20[43-58]" [puppet] - https://gerrit.wikimedia.org/r/1246899 (owner: Marostegui) [06:06:53] !log Starting s4 codfw failover from db2240 to db2179 - T418080 [06:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T418080', diff saved to https://phabricator.wikimedia.org/P89428 and previous config saved to /var/cache/conftool/dbconfig/20260302-061252-marostegui.json [06:12:56] T418080: Switchover s4 master (db2240 -> db2179) - 
https://phabricator.wikimedia.org/T418080 [06:13:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2179 to s4 primary and set section read-write T418080', diff saved to https://phabricator.wikimedia.org/P89429 and previous config saved to /var/cache/conftool/dbconfig/20260302-061316-marostegui.json [06:13:43] (CR) Marostegui: [C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1242169 (https://phabricator.wikimedia.org/T418080) (owner: Gerrit maintenance bot) [06:13:47] !log marostegui@dns1004 START - running authdns-update [06:14:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2240 T418080', diff saved to https://phabricator.wikimedia.org/P89430 and previous config saved to /var/cache/conftool/dbconfig/20260302-061428-marostegui.json [06:15:15] !log marostegui@dns1004 END - running authdns-update [06:16:21] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1244: After schema change [06:17:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2240.codfw.wmnet with reason: Maintenance [06:18:03] (PS1) Marostegui: db2240: Long schema change [puppet] - https://gerrit.wikimedia.org/r/1246901 (https://phabricator.wikimedia.org/T418080) [06:18:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:19:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T418465)', diff saved to https://phabricator.wikimedia.org/P89432 and previous config saved to /var/cache/conftool/dbconfig/20260302-061922-marostegui.json [06:19:23] (CR) Marostegui: [C:+2] db2240: Long schema change [puppet] - https://gerrit.wikimedia.org/r/1246901 (https://phabricator.wikimedia.org/T418080) (owner: Marostegui) [06:19:26] T418465: Drop cuc_agent & 
cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [06:19:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T418465)', diff saved to https://phabricator.wikimedia.org/P89433 and previous config saved to /var/cache/conftool/dbconfig/20260302-061957-marostegui.json [06:22:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:25:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:34:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P89435 and previous config saved to /var/cache/conftool/dbconfig/20260302-063430-marostegui.json [06:35:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P89436 and previous config saved to /var/cache/conftool/dbconfig/20260302-063506-marostegui.json [06:45:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:46:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:49:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P89438 and previous config saved to /var/cache/conftool/dbconfig/20260302-064938-marostegui.json [06:50:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P89439 and previous config saved to /var/cache/conftool/dbconfig/20260302-065014-marostegui.json [06:51:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:52:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:01:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) 
pool db1244: After schema change [07:04:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T418465)', diff saved to https://phabricator.wikimedia.org/P89441 and previous config saved to /var/cache/conftool/dbconfig/20260302-070447-marostegui.json [07:04:50] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:05:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2154.codfw.wmnet with reason: Maintenance [07:05:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T418465)', diff saved to https://phabricator.wikimedia.org/P89442 and previous config saved to /var/cache/conftool/dbconfig/20260302-070512-marostegui.json [07:05:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T418465)', diff saved to https://phabricator.wikimedia.org/P89443 and previous config saved to /var/cache/conftool/dbconfig/20260302-070523-marostegui.json [07:05:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:06:27] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11661728 (Marostegui) dbproxy1029 can be moved anytime too. [07:10:39] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11661729 (Tacsipacsi) >>! In T414805#11591625, @Ladsgroup wrote: > This image is in a standard size and passes through our rate lim... 
[07:20:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1172.eqiad.wmnet with reason: Maintenance [07:20:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1172 (T418465)', diff saved to https://phabricator.wikimedia.org/P89444 and previous config saved to /var/cache/conftool/dbconfig/20260302-072058-marostegui.json [07:21:01] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:22:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T418465)', diff saved to https://phabricator.wikimedia.org/P89445 and previous config saved to /var/cache/conftool/dbconfig/20260302-072224-marostegui.json [07:22:27] (PS1) Kosta Harlan: HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay [extensions/ConfirmEdit] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1246904 (https://phabricator.wikimedia.org/T418477) [07:22:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1246904 (https://phabricator.wikimedia.org/T418477) (owner: Kosta Harlan) [07:22:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:25:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:37:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P89446 and previous config saved to /var/cache/conftool/dbconfig/20260302-073732-marostegui.json [07:37:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T418465)', diff saved to https://phabricator.wikimedia.org/P89447 and previous config saved to /var/cache/conftool/dbconfig/20260302-073745-marostegui.json [07:37:48] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:48:45] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11661786 (Jelto) [07:52:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P89448 and previous config saved to /var/cache/conftool/dbconfig/20260302-075241-marostegui.json [07:52:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P89449 and previous config saved to /var/cache/conftool/dbconfig/20260302-075252-marostegui.json [07:59:13] (CR) Fabfur: [C:+2] hiera: set haproxy version to 3.0 in upload@magru [puppet] - https://gerrit.wikimedia.org/r/1244689 (https://phabricator.wikimedia.org/T417253) (owner: Fabfur) [08:00:01] (CR) Muehlenhoff: [C:+2] Enable Java 21 on build2002 [puppet] - https://gerrit.wikimedia.org/r/1245280 (https://phabricator.wikimedia.org/T418109) 
(owner: Muehlenhoff) [08:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T0800). [08:00:05] katherine_g, matthiasmullie, and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:19] o/ [08:01:31] hi [08:02:07] katherine_g: do you need someone to deploy your change? [08:02:42] kostajh: i do have spider pig access so I can go ahead if deployers are ok with that [08:03:08] katherine_g: looking at your change, wondering if making a dblist for this would be more maintainable [08:03:32] but as you already have a patch, no objections from me in shipping it [08:03:45] so yes, please go ahead and I'll deploy when you're finished [08:04:24] kostajh: agree, the only thing is they may change from each other as the communities give us feedback [08:04:24] o/ [08:04:27] starting now [08:05:37] (CR) TrainBranchBot: [C:+2] "Approved by kgraessle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) (owner: Kgraessle) [08:05:43] !log start upgrading haproxy to 3.0 on A:cp-upload_magru (T417253) [08:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:46] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [08:05:56] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru and A:cp - 3.0 upgrade () [08:06:41] (Merged) jenkins-bot: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - https://gerrit.wikimedia.org/r/1240672 (https://phabricator.wikimedia.org/T411485) (owner: 
Kgraessle) [08:07:20] !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1240672|Enable revert risk filters for first batch of wikis: < 1000 monthly edits (T411485)]] [08:07:23] T411485: Enable revert risk filters for first batch of wikis: < 1000 monthly edits - https://phabricator.wikimedia.org/T411485 [08:07:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T418465)', diff saved to https://phabricator.wikimedia.org/P89450 and previous config saved to /var/cache/conftool/dbconfig/20260302-080748-marostegui.json [08:07:52] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:08:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P89451 and previous config saved to /var/cache/conftool/dbconfig/20260302-080800-marostegui.json [08:08:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2161.codfw.wmnet with reason: Maintenance [08:08:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2161 (T418465)', diff saved to https://phabricator.wikimedia.org/P89452 and previous config saved to /var/cache/conftool/dbconfig/20260302-080813-marostegui.json [08:11:14] SRE, Infrastructure Security, Infrastructure-Foundations, LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11661859 (SLyngshede-WMF) @A_smart_kitten I think that's fair, disabling an account isn't exactly... 
[08:11:50] (CR) Silvan Heintze: [C:-1] "I think this is still missing the `namspaces` setting" [mediawiki-config] - https://gerrit.wikimedia.org/r/1245364 (https://phabricator.wikimedia.org/T414476) (owner: Itamar Givon) [08:17:39] (PS3) Fabfur: hiera: set haproxy version to 3.0 on all remaining magru hosts [puppet] - https://gerrit.wikimedia.org/r/1245274 (https://phabricator.wikimedia.org/T417253) [08:18:13] (CR) Fabfur: hiera: set haproxy version to 3.0 on all remaining magru hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1245274 (https://phabricator.wikimedia.org/T417253) (owner: Fabfur) [08:19:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbproxy1029.eqiad.wmnet with reason: Maintenance [08:22:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbproxy1028.eqiad.wmnet with reason: Maintenance [08:23:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T418465)', diff saved to https://phabricator.wikimedia.org/P89453 and previous config saved to /var/cache/conftool/dbconfig/20260302-082309-marostegui.json [08:23:12] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:23:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1177.eqiad.wmnet with reason: Maintenance [08:23:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1177 (T418465)', diff saved to https://phabricator.wikimedia.org/P89454 and previous config saved to /var/cache/conftool/dbconfig/20260302-082333-marostegui.json [08:24:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T418465)', diff saved to https://phabricator.wikimedia.org/P89455 and previous config saved to 
/var/cache/conftool/dbconfig/20260302-082414-marostegui.json [08:27:06] (PS1) Vgutierrez: cache::haproxy: Reduce hard-stop-after timeout to 1m [puppet] - https://gerrit.wikimedia.org/r/1246910 [08:28:29] (CR) Fabfur: [C:+1] cache::haproxy: Reduce hard-stop-after timeout to 1m [puppet] - https://gerrit.wikimedia.org/r/1246910 (owner: Vgutierrez) [08:29:47] (PS1) Vgutierrez: Revert^2 "conftool: remove ats-be from cp20[43-58]" [puppet] - https://gerrit.wikimedia.org/r/1246911 [08:30:03] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11661903 (Jelto) Unfortunately the provided SSH key can not be used, you copied the fingerprint and not the public key. Can you update the task with the actual public key... [08:30:17] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11661905 (Jelto) [08:30:32] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:30:37] !log kgraessle@deploy2002 kgraessle: Backport for [[gerrit:1240672|Enable revert risk filters for first batch of wikis: < 1000 monthly edits (T411485)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:30:40] T411485: Enable revert risk filters for first batch of wikis: < 1000 monthly edits - https://phabricator.wikimedia.org/T411485 [08:30:45] (CR) Vgutierrez: [C:+2] Revert^2 "conftool: remove ats-be from cp20[43-58]" [puppet] - https://gerrit.wikimedia.org/r/1246911 (owner: Vgutierrez) [08:31:31] !log kgraessle@deploy2002 kgraessle: Continuing with sync [08:32:03] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11661910 (Jelto) @Milimetric can you ping your manager to sign this access request too? [08:32:27] (CR) Joal: [C:+1] Add a /srv/spark managed directory on dse-k8s-worker nodes [puppet] - https://gerrit.wikimedia.org/r/1245413 (https://phabricator.wikimedia.org/T412925) (owner: Btullis) [08:33:34] elukey@cumin1003 provision (PID 1110409) is awaiting input [08:33:35] (CR) Vgutierrez: [C:+2] cache::haproxy: Reduce hard-stop-after timeout to 1m [puppet] - https://gerrit.wikimedia.org/r/1246910 (owner: Vgutierrez) [08:39:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P89456 and previous config saved to /var/cache/conftool/dbconfig/20260302-083922-marostegui.json [08:40:03] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11661930 (Marostegui) [08:40:05] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11661931 (MoritzMuehlenhoff) [08:40:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T418465)', diff saved to https://phabricator.wikimedia.org/P89457 and previous config saved to /var/cache/conftool/dbconfig/20260302-084010-marostegui.json 
[08:40:14] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:41:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:42:11] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dbproxy1029.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:44:27] (CR) Joal: "Small nits in code, plus a question: it seems the tests you add are for a lot more than spark-only restrictions. Is it worth adding a note" [deployment-charts] - https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: Btullis) [08:44:32] !log kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240672|Enable revert risk filters for first batch of wikis: < 1000 monthly edits (T411485)]] (duration: 37m 12s) [08:44:35] T411485: Enable revert risk filters for first batch of wikis: < 1000 monthly edits - https://phabricator.wikimedia.org/T411485 [08:45:30] !log installing libxml2 security updates [08:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:46] matthiasmullie: kostajh: over to you all, I'm done :) [08:45:53] thanks! [08:46:02] matthiasmullie: do you want to go next, or should I? [08:46:05] (CR) Joal: "Another question: what about the /usr/share/GeoIP hostPath?" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: Btullis) [08:46:12] either wfm [08:46:58] ok, I'll start then [08:47:31] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1029.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:48:16] (CR) Joal: [C:+1] Add an analytics PSP permitting access to certain hostPaths [deployment-charts] - https://gerrit.wikimedia.org/r/1245367 (https://phabricator.wikimedia.org/T405509) (owner: Btullis) [08:48:52] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru and A:cp - 3.0 upgrade () [08:49:19] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11661941 (elukey) Command used: sudo cookbook sre.hosts.provision --no-dhcp --no-users --no-switch --legacy $host For dbproxy1028 th... [08:49:54] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1246904 (https://phabricator.wikimedia.org/T418477) (owner: Kosta Harlan) [08:50:30] (CR) Joal: [C:+1] "Note: dse-k8s-codfw is mentioned in commit message but the code change doesn't touch it." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: Btullis) [08:51:30] (PS1) Tiziano Fogli: slothslos: normalize file extensions during copy [puppet] - https://gerrit.wikimedia.org/r/1247005 (https://phabricator.wikimedia.org/T414579) [08:51:33] (Merged) jenkins-bot: HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay [extensions/ConfirmEdit] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1246904 (https://phabricator.wikimedia.org/T418477) (owner: Kosta Harlan) [08:51:48] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11661946 (Marostegui) [08:51:52] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1246904|HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay (T418477)]] [08:51:55] T418477: Improve resilience of hCaptcha API URL loading to transient network issues - https://phabricator.wikimedia.org/T418477 [08:52:02] (CR) CI reject: [V:-1] slothslos: normalize file extensions during copy [puppet] - https://gerrit.wikimedia.org/r/1247005 (https://phabricator.wikimedia.org/T414579) (owner: Tiziano Fogli) [08:54:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P89458 and previous config saved to /var/cache/conftool/dbconfig/20260302-085430-marostegui.json [08:55:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P89459 and previous config saved to /var/cache/conftool/dbconfig/20260302-085519-marostegui.json [08:55:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down 
- https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:56:36] (PS2) Itamar Givon: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - https://gerrit.wikimedia.org/r/1245364 (https://phabricator.wikimedia.org/T414476) [08:57:53] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1246904|HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay (T418477)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:57:56] T418477: Improve resilience of hCaptcha API URL loading to transient network issues - https://phabricator.wikimedia.org/T418477 [08:58:37] checking [08:59:08] (CR) Matthias Mullie: [C:+2] Limit additional whitespace to sticky header version only [extensions/MobileFrontend] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245265 (https://phabricator.wikimedia.org/T416598) (owner: Matthias Mullie) [09:00:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:08] !log kharlan@deploy2002 kharlan: Continuing with sync [09:02:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:05:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1021:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:05:55] (CR) JMeybohm: [C:+1] api-gateway: Add api-gateway-ro to certificate [deployment-charts] - https://gerrit.wikimedia.org/r/1244700 (https://phabricator.wikimedia.org/T418145) (owner: Clément Goubert) [09:07:17] (PS2) Tiziano Fogli: slothslos: normalize file extensions during copy [puppet] - https://gerrit.wikimedia.org/r/1247005 (https://phabricator.wikimedia.org/T414579) [09:08:01] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1246904|HCaptchaEnterpriseHealthChecker: Add configurable retry count and delay (T418477)]] (duration: 16m 09s) [09:08:04] T418477: Improve resilience of hCaptcha API URL loading to transient network issues - https://phabricator.wikimedia.org/T418477 [09:08:15] (CR) JMeybohm: [C:+1] profile::kafka::broker: support new confluent distributions [puppet] - https://gerrit.wikimedia.org/r/1239135 (https://phabricator.wikimedia.org/T416670) (owner: Elukey) [09:08:23] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:09:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T418465)', diff saved to https://phabricator.wikimedia.org/P89460 and previous config saved to /var/cache/conftool/dbconfig/20260302-090938-marostegui.json [09:09:42] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:09:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 
db2163.codfw.wmnet with reason: Maintenance [09:10:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T418465)', diff saved to https://phabricator.wikimedia.org/P89461 and previous config saved to /var/cache/conftool/dbconfig/20260302-091003-marostegui.json [09:10:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P89462 and previous config saved to /var/cache/conftool/dbconfig/20260302-091027-marostegui.json [09:13:23] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:13:45] (Merged) jenkins-bot: Limit additional whitespace to sticky header version only [extensions/MobileFrontend] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1245265 (https://phabricator.wikimedia.org/T416598) (owner: Matthias Mullie) [09:15:06] !log mlitn@deploy2002 Started scap sync-world: Backport for [[gerrit:1245265|Limit additional whitespace to sticky header version only (T416598)]] [09:15:09] T416598: Minerva ToC UI tweaks - https://phabricator.wikimedia.org/T416598 [09:16:49] !log mlitn@deploy2002 mlitn: Backport for [[gerrit:1245265|Limit additional whitespace to sticky header version only (T416598)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:21:58] !log mlitn@deploy2002 mlitn: Continuing with sync [09:21:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1245364 (https://phabricator.wikimedia.org/T414476) (owner: Itamar Givon) [09:25:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T418465)', diff saved to https://phabricator.wikimedia.org/P89463 and previous config saved to /var/cache/conftool/dbconfig/20260302-092535-marostegui.json [09:25:39] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:25:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1178.eqiad.wmnet with reason: Maintenance [09:26:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1178 (T418465)', diff saved to https://phabricator.wikimedia.org/P89464 and previous config saved to /var/cache/conftool/dbconfig/20260302-092600-marostegui.json [09:26:08] !log mlitn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1245265|Limit additional whitespace to sticky header version only (T416598)]] (duration: 11m 02s) [09:26:10] T416598: Minerva ToC UI tweaks - https://phabricator.wikimedia.org/T416598 [09:26:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T418465)', diff saved to 
https://phabricator.wikimedia.org/P89465 and previous config saved to /var/cache/conftool/dbconfig/20260302-092610-marostegui.json [09:30:44] ops-eqiad, SRE, SRE-Unowned, DC-Ops: maps1012 not reachable - https://phabricator.wikimedia.org/T418711 (MoritzMuehlenhoff) NEW [09:31:29] (CR) JavierMonton: [C:+2] stream: mediawiki.page_html_content_change [deployment-charts] - https://gerrit.wikimedia.org/r/1245410 (https://phabricator.wikimedia.org/T418467) (owner: JavierMonton) [09:32:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:33:05] (CR) Federico Ceratto: [C:+1] "I'm not sure what would make orchestrator pick up the wrong records e.g. a CNAME, but I checked the output and tested the script on dborch" [puppet] - https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) (owner: Marostegui) [09:33:58] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:34:04] (Merged) jenkins-bot: stream: mediawiki.page_html_content_change [deployment-charts] - https://gerrit.wikimedia.org/r/1245410 (https://phabricator.wikimedia.org/T418467) (owner: JavierMonton) [09:34:08] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:34:34] !log installing gnu TLS security updates [09:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:41] (CR) Marostegui: "We saw that in the past and it was a pain to debug until we found it. 
We weren't sure about what caused it, but it did a few times. Althou" [puppet] - 10https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) (owner: 10Marostegui) [09:34:45] (03CR) 10Marostegui: [C:03+2] orchestrator: Monitor for non-FQDNs in the host resolve cache [puppet] - 10https://gerrit.wikimedia.org/r/1245393 (https://phabricator.wikimedia.org/T272347) (owner: 10Marostegui) [09:35:30] (03PS1) 10Brouberol: trafficserver: enable turnilo-next.w.o redirection to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1247013 (https://phabricator.wikimedia.org/T416113) [09:35:42] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:35:50] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:35:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:36:56] (03CR) 10Brouberol: "The app responds to `turnilo-next.discovery.wmnet:30443`" [puppet] - 10https://gerrit.wikimedia.org/r/1247013 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [09:39:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:41:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P89466 and previous config saved to /var/cache/conftool/dbconfig/20260302-094118-marostegui.json 
[09:42:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T418465)', diff saved to https://phabricator.wikimedia.org/P89467 and previous config saved to /var/cache/conftool/dbconfig/20260302-094236-marostegui.json [09:42:40] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:46:29] (03PS1) 10Slyngshede: C:mtail extend trafficserver_backend_requests_seconds buckets [puppet] - 10https://gerrit.wikimedia.org/r/1247015 (https://phabricator.wikimedia.org/T411584) [09:47:31] 10SRE-swift-storage, 06Data-Persistence, 10Prod-Kubernetes, 06ServiceOps new, and 5 others: Fix thumbor discovery records and make swift use them - https://phabricator.wikimedia.org/T397618#11662096 (10MLechvien-WMF) a:05Clement_Goubert→03None [09:49:13] FIRING: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:51:00] (03CR) 10JMeybohm: [C:04-1] "I'm afraid this will not work since PSP is already disabled in dse-k8s-eqiad." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245367 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [09:53:04] (03CR) 10Fabfur: [C:03+2] hiera: set haproxy version to 3.0 on all remaining magru hosts [puppet] - 10https://gerrit.wikimedia.org/r/1245274 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [09:53:44] 06SRE, 06Infrastructure-Foundations, 10netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11662141 (10ayounsi) To not have it become obsolete too fast, I suggest regrouping the transit providers in a single "peerings and transits" cloud (or one per esams core router) Same w... 
[09:54:44] (03CR) 10Brouberol: [C:03+1] Remove the site.pp definitions for decommissioned an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [09:55:05] !log start upgrading haproxy to 3.0 on A:cp-text_magru (T417253) [09:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:08] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [09:55:13] (03CR) 10Brouberol: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1242271 (owner: 10Muehlenhoff) [09:55:35] (03CR) 10Elukey: "Follow up - I had a chat with Federico on IRC and we talked httpx vs spicerack's apiclient (to avoid a new dependency) and also about the " [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [09:56:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1247013 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [09:56:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P89468 and previous config saved to /var/cache/conftool/dbconfig/20260302-095627-marostegui.json [09:57:21] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru and A:cp - 3.0 upgrade () [09:57:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P89469 and previous config saved to /var/cache/conftool/dbconfig/20260302-095744-marostegui.json [09:57:52] (03CR) 10Brouberol: [C:03+2] trafficserver: enable turnilo-next.w.o redirection to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1247013 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [09:59:02] (03CR) 10JMeybohm: [C:04-1] Add a 
ValidatingAdmissionPolicy permitting access to /srv/spark (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [09:59:08] (03CR) 10Silvan Heintze: [C:03+1] "LGTM, thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1245364 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [10:00:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:01:13] (03CR) 10JMeybohm: [C:04-1] Apply the new PSP and VAP to several namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [10:02:46] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11662185 (10MoritzMuehlenhoff) [10:02:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:08:59] !log installing intel-microcode bugfix updates on Bookworm hosts [10:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:20] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11662211 (10MatthewVernon) So that's a "thumbnail 
the same size as original, rather than original" issue (the original image is 1074p... [10:11:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T418465)', diff saved to https://phabricator.wikimedia.org/P89470 and previous config saved to /var/cache/conftool/dbconfig/20260302-101135-marostegui.json [10:11:38] (03CR) 10Gergő Tisza: rest-gateway: use rlc claim from cookie with bearer token (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [10:11:39] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:11:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2164.codfw.wmnet with reason: Maintenance [10:12:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2164 (T418465)', diff saved to https://phabricator.wikimedia.org/P89471 and previous config saved to /var/cache/conftool/dbconfig/20260302-101200-marostegui.json [10:12:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P89472 and previous config saved to /var/cache/conftool/dbconfig/20260302-101252-marostegui.json [10:16:11] (03CR) 10Jelto: [C:03+2] trafficserver: Add gerrit-replica backend [puppet] - 10https://gerrit.wikimedia.org/r/1240603 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [10:17:44] (03PS2) 10Muehlenhoff: Remove two spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1243129 [10:17:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:18:01] (03CR) 10Muehlenhoff: [C:03+2] Run the cephosd spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1242271 (owner: 10Muehlenhoff) [10:19:31] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete mediawiki spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1240720 (owner: 10Muehlenhoff) [10:19:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:23:19] (03CR) 10Elukey: [C:03+2] profile::kafka::broker: support new confluent distributions [puppet] - 10https://gerrit.wikimedia.org/r/1239135 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [10:28:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T418465)', diff saved to https://phabricator.wikimedia.org/P89473 and previous config saved to /var/cache/conftool/dbconfig/20260302-102800-marostegui.json [10:28:04] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:28:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1192.eqiad.wmnet with reason: Maintenance [10:28:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1192 (T418465)', diff saved to https://phabricator.wikimedia.org/P89474 and previous config saved to /var/cache/conftool/dbconfig/20260302-102825-marostegui.json [10:31:57] (03CR) 10Btullis: 
[C:03+1] dse-k8s: define the opensearch-operator-3 namespace to all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243046 (https://phabricator.wikimedia.org/T418176) (owner: 10Brouberol) [10:32:19] (03CR) 10Btullis: [C:03+1] deployment_server: provision the dse-k8s opensearch-operator-3 kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1243041 (https://phabricator.wikimedia.org/T418176) (owner: 10Brouberol) [10:32:23] (03CR) 10Vgutierrez: "in both text and upload the `no_activity_timeout_out` is set to 180s, I'd cover that timespan with the buckets" [puppet] - 10https://gerrit.wikimedia.org/r/1247015 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [10:32:43] (03CR) 10Btullis: [C:03+1] growhbook: disable frontend telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243023 (https://phabricator.wikimedia.org/T418211) (owner: 10Brouberol) [10:34:30] (03CR) 10Btullis: "Oh, right. I see. OK. I think that we have to go with the upgrade to 1.31." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1245367 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [10:36:47] (03CR) 10Hnowlan: [C:03+1] "lgtm - I was tempted to ask if there is someone associated with this entry that we should ask for consensus, but this has been set to debu" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1245473 (https://phabricator.wikimedia.org/T418612) (owner: 10Cwhite) [10:37:57] (03PS1) 10Elukey: apt::package_from_component: propagate ensure to apt config [puppet] - 10https://gerrit.wikimedia.org/r/1247022 [10:38:37] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247022 (owner: 10Elukey) [10:39:33] (03CR) 10Brouberol: [C:03+2] deployment_server: provision the dse-k8s opensearch-operator-3 kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1243041 (https://phabricator.wikimedia.org/T418176) (owner: 10Brouberol) [10:39:48] (03CR) 10Brouberol: [C:03+2] growhbook: disable frontend telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243023 (https://phabricator.wikimedia.org/T418211) (owner: 10Brouberol) [10:39:54] (03CR) 10Brouberol: [C:03+2] dse-k8s: define the opensearch-operator-3 namespace to all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243046 (https://phabricator.wikimedia.org/T418176) (owner: 10Brouberol) [10:40:07] (03CR) 10Btullis: Add a ValidatingAdmissionPolicy permitting access to /srv/spark (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [10:40:54] (03CR) 10Elukey: [C:03+1] Remove two spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1243129 (owner: 10Muehlenhoff) [10:41:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [10:43:10] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P89475 and previous config saved to /var/cache/conftool/dbconfig/20260302-104310-marostegui.json [10:43:31] (03CR) 10Brouberol: Add a ValidatingAdmissionPolicy permitting access to /srv/spark (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [10:44:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T418465)', diff saved to https://phabricator.wikimedia.org/P89476 and previous config saved to /var/cache/conftool/dbconfig/20260302-104446-marostegui.json [10:44:50] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:44:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:45:10] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Package Confluent Platform 7.5.x / Kafka 3.5 - https://phabricator.wikimedia.org/T416670#11662335 (10elukey) 05Open→03Resolved a:03elukey Tested in pontoon, I was able to upgrade and rollback via puppet. The only detail worth to mention is th... [10:45:22] (03PS2) 10Gkyziridis: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1238692 (https://phabricator.wikimedia.org/T415892) [10:45:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:34] (03CR) 10CI reject: [V:04-1] changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238692 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [10:46:02] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru and A:cp - 3.0 upgrade () [10:46:22] (03CR) 10Gkyziridis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238692 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [10:47:24] (03CR) 10JMeybohm: [C:04-1] Add a ValidatingAdmissionPolicy permitting access to /srv/spark (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [10:47:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:48:07] (03Merged) 10jenkins-bot: dse-k8s: define the opensearch-operator-3 namespace to all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243046 (https://phabricator.wikimedia.org/T418176) (owner: 10Brouberol) [10:48:27] (03CR) 10JMeybohm: [C:04-1] Add a ValidatingAdmissionPolicy permitting access to /srv/spark (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [10:49:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:50:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:50:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [10:51:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [10:51:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [10:51:36] (03CR) 10Muehlenhoff: [C:03+1] "Good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1247022 (owner: 10Elukey) [10:51:56] (03PS3) 10Gkyziridis: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238692 (https://phabricator.wikimedia.org/T415892) [10:52:04] (03CR) 10CI reject: [V:04-1] changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238692 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [10:52:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[10:52:40] 06SRE, 06Infrastructure-Foundations: Create a cookbook to execute Kafka rolling upgrades - https://phabricator.wikimedia.org/T417035#11662347 (10elukey) From the tests in T416670, the procedure should be: * Disable puppet on all target brokers. * Merge a puppet change that sets the new target distribution, li... [10:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:53:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:53:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:54:39] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [10:55:01] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [10:57:01] (03CR) 10Elukey: [C:03+2] apt::package_from_component: propagate ensure to apt config [puppet] - 10https://gerrit.wikimedia.org/r/1247022 (owner: 10Elukey) [10:57:29] (03PS1) 10Gkyziridis: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) [10:57:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin2003.codfw.wmnet [10:58:14] (03Abandoned) 10Gkyziridis: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1238692 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [10:58:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P89477 and previous config saved to /var/cache/conftool/dbconfig/20260302-105818-marostegui.json [10:58:20] (03PS4) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [10:58:20] (03PS5) 10Btullis: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [10:59:12] (03PS1) 10Jelto: cache:text: add gerrit-replica to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247026 (https://phabricator.wikimedia.org/T418108) [10:59:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P89478 and previous config saved to /var/cache/conftool/dbconfig/20260302-105955-marostegui.json [11:00:01] (03CR) 10Arnaudb: [C:03+1] cache:text: add gerrit-replica to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247026 (https://phabricator.wikimedia.org/T418108) (owner: 10Jelto) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1100) [11:03:03] (03PS2) 10Gkyziridis: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) [11:03:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2003.codfw.wmnet [11:03:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:04:18] (03PS1) 10Vgutierrez: cache::haproxy: calculate X-I-B for C and E [puppet] - 10https://gerrit.wikimedia.org/r/1247027 (https://phabricator.wikimedia.org/T417825) [11:06:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:10:25] (03PS3) 10Zabe: multiversion: Stop setting MW_USE_CONFIG_SCHEMA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1246880 (https://phabricator.wikimedia.org/T304460) [11:12:47] jouncebot: nowandnext [11:12:47] For the next 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1100) [11:12:47] In 2 hour(s) and 47 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1400) [11:13:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T418465)', diff saved to https://phabricator.wikimedia.org/P89479 and previous config saved to 
/var/cache/conftool/dbconfig/20260302-111327-marostegui.json [11:13:31] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:13:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2166.codfw.wmnet with reason: Maintenance [11:13:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T418465)', diff saved to https://phabricator.wikimedia.org/P89480 and previous config saved to /var/cache/conftool/dbconfig/20260302-111351-marostegui.json [11:15:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P89481 and previous config saved to /var/cache/conftool/dbconfig/20260302-111502-marostegui.json [11:17:52] (03Abandoned) 10Btullis: Add an analytics PSP permitting access to certain hostPaths [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245367 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [11:18:43] (03CR) 10Slyngshede: [C:03+2] C:mtail update trafficserver_backend_requests_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [11:21:10] jouncebot: nowandnext [11:21:11] For the next 0 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1100) [11:21:11] In 2 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1400) [11:21:32] zabe: Are you deploying? 
I have a private code change I want to deploy [11:21:44] If you are not deploying, I'll use scap [11:21:45] Dreamy_Jazz: nope, feel free to go ahead [11:21:48] Thanks [11:23:16] (03CR) 10Elukey: [C:03+2] Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [11:25:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-jumbo1001.eqiad.wmnet [11:28:39] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11662413 (10Tacsipacsi) I didn’t realize it’s the same size as the original. However, I did notice that – unlike the thumbnail I foun... [11:29:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T418465)', diff saved to https://phabricator.wikimedia.org/P89482 and previous config saved to /var/cache/conftool/dbconfig/20260302-112937-marostegui.json [11:29:41] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:30:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T418465)', diff saved to https://phabricator.wikimedia.org/P89483 and previous config saved to /var/cache/conftool/dbconfig/20260302-113010-marostegui.json [11:30:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1193.eqiad.wmnet with reason: Maintenance [11:30:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-jumbo1001.eqiad.wmnet [11:30:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1193 (T418465)', diff saved to https://phabricator.wikimedia.org/P89484 and previous config saved to /var/cache/conftool/dbconfig/20260302-113034-marostegui.json 
[11:34:02] (03CR) 10Muehlenhoff: [C:03+2] Remove two spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1243129 (owner: 10Muehlenhoff) [11:36:44] (03PS1) 10Elukey: admin_ng: use pki1002 for cfssl-issuer in Wikikube staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247032 (https://phabricator.wikimedia.org/T416664) [11:37:16] (03PS1) 10Muehlenhoff: cloudceph: Run the spec tests on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1247033 [11:37:21] (03PS1) 10Fabfur: varnish: add headers to x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) [11:38:03] (03Abandoned) 10Fabfur: cache::haproxy: save x-ratelimit-class content for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [11:39:06] I'm done with my deploys [11:44:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247032 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [11:44:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P89485 and previous config saved to /var/cache/conftool/dbconfig/20260302-114446-marostegui.json [11:47:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T418465)', diff saved to https://phabricator.wikimedia.org/P89486 and previous config saved to /var/cache/conftool/dbconfig/20260302-114706-marostegui.json [11:47:10] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:48:09] (03PS1) 10Kevin Bazira: ml-services: add policy-violation isvc to the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247035 (https://phabricator.wikimedia.org/T418350) [11:48:57] (03PS1) 10Brouberol: 
turnilo: allow egress to druid-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247037 (https://phabricator.wikimedia.org/T416121) [11:48:57] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: fix inclusion of traffic class check [puppet] - 10https://gerrit.wikimedia.org/r/1247036 [11:51:43] (03CR) 10Fabfur: [C:03+1] "lgtm, good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1247036 (owner: 10Giuseppe Lavagetto) [11:53:15] (03PS5) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [11:53:16] (03PS6) 10Btullis: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [11:53:48] (03PS2) 10Slyngshede: C:mtail extend trafficserver_backend_requests_seconds buckets [puppet] - 10https://gerrit.wikimedia.org/r/1247015 (https://phabricator.wikimedia.org/T411584) [11:54:32] (03PS3) 10Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) [11:54:46] (03CR) 10Btullis: [C:03+1] turnilo: allow egress to druid-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247037 (https://phabricator.wikimedia.org/T416121) (owner: 10Brouberol) [11:54:58] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11662497 (10Ladsgroup) [11:56:45] (03CR) 10Brouberol: [C:03+2] turnilo: allow egress to druid-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247037 (https://phabricator.wikimedia.org/T416121) (owner: 10Brouberol) [11:57:31] (03PS1) 10Ayounsi: Nokia: don't prepend local-as when local-as is used [homer/public] - 10https://gerrit.wikimedia.org/r/1247039 (https://phabricator.wikimedia.org/T417817) [11:57:41] RECOVERY - grafana-rw.wikimedia.org 
tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 30 Mar 2026 11:52:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:57:57] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 30 Mar 2026 11:52:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:58:17] (03PS2) 10Ayounsi: Nokia: don't prepend local-as when local-as is used [homer/public] - 10https://gerrit.wikimedia.org/r/1247039 (https://phabricator.wikimedia.org/T417817) [11:58:53] (03PS3) 10Slyngshede: C:mtail extend trafficserver_backend_requests_seconds buckets [puppet] - 10https://gerrit.wikimedia.org/r/1247015 (https://phabricator.wikimedia.org/T411584) [11:58:58] RESOLVED: CertAlmostExpired: Certificate for service grafana:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#grafana:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:59:49] (03PS1) 10Brouberol: httpd-cas: install the wmf-certificates image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247040 (https://phabricator.wikimedia.org/T417990) [11:59:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P89487 and previous config saved to /var/cache/conftool/dbconfig/20260302-115953-marostegui.json [12:00:12] (03CR) 10JMeybohm: [C:03+1] "LGTM in general but please deploy and validate on staging-codfw first before going to staging-eqiad (or split this patch in two to be extr" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247032 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [12:00:29] (03CR) 10Slyngshede: "Do we want fine grained bucket? We could do 150, 180, but I suppose we're already so slow at this point the it matters very little." 
[puppet] - 10https://gerrit.wikimedia.org/r/1247015 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [12:00:39] (03PS4) 10Slyngshede: C:mtail extend trafficserver_backend_requests_seconds buckets [puppet] - 10https://gerrit.wikimedia.org/r/1247015 (https://phabricator.wikimedia.org/T411584) [12:00:39] (03CR) 10Ayounsi: [C:03+2] Nokia: don't prepend local-as when local-as is used [homer/public] - 10https://gerrit.wikimedia.org/r/1247039 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [12:00:49] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1150.eqiad.wmnet [12:01:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662526 (10ops-monitoring-bot) Host an-worker1150.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:01:33] (03PS1) 10Brouberol: turnilo: bump the httpd-cas image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247041 (https://phabricator.wikimedia.org/T417990) [12:01:42] (03CR) 10CI reject: [V:04-1] turnilo: bump the httpd-cas image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247041 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [12:02:11] (03Merged) 10jenkins-bot: Nokia: don't prepend local-as when local-as is used [homer/public] - 10https://gerrit.wikimedia.org/r/1247039 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [12:02:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P89488 and previous config saved to /var/cache/conftool/dbconfig/20260302-120214-marostegui.json [12:03:18] (03PS1) 10Brouberol: Weekly rebuild of cert-manager - 20260301 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247042 [12:03:35] (03PS2) 10Brouberol: Weekly rebuild of 
cert-manager - 20260301 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247042 [12:06:11] 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11662536 (10KartikMistry) [12:06:17] (03PS6) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [12:06:17] (03PS7) 10Btullis: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [12:06:29] (03PS2) 10Brouberol: httpd-cas: install the wmf-certificates image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247040 (https://phabricator.wikimedia.org/T417990) [12:07:19] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247041 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [12:08:01] (03CR) 10Btullis: [C:03+1] httpd-cas: install the wmf-certificates image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247040 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [12:08:58] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247041 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [12:10:23] (03CR) 10Btullis: [C:03+1] turnilo: bump the httpd-cas image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247041 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [12:11:00] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM, thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1247035 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [12:12:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1150.eqiad.wmnet [12:12:32] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1151.eqiad.wmnet [12:12:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662570 (10ops-monitoring-bot) Host an-worker1151.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:14:18] (03CR) 10Btullis: [V:03+1 C:03+2] Add a /srv/spark managed directory on dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1245413 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [12:15:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T418465)', diff saved to https://phabricator.wikimedia.org/P89489 and previous config saved to /var/cache/conftool/dbconfig/20260302-121501-marostegui.json [12:15:05] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:15:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance [12:15:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T418465)', diff saved to https://phabricator.wikimedia.org/P89490 and previous config saved to /var/cache/conftool/dbconfig/20260302-121525-marostegui.json [12:16:00] (03CR) 10Brouberol: [C:03+2] httpd-cas: install the wmf-certificates image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247040 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [12:16:03] 
(03CR) 10Brouberol: [V:03+2 C:03+2] httpd-cas: install the wmf-certificates image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247040 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [12:16:23] (03CR) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [12:16:48] (03CR) 10Brouberol: [C:03+2] turnilo: bump the httpd-cas image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247041 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [12:17:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P89491 and previous config saved to /var/cache/conftool/dbconfig/20260302-121722-marostegui.json [12:20:01] (03CR) 10Kevin Bazira: [C:03+2] ml-services: add policy-violation isvc to the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247035 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [12:20:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [12:20:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [12:21:03] (03PS2) 10Vgutierrez: cache::haproxy: calculate X-I-B for C and E [puppet] - 10https://gerrit.wikimedia.org/r/1247027 (https://phabricator.wikimedia.org/T417825) [12:21:58] (03Merged) 10jenkins-bot: ml-services: add policy-violation isvc to the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247035 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [12:23:59] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:24:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1151.eqiad.wmnet [12:24:05] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1152.eqiad.wmnet [12:24:08] (03PS7) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [12:24:08] (03PS8) 10Btullis: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [12:24:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662614 (10ops-monitoring-bot) Host an-worker1152.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:25:01] (03PS9) 10Btullis: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [12:26:23] (03CR) 10Btullis: Apply the new VAP to several namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [12:26:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [12:27:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [12:28:00] (03PS1) 10Brouberol: turnilo: add druid missing port to configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247056 (https://phabricator.wikimedia.org/T416113) [12:29:01] (03PS1) 10Kosta Harlan: IPInfo: Set log level to "info" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247057 (https://phabricator.wikimedia.org/T374718) [12:29:19] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:29:39] 
RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [12:29:44] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:29:44] FIRING: SLOMetricAbsent: wdqs-scholarly-availability esams - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:30:03] (03PS2) 10Kosta Harlan: IPInfo: Set log level to "info" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247057 (https://phabricator.wikimedia.org/T374718) [12:30:27] (03CR) 10Btullis: "Do not forget to reboot after running the puppet to change the role, since this will install the latest backported kernel for bookworm." [puppet] - 10https://gerrit.wikimedia.org/r/1245422 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [12:30:44] (03CR) 10Brouberol: Add the five new dse-k8s-worker nodes to the cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1245422 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [12:31:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T418465)', diff saved to https://phabricator.wikimedia.org/P89492 and previous config saved to /var/cache/conftool/dbconfig/20260302-123108-marostegui.json [12:31:12] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:32:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T418465)', diff saved to https://phabricator.wikimedia.org/P89493 and previous config saved to /var/cache/conftool/dbconfig/20260302-123229-marostegui.json [12:32:46] !log marostegui@cumin1003 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1203.eqiad.wmnet with reason: Maintenance [12:32:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1203 (T418465)', diff saved to https://phabricator.wikimedia.org/P89494 and previous config saved to /var/cache/conftool/dbconfig/20260302-123253-marostegui.json [12:33:23] FIRING: [5x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:35:28] jouncebot: nowandnext [12:35:28] No deployments scheduled for the next 1 hour(s) and 24 minute(s) [12:35:28] In 1 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1400) [12:35:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1152.eqiad.wmnet [12:35:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1153.eqiad.wmnet [12:35:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247057 (https://phabricator.wikimedia.org/T374718) (owner: Kosta Harlan) [12:35:57] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662646 (ops-monitoring-bot) Host an-worker1153.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[12:38:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:40:47] (PS2) Btullis: Add the five new dse-k8s-worker nodes to the cluster [puppet] - https://gerrit.wikimedia.org/r/1245422 (https://phabricator.wikimedia.org/T418582) [12:42:14] (CR) Brouberol: [C:+1] Add the five new dse-k8s-worker nodes to the cluster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1245422 (https://phabricator.wikimedia.org/T418582) (owner: Btullis) [12:43:07] (CR) Joal: [C:+1] turnilo: add druid missing port to configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1247056 (https://phabricator.wikimedia.org/T416113) (owner: Brouberol) [12:43:31] (CR) Brouberol: [C:+2] turnilo: add druid missing port to configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1247056 (https://phabricator.wikimedia.org/T416113) (owner: Brouberol) [12:43:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:46:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P89497 and previous config saved to /var/cache/conftool/dbconfig/20260302-124615-marostegui.json [12:46:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
an-worker1153.eqiad.wmnet [12:46:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1154.eqiad.wmnet [12:47:20] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662672 (ops-monitoring-bot) Host an-worker1154.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:48:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:49:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T418465)', diff saved to https://phabricator.wikimedia.org/P89498 and previous config saved to /var/cache/conftool/dbconfig/20260302-124917-marostegui.json [12:49:21] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:50:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:51:14] ops-eqiad, SRE, DC-Ops: Unresponsive management for maps1012.mgmt:22 - https://phabricator.wikimedia.org/T418663#11662686 (Jclark-ctr) Open→Resolved a:Jclark-ctr Loose Cable Reaseated [12:52:52] (PS2) Zabe: filtered_tables: Drop old
categorylinks columns [puppet] - 10https://gerrit.wikimedia.org/r/1239484 (https://phabricator.wikimedia.org/T299951) [12:52:56] (03CR) 10Ladsgroup: [V:03+2 C:03+2] filtered_tables: Drop old categorylinks columns [puppet] - 10https://gerrit.wikimedia.org/r/1239484 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [12:53:08] (03PS3) 10Zabe: maintain-views: Drop view for old categorylinks columns [puppet] - 10https://gerrit.wikimedia.org/r/1239483 (https://phabricator.wikimedia.org/T417492) [12:53:12] (03CR) 10Ladsgroup: [V:03+2 C:03+2] maintain-views: Drop view for old categorylinks columns [puppet] - 10https://gerrit.wikimedia.org/r/1239483 (https://phabricator.wikimedia.org/T417492) (owner: 10Zabe) [12:56:09] RECOVERY - Host maps1012 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [12:56:21] 10ops-eqiad, 06SRE, 07SRE-Unowned, 06DC-Ops: maps1012 not reachable - https://phabricator.wikimedia.org/T418711#11662701 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Found that the power cables not seated completely Reseated Verified mgmt came up and booted to login [12:56:40] (03CR) 10Btullis: [C:03+2] Add the five new dse-k8s-worker nodes to the cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1245422 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [12:57:07] PROBLEM - Postgres Replication Lag on maps1012 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 85518295288 and 258060 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:58:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1154.eqiad.wmnet [12:58:25] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1155.eqiad.wmnet [12:58:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662728 
(ops-monitoring-bot) Host an-worker1155.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:58:52] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2356.codfw.wmnet [12:58:54] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2356.codfw.wmnet [13:00:09] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [13:00:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:01:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P89499 and previous config saved to /var/cache/conftool/dbconfig/20260302-130122-marostegui.json [13:03:12] (CR) Kamila Součková: [C:+1] rest-gateway: no limits for wmcs for now [deployment-charts] - https://gerrit.wikimedia.org/r/1243807 (owner: Daniel Kinzler) [13:03:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:04:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P89500 and previous config saved to /var/cache/conftool/dbconfig/20260302-130424-marostegui.json [13:05:16] jouncebot: nowandnext [13:05:16] No deployments scheduled
for the next 0 hour(s) and 54 minute(s) [13:05:16] In 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1400) [13:05:24] labs only, just rebase [13:05:29] (03CR) 10Ladsgroup: [C:03+2] labs: Adopt same thumbnail steps and buckets as production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244672 (https://phabricator.wikimedia.org/T69525) (owner: 10Krinkle) [13:06:24] (03Merged) 10jenkins-bot: labs: Adopt same thumbnail steps and buckets as production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244672 (https://phabricator.wikimedia.org/T69525) (owner: 10Krinkle) [13:06:52] (03PS1) 10Kevin Bazira: ml-services: reallocate k8s resources to enable policy-violation isvc deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247061 (https://phabricator.wikimedia.org/T418350) [13:07:01] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [13:07:53] (03PS2) 10Arnaudb: gerrit: alerting downtime update [cookbooks] - 10https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T416917) [13:08:35] FIRING: MailmanOutQueueHigh: Mailman out queue on lists1004:9100 has more than 50 messages - https://wikitech.wikimedia.org/wiki/Mailman/Runbooks#MailmanOutQueueHigh - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=2 - https://alerts.wikimedia.org/?q=alertname%3DMailmanOutQueueHigh [13:08:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1024.eqiad.wmnet [13:09:44] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:09:44] FIRING: [5x] SLOMetricAbsent: wdqs-scholarly-availability codfw - 
https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:12:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1155.eqiad.wmnet [13:12:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1156.eqiad.wmnet [13:13:01] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662788 (ops-monitoring-bot) Host an-worker1156.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:13:11] (CR) CI reject: [V:-1] gerrit: alerting downtime update [cookbooks] - https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T416917) (owner: Arnaudb) [13:13:23] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:14:45] !log installing libcap2 updates from Bookworm point release [13:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1024.eqiad.wmnet [13:16:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T418465)', diff saved to https://phabricator.wikimedia.org/P89502 and previous config saved to /var/cache/conftool/dbconfig/20260302-131630-marostegui.json [13:16:33] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:16:47] !log marostegui@cumin1003 DONE (PASS)
- Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2181.codfw.wmnet with reason: Maintenance [13:16:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2181 (T418465)', diff saved to https://phabricator.wikimedia.org/P89503 and previous config saved to /var/cache/conftool/dbconfig/20260302-131653-marostegui.json [13:17:02] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1024.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:19:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P89504 and previous config saved to /var/cache/conftool/dbconfig/20260302-131932-marostegui.json [13:19:37] (03CR) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [13:22:02] FIRING: [3x] KubernetesCalicoDown: dse-k8s-worker1024.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:22:26] (03PS2) 10Anzx: lawiki: add Adumbratio (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) [13:22:27] (03CR) 10Daniel Kinzler: rest gateway: expose headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 
(https://phabricator.wikimedia.org/T417780) (owner: 10Daniel Kinzler) [13:22:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) (owner: 10Anzx) [13:23:06] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11662828 (10MoritzMuehlenhoff) [13:23:23] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:24:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1156.eqiad.wmnet [13:24:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1157.eqiad.wmnet [13:24:09] (03CR) 10Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [13:24:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662834 (10ops-monitoring-bot) Host an-worker1157.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[13:24:44] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:26:50] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1097 [13:27:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1097 [13:27:28] (CR) Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: Daniel Kinzler) [13:27:34] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:28:49] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [13:32:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T418465)', diff saved to https://phabricator.wikimedia.org/P89505 and previous config saved to /var/cache/conftool/dbconfig/20260302-133247-marostegui.json [13:32:51] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:33:17] jclark@cumin1003 netbox (PID 1327745) is awaiting input [13:34:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T418465)', diff saved to https://phabricator.wikimedia.org/P89506 and previous config saved to /var/cache/conftool/dbconfig/20260302-133440-marostegui.json [13:34:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1214.eqiad.wmnet with reason: Maintenance [13:35:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1214 (T418465)', diff saved to
https://phabricator.wikimedia.org/P89507 and previous config saved to /var/cache/conftool/dbconfig/20260302-133503-marostegui.json [13:35:16] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [13:35:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1157.eqiad.wmnet [13:35:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1158.eqiad.wmnet [13:36:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662870 (10ops-monitoring-bot) Host an-worker1158.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:37:44] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt ms-be1097 - jclark@cumin1003" [13:38:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt ms-be1097 - jclark@cumin1003" [13:38:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:38:52] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1097 [13:38:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1097 [13:39:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:40:39] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1097.eqiad.wmnet with OS bullseye [13:40:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install 
ms-be109[67] - https://phabricator.wikimedia.org/T413089#11662916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-be1097.eqiad.wmnet with OS bullseye [13:41:46] (03CR) 10Giuseppe Lavagetto: [C:03+1] cache::haproxy: calculate X-I-B for C and E [puppet] - 10https://gerrit.wikimedia.org/r/1247027 (https://phabricator.wikimedia.org/T417825) (owner: 10Vgutierrez) [13:42:56] (03PS1) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) [13:43:23] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::haproxy: fix inclusion of traffic class check [puppet] - 10https://gerrit.wikimedia.org/r/1247036 (owner: 10Giuseppe Lavagetto) [13:44:30] (03CR) 10CI reject: [V:04-1] Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [13:45:49] (03PS1) 10Brouberol: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) [13:46:31] (03PS2) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) [13:47:02] RESOLVED: [2x] KubernetesCalicoDown: dse-k8s-worker1025.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:47:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1158.eqiad.wmnet [13:47:37] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1159.eqiad.wmnet [13:47:55] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P89508 and previous config saved to /var/cache/conftool/dbconfig/20260302-134754-marostegui.json [13:48:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662937 (10ops-monitoring-bot) Host an-worker1159.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:49:09] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: calculate X-I-B for C and E [puppet] - 10https://gerrit.wikimedia.org/r/1247027 (https://phabricator.wikimedia.org/T417825) (owner: 10Vgutierrez) [13:50:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T418465)', diff saved to https://phabricator.wikimedia.org/P89509 and previous config saved to /var/cache/conftool/dbconfig/20260302-135021-marostegui.json [13:50:25] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:53:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:54:00] (03PS2) 10Tiziano Fogli: ldap_users_sync.py: format code [puppet] - 10https://gerrit.wikimedia.org/r/1243062 (https://phabricator.wikimedia.org/T418118) [13:54:00] (03PS6) 10Tiziano Fogli: ldap_users_sync.py: add non-blocking errors handling [puppet] - 10https://gerrit.wikimedia.org/r/1243063 (https://phabricator.wikimedia.org/T418118) [13:54:35] !log 
btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1025.eqiad.wmnet [13:54:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:54:52] (03CR) 10CI reject: [V:04-1] ldap_users_sync.py: add non-blocking errors handling [puppet] - 10https://gerrit.wikimedia.org/r/1243063 (https://phabricator.wikimedia.org/T418118) (owner: 10Tiziano Fogli) [13:57:08] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: reallocate k8s resources to enable policy-violation isvc deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247061 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:57:14] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1097.eqiad.wmnet with reason: host reimage [13:58:23] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:58:54] (03PS1) 10Zabe: ImageListPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1247068 (https://phabricator.wikimedia.org/T418327) [13:59:01] (03PS1) 10Volans: admin: add temporary key for cmooney [puppet] - 10https://gerrit.wikimedia.org/r/1247069 [13:59:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:59:38] (03PS4) 10Arnaudb: gerrit: alerting downtime update [cookbooks] - 10https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) [13:59:38] (03CR) 10Volans: "Key verified out of band too" [puppet] - 10https://gerrit.wikimedia.org/r/1247069 (owner: 10Volans) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1400). nyaa~ [14:00:05] itamarWMDE, kostajh, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:27] My patch is a no-op, so whoever is deploying, could you please sync it out? 
[14:00:34] o/
[14:00:38] (Abandoned) Tiziano Fogli: ldap_users_sync.py: add non-blocking errors handling [puppet] - https://gerrit.wikimedia.org/r/1243063 (https://phabricator.wikimedia.org/T418118) (owner: Tiziano Fogli)
[14:00:40] well, it's not a no-op, but it doesn't need verification, rather
[14:00:47] (CR) Cathal Mooney: [C:+1] "Thanks <3" [puppet] - https://gerrit.wikimedia.org/r/1247069 (owner: Volans)
[14:00:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1025.eqiad.wmnet
[14:00:57] (CR) Elukey: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[14:00:58] o/
[14:01:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1159.eqiad.wmnet
[14:01:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1005.wikimedia.org
[14:01:47] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1160.eqiad.wmnet
[14:01:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[14:02:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1028.eqiad.wmnet
[14:02:14] (CR) Brouberol: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[14:02:15]
10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11662974 (10ops-monitoring-bot) Host an-worker1160.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:03:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P89510 and previous config saved to /var/cache/conftool/dbconfig/20260302-140302-marostegui.json [14:03:32] (03CR) 10Marostegui: sre.mysql.clone: record clone runs into Zarcillo (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [14:03:38] o/ [14:03:49] (03CR) 10Brouberol: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:04:06] I can deploy :) [14:04:18] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1097.eqiad.wmnet with reason: host reimage [14:04:27] thanks Lucas_WMDE! [14:04:34] I wrote a note above, my patch can go out without verification [14:04:50] ack [14:05:03] Thank you Lucas_WMDE! 
[14:05:09] (PS3) Anzx: lawiki: add Adumbratio (draft) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706)
[14:05:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P89511 and previous config saved to /var/cache/conftool/dbconfig/20260302-140529-marostegui.json
[14:05:46] (CR) Lucas Werkmeister (WMDE): [C:-1] lawiki: add Adumbratio (draft) namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) (owner: Anzx)
[14:05:47] (PS4) Anzx: lawiki: add Adumbratio (draft) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706)
[14:05:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1005.wikimedia.org
[14:05:54] (CR) Elukey: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[14:05:59] (CR) Lucas Werkmeister (WMDE): lawiki: add Adumbratio (draft) namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) (owner: Anzx)
[14:06:17] (CR) Anzx: lawiki: add Adumbratio (draft) namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) (owner: Anzx)
[14:06:20] (CR) Lucas Werkmeister (WMDE): lawiki: add Adumbratio (draft) namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) (owner: Anzx)
[14:06:35] I’ll start with anzx
[14:06:57] ok
[14:07:17] ceterum censeo diffConfig would be more readable if it had trailing commas
https://integration.wikimedia.org/ci/job/operations-mw-config-php83-composer-diffConfig/876/console [14:08:09] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:08:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1028.eqiad.wmnet [14:08:45] (03CR) 10Elukey: [C:03+2] "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247032 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [14:08:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] lawiki: add Adumbratio (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) (owner: 10Anzx) [14:09:05] (03CR) 10Volans: [C:03+2] admin: add temporary key for cmooney [puppet] - 10https://gerrit.wikimedia.org/r/1247069 (owner: 10Volans) [14:09:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) (owner: 10Anzx) [14:09:09] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:09:10] FIRING: BFDdown: BFD session down between cr2-esams and 208.80.153.217 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:09:28] kostajh: I think I’d rather deploy your config change separately, in case the log volume turns out to be excessive on mwdebug [14:10:01] (03Merged) 10jenkins-bot: lawiki: add Adumbratio (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247063 (https://phabricator.wikimedia.org/T418706) (owner: 10Anzx) [14:10:11] !log 
elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:10:21] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1247063|lawiki: add Adumbratio (draft) namespace (T418706)]] [14:10:24] T418706: New “Draft” namespace for lawiki: “Adumbratio” - https://phabricator.wikimedia.org/T418706 [14:10:31] !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [14:10:35] Lucas_WMDE: sounds fine [14:11:11] (03CR) 10Marostegui: sre.mysql.clone: record clone runs into Zarcillo (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [14:11:51] !log elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:11:55] !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [14:12:11] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1247063|lawiki: add Adumbratio (draft) namespace (T418706)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:12:17] looking [14:13:05] !log installing libcap2 updates from Trixie point release [14:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:21] (03CR) 10Brouberol: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:13:30] Lucas_WMDE: namespace appears, ok to sync [14:13:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1160.eqiad.wmnet [14:13:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1161.eqiad.wmnet [14:13:47] thanks! 
[14:13:48] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [14:13:49] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, anzx: Continuing with sync [14:14:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11663010 (10ops-monitoring-bot) Host an-worker1161.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:14:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2004.codfw.wmnet [14:14:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:15:46] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11663021 (10ABran-WMF) [14:16:41] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11663025 (10ABran-WMF) after discussing with @Vgutierrez and @Fabfur, it seems we don't need to add a VIP for that specific use case. 
[14:17:49] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247063|lawiki: add Adumbratio (draft) namespace (T418706)]] (duration: 07m 27s) [14:17:52] T418706: New “Draft” namespace for lawiki: “Adumbratio” - https://phabricator.wikimedia.org/T418706 [14:18:04] Lucas_WMDE: Thanks for deploying, could you please run namespacedupes [14:18:11] yup [14:18:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T418465)', diff saved to https://phabricator.wikimedia.org/P89512 and previous config saved to /var/cache/conftool/dbconfig/20260302-141810-marostegui.json [14:18:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2004.codfw.wmnet [14:18:14] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:18:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2195.codfw.wmnet with reason: Maintenance [14:18:31] ladsgroup@cumin1003 update-views (PID 1358428) is awaiting input [14:18:33] !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=ms-fe1013.eqiad.wmnet [14:18:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2195 (T418465)', diff saved to https://phabricator.wikimedia.org/P89513 and previous config saved to /var/cache/conftool/dbconfig/20260302-141834-marostegui.json [14:18:47] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: namespaceDupes lawiki --fix # T418706 [14:19:26] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:19:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1245364 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [14:20:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P89514 and previous config saved to /var/cache/conftool/dbconfig/20260302-142037-marostegui.json [14:20:39] (03Merged) 10jenkins-bot: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1245364 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [14:20:56] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1245364|Add configurations for graphql usage survey and its pipeline tests (T414476)]] [14:21:00] T414476: 📚 Add QuickSurvey to the dedicate page on Wikidata for GraphQL - https://phabricator.wikimedia.org/T414476 [14:21:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:22:43] !log lucaswerkmeister-wmde@deploy2002 itamar, lucaswerkmeister-wmde: Backport for [[gerrit:1245364|Add configurations for graphql usage survey and its pipeline tests (T414476)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:23:06] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
[14:23:14] itamarWMDE: can you test the change on mwdebug?
[14:23:23] aye
[14:23:38] It works, still generating logspam?
[14:24:38] elukey@cumin1003 provision (PID 1363207) is awaiting input
[14:24:49] yeah, there’s quite a bit of “Message blob for wikibase.vector.scopedTypeaheadSearch should have been preloaded” :/
[14:24:52] Lucas_WMDE: If everything is alright on your side. We're good to sync
[14:24:58] is that the same message seen earlier?
[14:25:04] nope
[14:25:07] (actually, is that even related?)
[14:25:12] doesn't seem related either
[14:25:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1161.eqiad.wmnet
[14:25:15] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1162.eqiad.wmnet
[14:25:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:25:30] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11663058 (ops-monitoring-bot) Host an-worker1162.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[14:26:02] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views
[14:26:17] apparently that’s T409033
[14:26:17] T409033: "Message blob for wikibase.vector.scopedTypeaheadSearch should have been preloaded" - https://phabricator.wikimedia.org/T409033
[14:26:48] !log lucaswerkmeister-wmde@deploy2002 itamar, lucaswerkmeister-wmde: Continuing with sync
[14:26:50] let’s go ahead
[14:26:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[14:26:58] (CR) Zabe: Stop writing to il_to on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1240320 (https://phabricator.wikimedia.org/T415787) (owner: Zabe)
[14:27:15] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:28:39] (PS1) STran: Deploy temporary accounts to ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771)
[14:28:58] (CR) STran: [C:-2] "waiting on approval" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: STran)
[14:29:50] (CR) CI reject: [V:-1] Deploy temporary accounts to ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: STran)
[14:30:06] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11663088 (elukey) @VRiley-WMF Hi!
I've depooled ms-fe1013 but the provision script fails, since Redfish doesn't expose any BIOS optio... [14:30:20] jclark@cumin1003 reimage (PID 1334922) is awaiting input [14:30:40] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:40] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1245364|Add configurations for graphql usage survey and its pipeline tests (T414476)]] (duration: 09m 44s) [14:30:44] T414476: 📚 Add QuickSurvey to the dedicate page on Wikidata for GraphQL - https://phabricator.wikimedia.org/T414476 [14:30:56] Lucas_WMDE: Thank you for the deployment [14:30:59] np [14:31:08] kostajh: your change is next ^^ [14:31:08] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [14:31:16] Lucas_WMDE: cool [14:31:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247057 (https://phabricator.wikimedia.org/T374718) (owner: 10Kosta Harlan) [14:32:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:32:32] (03Merged) 10jenkins-bot: IPInfo: Set log level to "info" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247057 (https://phabricator.wikimedia.org/T374718) (owner: 10Kosta Harlan) [14:32:50] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1247057|IPInfo: Set log level to "info" (T374718)]] [14:32:53] T374718: Allow Special:IPInfo to return IP information of arbitrary addresses for users with the correct permissions - 
https://phabricator.wikimedia.org/T374718
[14:33:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T418465)', diff saved to https://phabricator.wikimedia.org/P89515 and previous config saved to /var/cache/conftool/dbconfig/20260302-143315-marostegui.json
[14:33:21] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[14:33:23] (PS2) STran: Deploy temporary accounts to ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771)
[14:34:49] !log lucaswerkmeister-wmde@deploy2002 kharlan, lucaswerkmeister-wmde: Backport for [[gerrit:1247057|IPInfo: Set log level to "info" (T374718)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:35:24] kostajh: is there a way to trigger this message appearing in logstash?
[14:35:29] yes, one second [14:35:40] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T418465)', diff saved to https://phabricator.wikimedia.org/P89516 and previous config saved to /var/cache/conftool/dbconfig/20260302-143544-marostegui.json [14:36:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1226.eqiad.wmnet with reason: Maintenance [14:36:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1226 (T418465)', diff saved to https://phabricator.wikimedia.org/P89517 and previous config saved to /var/cache/conftool/dbconfig/20260302-143608-marostegui.json [14:36:28] Lucas_WMDE: it's fine to sync out [14:36:30] (03PS3) 10Tiziano Fogli: ldap_users_sync.py: format code [puppet] - 10https://gerrit.wikimedia.org/r/1243062 (https://phabricator.wikimedia.org/T418118) [14:36:30] (03PS1) 10Tiziano Fogli: grafana/ldap_users_sync: delete a user if it has invalid metadata [puppet] - 10https://gerrit.wikimedia.org/r/1247076 (https://phabricator.wikimedia.org/T418118) [14:36:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:36:57] !log lucaswerkmeister-wmde@deploy2002 kharlan, lucaswerkmeister-wmde: Continuing with sync [14:36:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
an-worker1162.eqiad.wmnet [14:36:59] alright, thanks [14:37:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet [14:37:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:37:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11663133 (10ops-monitoring-bot) Host an-worker1163.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:40:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:40:51] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247057|IPInfo: Set log level to "info" (T374718)]] (duration: 08m 01s) [14:40:54] T374718: Allow Special:IPInfo to return IP information of arbitrary addresses for users with the correct permissions - https://phabricator.wikimedia.org/T374718 [14:40:55] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:41:25] !log UTC afternoon backport+config window done [14:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:46:57] Lucas_WMDE: thanks! [14:47:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:47:37] np :) [14:48:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P89519 and previous config saved to /var/cache/conftool/dbconfig/20260302-144823-marostegui.json [14:48:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet [14:48:59] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet [14:49:11] (PS2) Arnaudb: mailman: add lists to service catalog [puppet] - https://gerrit.wikimedia.org/r/1247078 (https://phabricator.wikimedia.org/T286066) [14:49:22] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11663164 (ops-monitoring-bot) Host an-worker1164.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[14:49:53] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11663166 (ABran-WMF) [14:50:40] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745 (MatthewVernon) NEW [14:52:44] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11663186 (ABran-WMF) [14:52:55] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T418465)', diff saved to https://phabricator.wikimedia.org/P89520 and previous config saved to /var/cache/conftool/dbconfig/20260302-145318-marostegui.json [14:53:22] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:54:53] (PS4) Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) [14:55:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:56:47] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:57:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:57:41] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11663204 (ABran-WMF) [14:58:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet [14:58:45] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet [14:58:46] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11663208 (MoritzMuehlenhoff) [14:58:55] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11663209 (ops-monitoring-bot) Host an-worker1165.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:59:49] (PS4) Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) [15:00:03] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11663212 (ABran-WMF) I've discarded the "Create a new conftool entry" item of the task list, it was mirrored from the "gerrit behind...
[15:00:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [15:00:31] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1097.eqiad.wmnet with OS bullseye [15:00:42] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11663215 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-be1097.eqiad.wmnet with OS bullseye completed: - ms-be1097 (*... [15:00:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:00:51] (PS2) Brouberol: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) [15:00:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:02:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:02:31] (CR) Brouberol: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol) [15:02:55] (CR)
CI reject: [V:-1] kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol) [15:03:12] (CR) Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto) [15:03:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P89522 and previous config saved to /var/cache/conftool/dbconfig/20260302-150330-marostegui.json [15:03:52] (PS3) Brouberol: kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) [15:04:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:06:53] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:07:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:08:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P89523 and previous config saved to /var/cache/conftool/dbconfig/20260302-150826-marostegui.json [15:09:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:10:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet [15:10:23] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1166.eqiad.wmnet [15:10:27] SRE, SRE-Access-Requests: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221#11663256 (EMcFarland-WMF) @thcipriani I looked and my request for Spiderpig-access is still pending. Could you please grant me access? Thanks! [15:10:44] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11663257 (ops-monitoring-bot) Host an-worker1166.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[15:12:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:12:25] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:16:57] (PS5) Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) [15:18:23] (PS1) Jcrespo: mysql: Remove mysql-io-pressure alerts [alerts] - https://gerrit.wikimedia.org/r/1247084 [15:18:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T418465)', diff saved to https://phabricator.wikimedia.org/P89524 and previous config saved to /var/cache/conftool/dbconfig/20260302-151838-marostegui.json [15:18:42] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:18:50] (PS2) Jcrespo: mysql: Remove mysql-io-pressure alerts [alerts] - https://gerrit.wikimedia.org/r/1247084 [15:18:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2198.codfw.wmnet with reason: Maintenance [15:20:03] SRE, Bitu, Infrastructure-Foundations: wikimedia-l was signed up for a developer account - https://phabricator.wikimedia.org/T418201#11663284 (SLyngshede-WMF) Open→In progress p:Triage→High [15:20:11] (PS3) Jcrespo: mysql: Remove mysql-io-pressure alerts [alerts] - https://gerrit.wikimedia.org/r/1247084 [15:20:26] (CR) Jelto: gerrit: alerting downtime update (2
comments) [cookbooks] - https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) (owner: Arnaudb) [15:20:46] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11663288 (ayounsi) p:Triage→Low [15:21:24] (CR) Marostegui: [C:+1] mysql: Remove mysql-io-pressure alerts [alerts] - https://gerrit.wikimedia.org/r/1247084 (owner: Jcrespo) [15:22:03] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Remove Puppet 5 dependencies from PCC - https://phabricator.wikimedia.org/T418559#11663293 (MoritzMuehlenhoff) p:Triage→Medium [15:22:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1166.eqiad.wmnet [15:22:12] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1167.eqiad.wmnet [15:22:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:22:33] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11663294 (ops-monitoring-bot) Host an-worker1167.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[15:22:40] SRE, Infrastructure-Foundations: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440#11663295 (MoritzMuehlenhoff) p:Triage→Medium [15:23:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P89525 and previous config saved to /var/cache/conftool/dbconfig/20260302-152334-marostegui.json [15:26:18] (CR) Federico Ceratto: [C:+1] "Discussed on IRC: we always place the alert back in future if needed. LGTM" [alerts] - https://gerrit.wikimedia.org/r/1247084 (owner: Jcrespo) [15:28:14] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11663336 (AntiCompositeNumber) Original-size thumbnails must be supported as not all formats are web-safe. [15:29:31] (CR) Kevin Bazira: [C:+2] ml-services: reallocate k8s resources to enable policy-violation isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1247061 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira) [15:30:00] (CR) Jcrespo: [C:+2] "100%- it was not about the idea, but about the current, outdated implementation. In the current state it caused more confusion than help."
[alerts] - https://gerrit.wikimedia.org/r/1247084 (owner: Jcrespo) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1530) [15:31:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Restore db1226 full weight after schema change', diff saved to https://phabricator.wikimedia.org/P89526 and previous config saved to /var/cache/conftool/dbconfig/20260302-153100-marostegui.json [15:31:30] (Merged) jenkins-bot: mysql: Remove mysql-io-pressure alerts [alerts] - https://gerrit.wikimedia.org/r/1247084 (owner: Jcrespo) [15:31:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1209.eqiad.wmnet with reason: Maintenance [15:31:52] (Merged) jenkins-bot: ml-services: reallocate k8s resources to enable policy-violation isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1247061 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira) [15:32:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2165.codfw.wmnet with reason: Maintenance [15:32:51] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:33:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1167.eqiad.wmnet [15:34:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1169.eqiad.wmnet [15:34:26] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11663398 (ops-monitoring-bot) Host an-worker1169.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[15:34:55] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:15] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55854 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:48] (PS3) Ayounsi: Add support for POP Nokia switches [homer/public] - https://gerrit.wikimedia.org/r/1243875 (https://phabricator.wikimedia.org/T408892) [15:38:21] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [15:38:36] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [15:39:38] (CR) Ayounsi: [V:+1] "Tested manually, NOOP on all the Nokia eqiad/codfw." [homer/public] - https://gerrit.wikimedia.org/r/1243875 (https://phabricator.wikimedia.org/T408892) (owner: Ayounsi) [15:39:55] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11663413 (Ladsgroup) >>! In T414805#11663336, @AntiCompositeNumber wrote: > Original-size thumbnails must be supported as not all f...
[15:40:25] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:40:33] (Abandoned) Brouberol: Weekly rebuild of cert-manager - 20260301 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1247042 (owner: Brouberol) [15:42:40] (PS5) Arnaudb: gerrit: alerting downtime update [cookbooks] - https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) [15:45:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:45:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T418465)', diff saved to https://phabricator.wikimedia.org/P89527 and previous config saved to /var/cache/conftool/dbconfig/20260302-154520-marostegui.json [15:45:23] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:45:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1169.eqiad.wmnet [15:46:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance [15:47:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 -
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:48:21] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [15:48:29] (PS3) CDanis: haproxy: silent-drop: lower limit [puppet] - https://gerrit.wikimedia.org/r/1196723 [15:49:47] (PS5) Arnaudb: gerrit: alerting downtime update [cookbooks] - https://gerrit.wikimedia.org/r/1239003 (https://phabricator.wikimedia.org/T418264) [15:51:19] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11663515 (AntiCompositeNumber) We intentionally don't serve WebP originals for browser support reasons at the time support was adde... [15:52:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:55:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2145.codfw.wmnet with reason: Maintenance [15:55:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T418465)', diff saved to https://phabricator.wikimedia.org/P89528 and previous config saved to /var/cache/conftool/dbconfig/20260302-155527-marostegui.json [15:55:30] (PS2) RLazarus: alertmanager: Also add ServiceOps to mw-cron tasks for unstewarded components [puppet] - https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: A smart kitten) [15:55:31] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:55:45] SRE,
Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11663543 (MoritzMuehlenhoff) [15:55:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T418465)', diff saved to https://phabricator.wikimedia.org/P89529 and previous config saved to /var/cache/conftool/dbconfig/20260302-155555-marostegui.json [15:56:19] !log installing glibc bugfix updates from trixie point release [15:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:57:48] (CR) RLazarus: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: A smart kitten) [15:58:05] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 1F616EMO) [15:59:35] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11663557 (Ladsgroup) >>! In T414805#11663515, @AntiCompositeNumber wrote: > We intentionally don't serve WebP originals for browser...
[16:05:04] !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db2230.codfw.wmnet with OS trixie [16:06:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T418465)', diff saved to https://phabricator.wikimedia.org/P89530 and previous config saved to /var/cache/conftool/dbconfig/20260302-160607-marostegui.json [16:06:11] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:08:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:39] (PS3) RLazarus: alertmanager: Also add ServiceOps to mw-cron tasks for unstewarded components [puppet] - https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: A smart kitten) [16:09:20] (CR) Elukey: [C:+1] Add support for POP Nokia switches [homer/public] - https://gerrit.wikimedia.org/r/1243875 (https://phabricator.wikimedia.org/T408892) (owner: Ayounsi) [16:10:51] (CR) Marostegui: sre.mysql.clone: record clone runs into Zarcillo (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto) [16:11:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P89531 and previous config saved to /var/cache/conftool/dbconfig/20260302-161102-marostegui.json [16:11:30] (CR) RLazarus: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: A smart kitten) [16:13:23] FIRING: [3x] JobUnavailable: Reduced availability for job
mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:19:07] !log installing PAM security updates on Bookworm [16:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:41] (CR) Herron: [C:+1] slothslos: normalize file extensions during copy [puppet] - https://gerrit.wikimedia.org/r/1247005 (https://phabricator.wikimedia.org/T414579) (owner: Tiziano Fogli) [16:20:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:30] (CR) Herron: [C:+1] ldap_users_sync.py: format code [puppet] - https://gerrit.wikimedia.org/r/1243062 (https://phabricator.wikimedia.org/T418118) (owner: Tiziano Fogli) [16:20:59] (CR) Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto) [16:21:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P89532 and previous config saved to /var/cache/conftool/dbconfig/20260302-162115-marostegui.json [16:21:20] (CR) Herron: [C:+1] grafana/ldap_users_sync: delete a user if it has invalid metadata [puppet] - https://gerrit.wikimedia.org/r/1247076 (https://phabricator.wikimedia.org/T418118) (owner: Tiziano Fogli) [16:21:38] (PS1) CDanis: haproxy: tweak policy [puppet] - https://gerrit.wikimedia.org/r/1247098 [16:21:44] !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2230.codfw.wmnet with reason: host reimage [16:22:11] (CR)
Marostegui: sre.mysql.clone: record clone runs into Zarcillo (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto) [16:22:55] ops-codfw, SRE, collaboration-services, Continuous-Integration-Infrastructure, and 3 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11663798 (Dzahn) a:Dzahn [16:23:30] Puppet, collaboration-services, Gerrit: Gerrit git replication should not break when Puppet changes its config - https://phabricator.wikimedia.org/T416929#11663806 (ABran-WMF) p:Triage→Low [16:25:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:25:50] (PS1) JMeybohm: conftool-data: Fix YAML syntax [puppet] - https://gerrit.wikimedia.org/r/1247099 (https://phabricator.wikimedia.org/T418259) [16:25:52] (PS1) JMeybohm: Add wikikube-worker[1350-1351] [puppet] - https://gerrit.wikimedia.org/r/1247100 (https://phabricator.wikimedia.org/T418259) [16:26:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P89533 and previous config saved to /var/cache/conftool/dbconfig/20260302-162610-marostegui.json [16:26:51] (CR) Tiziano Fogli: [C:+2] slothslos: normalize file extensions during copy [puppet] - https://gerrit.wikimedia.org/r/1247005 (https://phabricator.wikimedia.org/T414579) (owner: Tiziano Fogli) [16:27:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:29:23] !log fceratto@cumin1003 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 2:00:00 on db2230.codfw.wmnet with reason: host reimage [16:29:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:30:04] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1630). [16:31:43] SRE, Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11663880 (MoritzMuehlenhoff) [16:32:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:33:23] FIRING: [4x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:33:23] FIRING: [3x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:49] (CR) Ayounsi: [V:+1 C:+2] Add support for POP Nokia switches [homer/public] - https://gerrit.wikimedia.org/r/1243875 (https://phabricator.wikimedia.org/T408892) (owner: Ayounsi) [16:36:23] !log
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P89534 and previous config saved to /var/cache/conftool/dbconfig/20260302-163622-marostegui.json [16:37:12] (Merged) jenkins-bot: Add support for POP Nokia switches [homer/public] - https://gerrit.wikimedia.org/r/1243875 (https://phabricator.wikimedia.org/T408892) (owner: Ayounsi) [16:37:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:39:31] SRE-SLO, ServiceOps new, Essential-Work, iPoid-Service (iPoid 1.0), Product Safety and Integrity (Sprint (Mar 2 - Mar 20)): IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11663955 (OKryva-WMF) [16:39:47] ops-codfw, SRE, cloud-services-team, DC-Ops, procurement: cloudgw2004-dev service implementation - https://phabricator.wikimedia.org/T418765 (Andrew) NEW [16:39:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:39:54] ops-codfw, SRE, cloud-services-team, DC-Ops, procurement: cloudgw2004-dev service implementation - https://phabricator.wikimedia.org/T418765#11663972 (Andrew) p:Triage→Medium [16:41:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169
(T418465)', diff saved to https://phabricator.wikimedia.org/P89535 and previous config saved to /var/cache/conftool/dbconfig/20260302-164118-marostegui.json [16:41:22] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:41:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:41:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1184 (T418465)', diff saved to https://phabricator.wikimedia.org/P89536 and previous config saved to /var/cache/conftool/dbconfig/20260302-164141-marostegui.json [16:44:56] (CR) A smart kitten: [C:+1] "(Just noting that I don't immediately understand what the puppet-compiler failure relates to. I'm assuming someone'll tell me if I need to" [puppet] - https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: A smart kitten) [16:45:18] (PS1) DDesouza: Undeploy Comparative Reader Research survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1247105 (https://phabricator.wikimedia.org/T417829) [16:46:42] (PS1) Herron: mwlog[12]003: apply role [puppet] - https://gerrit.wikimedia.org/r/1247106 (https://phabricator.wikimedia.org/T417002) [16:46:48] (PS1) DDesouza: Undeploy Comparative Reader Research survey on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1247107 (https://phabricator.wikimedia.org/T417834) [16:47:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:51:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T418465)', diff saved to https://phabricator.wikimedia.org/P89537 and previous 
config saved to /var/cache/conftool/dbconfig/20260302-165129-marostegui.json [16:51:33] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:51:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2146.codfw.wmnet with reason: Maintenance [16:51:53] (PS1) Andrew Bogott: Add site/preseed for cloudcephosd2008-dev [puppet] - https://gerrit.wikimedia.org/r/1247108 (https://phabricator.wikimedia.org/T416396) [16:51:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T418465)', diff saved to https://phabricator.wikimedia.org/P89538 and previous config saved to /var/cache/conftool/dbconfig/20260302-165153-marostegui.json [16:52:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:52:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T418465)', diff saved to https://phabricator.wikimedia.org/P89539 and previous config saved to /var/cache/conftool/dbconfig/20260302-165240-marostegui.json [16:52:51] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2230.codfw.wmnet with OS trixie [16:54:20] (CR) Andrew Bogott: [C:+2] Add site/preseed for cloudcephosd2008-dev [puppet] - https://gerrit.wikimedia.org/r/1247108 (https://phabricator.wikimedia.org/T416396) (owner: Andrew Bogott) [16:54:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247107 (https://phabricator.wikimedia.org/T417834) (owner: DDesouza) [16:54:38] (CR) 
ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247105 (https://phabricator.wikimedia.org/T417829) (owner: DDesouza) [16:58:40] (PS6) Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) [17:01:50] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11664154 (calbon) I approve of this. [17:03:21] (CR) RLazarus: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8170/console" [puppet] - https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: A smart kitten) [17:03:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T418465)', diff saved to https://phabricator.wikimedia.org/P89540 and previous config saved to /var/cache/conftool/dbconfig/20260302-170331-marostegui.json [17:03:35] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:05:45] (CR) Blake: [C:+1] alertmanager: Also add ServiceOps to mw-cron tasks for unstewarded components [puppet] - https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: A smart kitten) [17:07:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P89541 and previous config saved to /var/cache/conftool/dbconfig/20260302-170748-marostegui.json [17:09:57] ops-codfw, 
DC-Ops: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418771 (phaultfinder) NEW [17:11:04] (PS1) CDobbins: Revert "conftool: remove ats-be from cp20[43-58]" [puppet] - https://gerrit.wikimedia.org/r/1247110 [17:11:17] (PS2) CDobbins: Revert "conftool: remove ats-be from cp20[43-58]" [puppet] - https://gerrit.wikimedia.org/r/1247110 [17:12:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:12:55] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11664238 (MBinder_WMF) Hmm, weird. I could have sworn we already disabled that old account because we've discovered this confusion before. I just went... [17:14:28] SRE, Infrastructure-Foundations, netops: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772 (Papaul) NEW [17:15:30] (CR) CDobbins: [C:+2] Revert "conftool: remove ats-be from cp20[43-58]" [puppet] - https://gerrit.wikimedia.org/r/1247110 (owner: CDobbins) [17:16:24] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11664288 (MoritzMuehlenhoff) Ok! I've just enabled "wmf" for your "mbinder" account. Let me know if all works fine, then I'll go ahead and disable the... 
[17:17:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:17:35] (PS1) CDobbins: Revert^2 "conftool: remove ats-be from cp20[43-58]" [puppet] - https://gerrit.wikimedia.org/r/1247113 [17:18:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P89542 and previous config saved to /var/cache/conftool/dbconfig/20260302-171839-marostegui.json [17:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:19:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:20:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:20:58] SRE, SRE-Access-Requests: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221#11664340 (thcipriani) >>! 
In T418221#11663256, @EMcFarland-WMF wrote: > @thcipriani I looked and my request for Spiderpig-access is still pending. Could you please grant me acce... [17:22:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:22:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:22:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P89543 and previous config saved to /var/cache/conftool/dbconfig/20260302-172256-marostegui.json [17:23:52] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [17:24:01] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [17:24:44] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:25:13] (PS1) CDobbins: conftool: remove ats-be from cp20[43-58] [puppet] - https://gerrit.wikimedia.org/r/1247115 (https://phabricator.wikimedia.org/T418161) [17:27:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:27:15] (PS1) C. Scott Ananian: Enable parser survey for opted-out users on German & French wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) [17:27:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:28:10] (CR) BCornwall: [C:+1] conftool: remove ats-be from cp20[43-58] [puppet] - https://gerrit.wikimedia.org/r/1247115 (https://phabricator.wikimedia.org/T418161) (owner: CDobbins) [17:30:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:32:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:32:18] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [17:32:25] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.update-replication (exit_code=99) [17:33:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P89544 and previous config saved to /var/cache/conftool/dbconfig/20260302-173347-marostegui.json [17:33:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:33:51] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [17:34:08] SRE, Infrastructure-Foundations, netops, Prod-Kubernetes, ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11664460 (JMeybohm) [17:34:11] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [17:34:13] (CR) Clément Goubert: [C:+1] conftool-data: Fix YAML syntax [puppet] - https://gerrit.wikimedia.org/r/1247099 (https://phabricator.wikimedia.org/T418259) (owner: JMeybohm) [17:34:44] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:34:56] (CR) Clément Goubert: [C:+1] Add wikikube-worker[1350-1351] [puppet] - https://gerrit.wikimedia.org/r/1247100 (https://phabricator.wikimedia.org/T418259) (owner: JMeybohm) [17:35:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:36:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [17:36:42] (CR) CDobbins: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8171/console" [puppet] - https://gerrit.wikimedia.org/r/1247115 
(https://phabricator.wikimedia.org/T418161) (owner: CDobbins) [17:37:00] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [17:38:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T418465)', diff saved to https://phabricator.wikimedia.org/P89545 and previous config saved to /var/cache/conftool/dbconfig/20260302-173803-marostegui.json [17:38:10] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:38:17] (CR) CDobbins: [V:+1 C:+2] conftool: remove ats-be from cp20[43-58] [puppet] - https://gerrit.wikimedia.org/r/1247115 (https://phabricator.wikimedia.org/T418161) (owner: CDobbins) [17:38:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1186.eqiad.wmnet with reason: Maintenance [17:38:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T418465)', diff saved to https://phabricator.wikimedia.org/P89546 and previous config saved to /var/cache/conftool/dbconfig/20260302-173827-marostegui.json [17:39:31] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:39:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:42:59] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding contint2003 to codfw - jhancock@cumin2002" [17:43:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding contint2003 to codfw - jhancock@cumin2002" [17:43:04] !log jhancock@cumin2002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [17:43:33] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host contint2003 [17:43:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host contint2003 [17:44:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2095.codfw.wmnet with OS bullseye [17:44:11] ops-codfw, SRE, SRE-swift-storage, DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11664532 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2095.codfw.wmnet with OS bullseye [17:44:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:44:39] (PS2) C. Scott Ananian: Enable parser survey for opted-out users on German & French wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) [17:45:25] (CR) Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto) [17:45:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:46:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:41] (CR) Marostegui: [C:+1] sre.mysql.clone: 
record clone runs into Zarcillo [cookbooks] - https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: Federico Ceratto) [17:47:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:48:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T418465)', diff saved to https://phabricator.wikimedia.org/P89547 and previous config saved to /var/cache/conftool/dbconfig/20260302-174854-marostegui.json [17:48:59] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:49:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T418465)', diff saved to https://phabricator.wikimedia.org/P89548 and previous config saved to /var/cache/conftool/dbconfig/20260302-174903-marostegui.json [17:49:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2153.codfw.wmnet with reason: Maintenance [17:49:12] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:49:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T418465)', diff saved to https://phabricator.wikimedia.org/P89549 and previous config saved to /var/cache/conftool/dbconfig/20260302-174917-marostegui.json [17:50:01] SRE, Infrastructure-Foundations, netops, Prod-Kubernetes, ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11664578 (Papaul) p:Triage→High [17:50:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:52:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:52:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding contint2003 to codfw - jhancock@cumin2002" [17:52:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding contint2003 to codfw - jhancock@cumin2002" [17:52:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host contint2003 [17:53:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:53:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host contint2003 [17:54:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:57:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn 
[17:58:18] jhancock@cumin2002 provision (PID 2964345) is awaiting input [17:58:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:59:17] jouncebot: nowandnext [17:59:17] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [17:59:17] In 0 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1800) [17:59:17] In 0 hour(s) and 0 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1800) [17:59:38] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11664624 (Aklapper) [Please review project tags and subscribers when creating s... [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1800) [18:00:05] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T1800). 
[18:01:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:01:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:02:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T418465)', diff saved to https://phabricator.wikimedia.org/P89550 and previous config saved to /var/cache/conftool/dbconfig/20260302-180245-marostegui.json [18:02:50] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:03:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:04:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P89551 and previous config saved to /var/cache/conftool/dbconfig/20260302-180411-marostegui.json [18:04:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:05:09] (PS1) Dzahn: miscweb: fix typo in image name for status release [deployment-charts] - https://gerrit.wikimedia.org/r/1247125 (https://phabricator.wikimedia.org/T414098) [18:05:37] (CR) Dzahn: [C:+2] miscweb: fix typo in image name for status release 
[deployment-charts] - https://gerrit.wikimedia.org/r/1247125 (https://phabricator.wikimedia.org/T414098) (owner: Dzahn) [18:05:55] (CR) Dzahn: [C:+2] miscweb: add release for status.wikimedia.org (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) (owner: Dzahn) [18:06:18] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11664656 (Jclark-ctr) Open→Resolved [18:08:13] (Merged) jenkins-bot: miscweb: fix typo in image name for status release [deployment-charts] - https://gerrit.wikimedia.org/r/1247125 (https://phabricator.wikimedia.org/T414098) (owner: Dzahn) [18:11:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:12:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:13:39] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11664686 (Ladsgroup) Based on comparing https://caniuse.com/?search=WebP and ht... 
[18:16:11] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11664716 (DTotten-WMF) My public key is `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKOnXaorvT6RAntmeg17rQ282mK2TXiF4jQjxxBRJ1uz wiki@Danielles-MBP-2` [18:16:32] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [18:16:41] ops-codfw, SRE, DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11664717 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [18:17:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:17:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P89552 and previous config saved to /var/cache/conftool/dbconfig/20260302-181753-marostegui.json [18:18:12] (PS2) CDanis: haproxy: tweak policy [puppet] - https://gerrit.wikimedia.org/r/1247098 [18:18:41] (PS3) C. 
Scott Ananian: Enable parser survey for opted-out users on German/French/Polish wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) [18:19:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P89553 and previous config saved to /var/cache/conftool/dbconfig/20260302-181918-marostegui.json [18:20:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:20:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1240721 (https://phabricator.wikimedia.org/T417429) (owner: Esanders) [18:21:18] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11664738 (Ladsgroup) For the sake of bookkeeping. This is 1:128 sample of requests to non-standard sizes that haven't been blocked... 
[18:21:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [18:21:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [18:21:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [18:23:34] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11664749 (VRiley-WMF) Very strange, yes, when I logged into it, it seems fine. I will reboot the iDRAC. I will let you know. 
[18:24:51] (03CR) 10CDanis: [C:03+2] haproxy: silent-drop: lower limit [puppet] - 10https://gerrit.wikimedia.org/r/1196723 (owner: 10CDanis) [18:24:54] (03CR) 10CDanis: [C:03+2] haproxy: tweak policy [puppet] - 10https://gerrit.wikimedia.org/r/1247098 (owner: 10CDanis) [18:25:25] FIRING: [4x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:28:06] 10ops-eqiad, 06DC-Ops: Double checking labels on servers - https://phabricator.wikimedia.org/T418783 (10VRiley-WMF) 03NEW [18:30:25] FIRING: [6x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:50] (03PS5) 10Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) [18:31:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [18:31:28] (03CR) 10Federico Ceratto: "I updated the CR to also capture failure and if the cookbook is interrupted with ctrl-c." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [18:31:37] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [18:31:53] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [18:32:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:33:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P89554 and previous config saved to /var/cache/conftool/dbconfig/20260302-183300-marostegui.json [18:33:19] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 3 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11664802 (10Jhancock.wm) servers racked and the idrac ip is set up. giving me a bit of trouble with the provisioning... 
[18:33:51] (03PS1) 10CDanis: haproxy: fix logging [puppet] - 10https://gerrit.wikimedia.org/r/1247126 [18:34:15] (03PS2) 10CDanis: haproxy: fix logging [puppet] - 10https://gerrit.wikimedia.org/r/1247126 [18:34:15] (03CR) 10CI reject: [V:04-1] haproxy: fix logging [puppet] - 10https://gerrit.wikimedia.org/r/1247126 (owner: 10CDanis) [18:34:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T418465)', diff saved to https://phabricator.wikimedia.org/P89555 and previous config saved to /var/cache/conftool/dbconfig/20260302-183425-marostegui.json [18:34:30] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:34:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1195.eqiad.wmnet with reason: Maintenance [18:34:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T418465)', diff saved to https://phabricator.wikimedia.org/P89556 and previous config saved to /var/cache/conftool/dbconfig/20260302-183449-marostegui.json [18:35:25] FIRING: [7x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:38:40] (03CR) 10CDanis: [C:03+2] haproxy: fix logging [puppet] - 10https://gerrit.wikimedia.org/r/1247126 (owner: 10CDanis) [18:38:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:40:25] RESOLVED: [7x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on 
wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:40:45] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:41:55] (03CR) 10BCornwall: [C:03+1] haproxy: fix logging [puppet] - 10https://gerrit.wikimedia.org/r/1247126 (owner: 10CDanis) [18:42:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:42:31] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1018 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [18:42:59] (03CR) 10Aklapper: [V:03+2 C:03+2] "Works locally" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) (owner: 10Pppery) [18:43:22] (03CR) 10Aklapper: [V:03+2] Remove `projects/phabricator_ext/README` [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1245511 (owner: 10Pppery) [18:44:39] (03PS7) 10Pppery: [Don't merge yet] Add locales for all remaining languages [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221224 (https://phabricator.wikimedia.org/T412651) [18:45:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T418465)', diff saved to https://phabricator.wikimedia.org/P89557 and previous config saved to /var/cache/conftool/dbconfig/20260302-184524-marostegui.json [18:45:28] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and 
cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:45:40] FIRING: [8x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:45:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:45:47] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:46:31] (03CR) 10Volans: sre.mysql.clone: record clone runs into Zarcillo (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [18:47:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:48:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T418465)', diff saved to https://phabricator.wikimedia.org/P89558 and previous config saved to /var/cache/conftool/dbconfig/20260302-184808-marostegui.json [18:48:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [18:48:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T418465)', diff saved to https://phabricator.wikimedia.org/P89559 and previous config saved to /var/cache/conftool/dbconfig/20260302-184832-marostegui.json [18:48:33] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 3 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11664958 (10Dzahn) 
a:05Dzahn→03None [18:48:38] (03CR) 10Kamila Součková: "LGTM except for Ariel's point inline." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [18:48:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:48:47] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:49:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11664963 (10MBinder_WMF) I still can't access https://superset.wikimedia.org/superset/dashboard/686/?native_filters_key=z1JEuUqeuYwWhAoOsDa2e4Fz57XRH5kXU... [18:49:15] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11664964 (10bd808) 05Declined→03Resolved a:03SLyngshede-WMF [18:50:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11664967 (10MBinder_WMF) Actually, I take that back, it just worked! Guess it needed a second, or perhaps a credential refresh. All good now, please disa... 
[18:50:40] RESOLVED: [4x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:52:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:52:31] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1018 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [18:53:03] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 3 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11664977 (10Dzahn) a:05VRiley-WMF→03None Hi @VRiley-WMF I tried to filter netbox for "eqiad" + "R440" + "status:... [18:53:36] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [18:53:42] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11664980 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS trixie executed with errors: - cloudcephosd20... 
[18:54:19] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [18:54:28] ops-codfw, SRE, DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11664982 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [18:54:54] (CR) Dzahn: [C:+2] site: add contint1003/2003 with insetup collab role [puppet] - https://gerrit.wikimedia.org/r/1244743 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [18:55:02] (PS3) Dzahn: site: add contint1003/2003 with insetup collab role [puppet] - https://gerrit.wikimedia.org/r/1244743 (https://phabricator.wikimedia.org/T418521) [18:56:48] (PS8) Pppery: [Don't merge yet] Add locales for all remaining languages [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221224 (https://phabricator.wikimedia.org/T412651) [18:57:12] ops-codfw, SRE, DC-Ops: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418771#11664988 (Jhancock.wm) moved power for one server from overloaded lane to less busy one.
no active alert [18:57:19] (03CR) 10Dzahn: [C:03+2] site: add contint1003/2003 with insetup collab role [puppet] - 10https://gerrit.wikimedia.org/r/1244743 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [18:58:19] (03CR) 10Dzahn: [C:03+2] httpbb/miscweb: add tests for wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1240421 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [18:58:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T418465)', diff saved to https://phabricator.wikimedia.org/P89560 and previous config saved to /var/cache/conftool/dbconfig/20260302-185848-marostegui.json [18:58:52] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:59:51] (03CR) 10Dzahn: "Valentin, this was in response to a comment from you. Better than nothing?" [puppet] - 10https://gerrit.wikimedia.org/r/1238798 (owner: 10Dzahn) [19:00:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P89561 and previous config saved to /var/cache/conftool/dbconfig/20260302-190032-marostegui.json [19:02:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:04:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2095.codfw.wmnet with OS bullseye [19:04:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11665052 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2095.codfw.wmnet with OS bullseye executed with errors: -... 
[19:07:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:10:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:12:22] !log dzahn@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [19:12:43] !log dzahn@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:13:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P89562 and previous config saved to /var/cache/conftool/dbconfig/20260302-191355-marostegui.json [19:15:21] (03CR) 10Dzahn: [C:03+2] "deployed on staging without error now. 
though:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240412 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [19:15:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P89563 and previous config saved to /var/cache/conftool/dbconfig/20260302-191539-marostegui.json [19:21:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [19:25:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:25:38] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11665163 (10ssingh) [19:25:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:27:09] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:27:23] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:27:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:29:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P89564 and previous config saved to /var/cache/conftool/dbconfig/20260302-192903-marostegui.json [19:29:09] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:29:21] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:30:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T418465)', diff saved to https://phabricator.wikimedia.org/P89565 and previous config saved to /var/cache/conftool/dbconfig/20260302-193046-marostegui.json [19:30:50] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:31:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1196.eqiad.wmnet with reason: Maintenance [19:31:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:31:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T418465)', diff saved to https://phabricator.wikimedia.org/P89566 and previous config saved to /var/cache/conftool/dbconfig/20260302-193119-marostegui.json [19:31:21] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1019 is CRITICAL: CRITICAL: 
Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [19:34:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:34:46] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#11665194 (10Dzahn) @Urbanecm I have... [19:34:47] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:35:25] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:40:25] RESOLVED: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:29] 06SRE, 07Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11665213 (10Dzahn) duplicate of 
T384804 fwiw [19:41:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [19:41:21] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1019 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [19:41:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T418465)', diff saved to https://phabricator.wikimedia.org/P89568 and previous config saved to /var/cache/conftool/dbconfig/20260302-194155-marostegui.json [19:41:59] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:42:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:42:55] (03PS1) 10Bking: wdqs: reduce default remediation cooldown from 1hr to 30m [puppet] - 10https://gerrit.wikimedia.org/r/1247144 (https://phabricator.wikimedia.org/T242453) [19:44:02] (03PS2) 10Bking: wdqs: reduce default remediation cooldown from 1hr to 30m [puppet] - 10https://gerrit.wikimedia.org/r/1247144 (https://phabricator.wikimedia.org/T242453) [19:44:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T418465)', diff saved to https://phabricator.wikimedia.org/P89569 and previous config saved to /var/cache/conftool/dbconfig/20260302-194411-marostegui.json [19:44:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2173.codfw.wmnet with reason: Maintenance [19:44:36] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T418465)', diff saved to https://phabricator.wikimedia.org/P89570 and previous config saved to /var/cache/conftool/dbconfig/20260302-194435-marostegui.json [19:45:40] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:46:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [19:46:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243990 (https://phabricator.wikimedia.org/T401739) (owner: 10Medelius) [19:50:30] (03CR) 10Dillon: [C:03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [19:50:40] FIRING: [4x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:52:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:52:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [19:52:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:54:32] (03PS2) 10Aaron Schulz: Add growthexperiments.v0 to $wgRestSandboxSpecs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470) [19:55:40] FIRING: [4x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:55:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T418465)', diff saved to https://phabricator.wikimedia.org/P89571 and previous config saved to /var/cache/conftool/dbconfig/20260302-195549-marostegui.json [19:55:53] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, 
and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:56:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [19:56:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470) (owner: 10Aaron Schulz) [19:57:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P89572 and previous config saved to /var/cache/conftool/dbconfig/20260302-195702-marostegui.json [19:57:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:00:40] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:43] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [20:00:57] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11665294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS trixie executed with errors: - cloudcephosd20... 
[20:01:10] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [20:01:19] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11665296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd2008-dev.codfw.wmnet with OS trixie [20:02:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:05:40] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:07:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:09:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:10:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [20:10:40] FIRING: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P89573 and previous config saved to /var/cache/conftool/dbconfig/20260302-201057-marostegui.json [20:12:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P89574 and previous config saved to /var/cache/conftool/dbconfig/20260302-201209-marostegui.json [20:12:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:12:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:15:40] RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:18] (03PS2) 10Daniel Kinzler: rest-gateway: use rlc claim from 
cookie with bearer token [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) [20:16:27] (03CR) 10CI reject: [V:04-1] rest-gateway: use rlc claim from cookie with bearer token [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [20:16:29] (03CR) 10Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [20:17:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:17:29] (03CR) 10BPirkle: [C:03+1] Add growthexperiments.v0 to $wgRestSandboxSpecs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470) (owner: 10Aaron Schulz) [20:17:41] (03PS3) 10Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) [20:18:40] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11665357 (10Tacsipacsi) TIFFs (e.g. https://commons.wikimedia.org/wiki/Category:T... 
[20:24:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:25:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:26:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P89575 and previous config saved to /var/cache/conftool/dbconfig/20260302-202604-marostegui.json [20:26:27] (03PS1) 10Herron: mwlog[12]003: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1247106 (https://phabricator.wikimedia.org/T417002) [20:27:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T418465)', diff saved to https://phabricator.wikimedia.org/P89576 and previous config saved to /var/cache/conftool/dbconfig/20260302-202716-marostegui.json [20:27:21] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:27:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1206.eqiad.wmnet with reason: Maintenance [20:27:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T418465)', diff saved to https://phabricator.wikimedia.org/P89577 and previous config saved to 
/var/cache/conftool/dbconfig/20260302-202740-marostegui.json [20:27:51] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11665376 (10Andrew) Is the server that's meant to fill "T412568: Y26 - Q3:codfw:(1) - Refresh cloudcephmon2004-dev - Config B" ? Or is there a different procurement associated with this? [20:27:56] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243700 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [20:28:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247144 (https://phabricator.wikimedia.org/T242453) (owner: 10Bking) [20:30:08] (03CR) 10Bking: [C:03+2] wdqs: reduce default remediation cooldown from 1hr to 30m [puppet] - 10https://gerrit.wikimedia.org/r/1247144 (https://phabricator.wikimedia.org/T242453) (owner: 10Bking) [20:30:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:31:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:33:23] FIRING: [3x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:33:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:34:05] (03PS6) 10Daniel Kinzler: rest gateway: expose headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240388 (https://phabricator.wikimedia.org/T417780) [20:34:17] FIRING: [4x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:35:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:36:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:37:33] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1017 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:38:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T418465)', diff saved to https://phabricator.wikimedia.org/P89578 and previous config saved to /var/cache/conftool/dbconfig/20260302-203759-marostegui.json [20:38:03] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:38:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:38:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:39:17] FIRING: [10x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:39:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:40:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:40:25] FIRING: [5x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:41:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:41:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T418465)', diff saved to https://phabricator.wikimedia.org/P89579 and previous config saved to /var/cache/conftool/dbconfig/20260302-204112-marostegui.json [20:41:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2174.codfw.wmnet with reason: Maintenance [20:41:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T418465)', diff saved to 
https://phabricator.wikimedia.org/P89580 and previous config saved to /var/cache/conftool/dbconfig/20260302-204136-marostegui.json [20:41:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:42:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:42:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:42:33] (03PS1) 10Dwisehaupt: Move fundrasing read db handle to frdb1007 [dns] - 10https://gerrit.wikimedia.org/r/1247147 [20:42:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:43:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) (owner: 10C. 
Scott Ananian) [20:44:17] FIRING: [11x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:25] FIRING: [6x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:47:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:47:33] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1017 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:48:48] andrew@cumin2002 reimage (PID 3029496) is awaiting input [20:49:17] FIRING: [11x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:49:21] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:50:24] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:50:25] FIRING: [11x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:50:25] (03PS1) 10Catrope: ApiCSPReport: Use structured logging for CSP reports [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1247149 [20:51:41] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:52:37] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:53:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P89581 and previous config saved to /var/cache/conftool/dbconfig/20260302-205307-marostegui.json [20:54:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T418465)', diff saved to https://phabricator.wikimedia.org/P89582 and previous config saved to /var/cache/conftool/dbconfig/20260302-205411-marostegui.json [20:54:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-backup2003 to codfw - jhancock@cumin2002" [20:54:15] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:54:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-backup2003 to codfw - jhancock@cumin2002" [20:54:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:55:25] FIRING: [11x] SystemdUnitFailed: 
wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:56:08] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-backup2003 [20:56:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-backup2003 [20:56:24] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-backup2004 [20:56:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-backup2004 [20:56:38] (03CR) 10Eevans: [V:03+2 C:03+2] Add (phony) password for linked_artifacts Cassandra role [labs/private] - 10https://gerrit.wikimedia.org/r/1243986 (https://phabricator.wikimedia.org/T418420) (owner: 10Eevans) [20:56:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:58:08] (03CR) 10Eevans: [C:03+2] cassandra: add new 'linked_artifacts' role (user) [puppet] - 10https://gerrit.wikimedia.org/r/1243985 (https://phabricator.wikimedia.org/T418420) (owner: 10Eevans) [20:58:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-backup2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:58:19] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:58:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-backup2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:58:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:59:21] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [20:59:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1247149 (owner: 10Catrope) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T2100). Please do the needful. [21:00:05] danisztls, Kemayo, AaronSchulz, Pppery, cscott, and RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:09] here [21:00:12] o/ [21:00:25] FIRING: [11x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:33] I can deploy [21:01:06] o/ [21:01:37] o/ [21:01:41] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:01:43] I can self-deploy [21:01:57] Oh that would be great, please do [21:02:10] danisztls: Could you go first actually? I'm not quite ready yet [21:02:18] (03PS1) 10Bking: wdqs: do not monitor the blazegraph auto-remediation systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1247151 (https://phabricator.wikimedia.org/T242453) [21:02:21] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:02:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247107 (https://phabricator.wikimedia.org/T417834) (owner: 10DDesouza) [21:02:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247105 (https://phabricator.wikimedia.org/T417829) (owner: 10DDesouza) [21:02:46] RoanKattouw: sure [21:02:55] o/ (but I need to verify that the translations I need are already deployed) [21:03:12] I can also self-deploy, if that'd be easier. 
[21:03:16] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247151 (https://phabricator.wikimedia.org/T242453) (owner: 10Bking) [21:03:25] (03Merged) 10jenkins-bot: Undeploy Comparative Reader Research survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247107 (https://phabricator.wikimedia.org/T417834) (owner: 10DDesouza) [21:03:28] (03Merged) 10jenkins-bot: Undeploy Comparative Reader Research survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247105 (https://phabricator.wikimedia.org/T417829) (owner: 10DDesouza) [21:03:45] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1247107|Undeploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1247105|Undeploy Comparative Reader Research survey on enwiki (T417829)]] [21:03:50] T417834: Comparative Reader Research (Current Readers - ES) Deployment - https://phabricator.wikimedia.org/T417834 [21:03:50] T417829: Comparative Reader Research (Current Readers - EN) Deployment - https://phabricator.wikimedia.org/T417829 [21:04:26] (03PS2) 10Bking: wdqs: do not monitor the blazegraph auto-remediation systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1247151 (https://phabricator.wikimedia.org/T242453) [21:05:25] FIRING: [11x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:35] !log dani@deploy2002 dani: Backport for [[gerrit:1247107|Undeploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1247105|Undeploy Comparative Reader Research survey on enwiki (T417829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:06:39] !log dani@deploy2002 dani: Continuing with sync [21:08:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P89583 and previous config saved to /var/cache/conftool/dbconfig/20260302-210813-marostegui.json [21:08:37] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:09:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P89584 and previous config saved to /var/cache/conftool/dbconfig/20260302-210919-marostegui.json [21:09:59] Kemayo: Yes please, go ahead and do that once danisztls is done [21:10:05] (03CR) 10Gergő Tisza: rest-gateway: use rlc claim from cookie with bearer token (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [21:10:25] FIRING: [8x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) (owner: 10C. 
Scott Ananian) [21:10:37] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247107|Undeploy Comparative Reader Research survey on eswiki (T417834)]], [[gerrit:1247105|Undeploy Comparative Reader Research survey on enwiki (T417829)]] (duration: 06m 52s) [21:10:42] T417834: Comparative Reader Research (Current Readers - ES) Deployment - https://phabricator.wikimedia.org/T417834 [21:10:42] T417829: Comparative Reader Research (Current Readers - EN) Deployment - https://phabricator.wikimedia.org/T417829 [21:10:46] Kemayo: I'm done [21:11:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-backup2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:11:49] (03CR) 10Eevans: [C:03+2] cassandra: Java 8 no longer supported [puppet] - 10https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [21:12:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-backup2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:12:14] RoanKattouw: I can't deploy my patch yet, the required translations aren't deployed. So I've removed my config patch from the queue. 
[21:12:21] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:12:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243990 (https://phabricator.wikimedia.org/T401739) (owner: 10Medelius) [21:12:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240721 (https://phabricator.wikimedia.org/T417429) (owner: 10Esanders) [21:12:41] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:13:28] (03Merged) 10jenkins-bot: Suggestion Mode: add values for suggestion feedback properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243990 (https://phabricator.wikimedia.org/T401739) (owner: 10Medelius) [21:13:32] (03Merged) 10jenkins-bot: Stop PasteCheck A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240721 (https://phabricator.wikimedia.org/T417429) (owner: 10Esanders) [21:13:49] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1243990|Suggestion Mode: add values for suggestion feedback properties (T401739)]], [[gerrit:1240721|Stop PasteCheck A/B test (T417429)]] [21:13:53] T401739: Introduce a way to offer feedback about Edit Suggestions - https://phabricator.wikimedia.org/T401739 [21:13:54] T417429: [Config] Deploy config change to STOP the Paste Check A/B experiment - https://phabricator.wikimedia.org/T417429 [21:14:38] !log bking@apt1002 reprepro --component thirdparty/opensearch3 update trixie-wikimedia T418388 [21:14:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:41] T418388: Upgrade DSE k8s opensearch clusters to 3.5.0 - https://phabricator.wikimedia.org/T418388 [21:15:25] FIRING: [6x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:36] !log kemayo@deploy2002 esanders, kemayo, caro: Backport for [[gerrit:1243990|Suggestion Mode: add values for suggestion feedback properties (T401739)]], [[gerrit:1240721|Stop PasteCheck A/B test (T417429)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:15:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-backup2003'] [21:15:55] I'll test feedback [21:16:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-backup2003'] [21:16:44] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Testing removal of OpenJDK 8 support - eevans@cumin1003 [21:16:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-backup2003.codfw.wmnet with OS trixie [21:16:48] cmede: sounds good; I've verified that the a/b test is indeed turned off. 
[21:16:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11665598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-backup2003.codfw.wmnet with OS trixie [21:17:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-backup2004.codfw.wmnet with OS trixie [21:17:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:17:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11665600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-backup2004.codfw.wmnet with OS trixie [21:17:29] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11665601 (10colewhite) [21:18:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:19:28] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:19:49] think it's ok Kemayo [21:19:51] FIRING: SwitchCoreInterfaceDown: 
Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:20:25] FIRING: [5x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:47] !log kemayo@deploy2002 esanders, kemayo, caro: Continuing with sync [21:22:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:22:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:22:41] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:23:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T418465)', diff saved to https://phabricator.wikimedia.org/P89585 and previous config saved to /var/cache/conftool/dbconfig/20260302-212321-marostegui.json [21:23:25] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:23:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on 
wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:23:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1218.eqiad.wmnet with reason: Maintenance [21:23:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T418465)', diff saved to https://phabricator.wikimedia.org/P89586 and previous config saved to /var/cache/conftool/dbconfig/20260302-212345-marostegui.json [21:24:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:24:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P89587 and previous config saved to /var/cache/conftool/dbconfig/20260302-212426-marostegui.json [21:24:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:24:44] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:24:44] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243990|Suggestion Mode: add values for suggestion feedback properties (T401739)]], [[gerrit:1240721|Stop PasteCheck A/B test (T417429)]] (duration: 
10m 55s) [21:24:48] T401739: Introduce a way to offer feedback about Edit Suggestions - https://phabricator.wikimedia.org/T401739 [21:24:49] T417429: [Config] Deploy config change to STOP the Paste Check A/B experiment - https://phabricator.wikimedia.org/T417429 [21:24:53] RoanKattouw: I am done. [21:25:06] Great, thanks, I can take it from here [21:25:24] Pppery: Your patch is coming up next [21:25:25] FIRING: [8x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:25:33] OK. Still here [21:25:45] You should probably run namespaceDupes on the wiki after deploying it [21:25:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [21:25:58] Will do, thanks for the reminder [21:26:21] (03CR) 10BCornwall: [C:03+2] Revert "cp2043: Set use_noflow_iface_preup to true" [puppet] - 10https://gerrit.wikimedia.org/r/1244869 (owner: 10BCornwall) [21:26:47] (03Merged) 10jenkins-bot: Add Comments namespace for shnwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [21:26:48] (03PS2) 10Daniel Kinzler: rest-gateway: no limits for wmcs for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243807 [21:27:06] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1226024|Add Comments namespace for shnwikinews (T414403)]] [21:27:09] T414403: Add Comments: Namespace for shnwikinews - https://phabricator.wikimedia.org/T414403 [21:27:30] (03PS3) 10Daniel Kinzler: rest-gateway: no limits for wmcs for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243807 [21:27:51] FIRING: 
SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:29:00] !log catrope@deploy2002 shivaanshsingh, catrope: Backport for [[gerrit:1226024|Add Comments namespace for shnwikinews (T414403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:29:15] looking [21:29:37] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:29:54] Proceed [21:30:19] !log catrope@deploy2002 shivaanshsingh, catrope: Continuing with sync [21:30:25] FIRING: [14x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:30:59] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2043.codfw.wmnet [21:31:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:32:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:32:33] (03PS4) 10Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer 
token [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) [21:32:42] (03CR) 10Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [21:32:50] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T418771#11665676 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm no new alerts for a few hours [21:32:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:33:19] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:33:21] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:33:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup2003.codfw.wmnet with reason: host reimage [21:33:37] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:33:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup2004.codfw.wmnet with reason: host reimage [21:33:41] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is CRITICAL: CRITICAL: Status 
of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:34:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T418465)', diff saved to https://phabricator.wikimedia.org/P89588 and previous config saved to /var/cache/conftool/dbconfig/20260302-213402-marostegui.json [21:34:06] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:34:13] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1226024|Add Comments namespace for shnwikinews (T414403)]] (duration: 07m 07s) [21:34:16] T414403: Add Comments: Namespace for shnwikinews - https://phabricator.wikimedia.org/T414403 [21:34:53] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:35:25] FIRING: [16x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:35:37] (03PS2) 10Dwisehaupt: Move fundrasing read db handle to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1247147 [21:36:41] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Testing removal of OpenJDK 8 support - eevans@cumin1003 [21:37:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:38:34] PROBLEM - Check unit status of 
wdqs-blazegraph-deadlock-check on wdqs1017 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:38:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup2003.codfw.wmnet with reason: host reimage [21:39:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:39:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T418465)', diff saved to https://phabricator.wikimedia.org/P89589 and previous config saved to /var/cache/conftool/dbconfig/20260302-213934-marostegui.json [21:39:38] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:39:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:39:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2176.codfw.wmnet with reason: Maintenance [21:39:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:39:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T418465)', diff saved to https://phabricator.wikimedia.org/P89590 and previous config saved to /var/cache/conftool/dbconfig/20260302-213957-marostegui.json [21:40:25] 
FIRING: [18x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:22] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1021 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:42:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1014 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:42:22] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1019 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:42:35] RoanKattow: is it time to run NamespaceDupes now? 
[21:42:44] RoanKattouw: [21:42:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup2004.codfw.wmnet with reason: host reimage [21:43:01] Yes sorry I forgot, will do [21:43:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:43:29] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp2043.codfw.wmnet [21:43:50] PROBLEM - Ensure traffic_server is running for instance backend on cp2043 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:43:57] (03PS2) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) [21:44:00] RoanKattouw: are you done deploying? 
[21:44:00] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:44:00] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:44:13] (03CR) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [21:44:29] (03PS4) 10Daniel Kinzler: rest-gateway: no limits for wmcs for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243807 [21:44:36] (03PS3) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) [21:44:38] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:45:25] FIRING: [19x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:56] Pppery: Done [21:45:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:46:22] AaronSchulz: One more of my own, then you can go ahead [21:46:32] aye [21:47:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/1247149 (owner: 10Catrope) [21:47:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:48:34] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1017 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:48:43] !log bking@desktop restarting wdqs codfw to clear ProbeDown alerts [21:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:48] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:49:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P89591 and previous config saved to /var/cache/conftool/dbconfig/20260302-214910-marostegui.json [21:49:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2010 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:50:16] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2043.codfw.wmnet with reason: These are test instances, failing should not notif [21:50:25] FIRING: [19x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:50:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T418465)', diff 
saved to https://phabricator.wikimedia.org/P89592 and previous config saved to /var/cache/conftool/dbconfig/20260302-215025-marostegui.json [21:50:29] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:51:20] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:51:22] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1021 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:51:23] (03CR) 10Jgreen: [C:03+1] Move fundrasing read db handle to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/1247147 (owner: 10Dwisehaupt) [21:52:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:52:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:52:22] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1019 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:52:54] (03Merged) 10jenkins-bot: ApiCSPReport: Use structured logging for CSP reports [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1247149 (owner: 10Catrope) [21:52:54] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2007 is OK: OK: Status of 
the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:53:13] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1247149|ApiCSPReport: Use structured logging for CSP reports]] [21:53:18] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:53:26] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:53:42] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2014 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:54:54] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [21:55:04] !log catrope@deploy2002 catrope: Backport for [[gerrit:1247149|ApiCSPReport: Use structured logging for CSP reports]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:55:25] FIRING: [19x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:55:34] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd2008-dev - https://phabricator.wikimedia.org/T416396#11665781 (10Jhancock.wm) yes, that's the one [21:56:57] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:57:16] RESOLVED: [3x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:57:32] !log catrope@deploy2002 catrope: Continuing with sync [21:58:53] (03PS1) 10Eevans: wmnet: add linked-artifacts CNAME record for k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1247172 (https://phabricator.wikimedia.org/T414112) [21:59:38] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2008 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [22:00:03] jhancock@cumin2002 reimage (PID 3054661) is awaiting input [22:00:05] Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T2200). 
[22:00:25] FIRING: [18x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:01:20] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1020 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [22:01:26] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [22:01:33] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247149|ApiCSPReport: Use structured logging for CSP reports]] (duration: 08m 19s) [22:01:49] AaronSchulz: I'm done, go ahead [22:01:59] cool [22:02:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470) (owner: 10Aaron Schulz) [22:02:16] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:03:03] (03Merged) 10jenkins-bot: Add growthexperiments.v0 to $wgRestSandboxSpecs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470) (owner: 10Aaron Schulz) [22:03:07] AaronSchulz: are you deploying something. 
I would like to deploy a security patch in 10 minutes [22:03:13] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:03:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:03:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup2004.codfw.wmnet with OS trixie [22:03:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup2003.codfw.wmnet with OS trixie [22:03:20] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11665798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-backup2004.codfw.wmnet with OS trixie completed: - ms-backup200... [22:03:22] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1242613|Add growthexperiments.v0 to $wgRestSandboxSpecs (T414470)]] [22:03:23] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11665799 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-backup2003.codfw.wmnet with OS trixie completed: - ms-backup200... 
[22:03:25] T414470: Create the first extension MW REST API module - https://phabricator.wikimedia.org/T414470 [22:03:26] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2011 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [22:03:51] maryum: just this one rest sandbox patch, should be quick [22:03:56] no worries [22:04:12] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11665803 (10Jhancock.wm) [22:04:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P89593 and previous config saved to /var/cache/conftool/dbconfig/20260302-220418-marostegui.json [22:04:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11665805 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @jcrespo these are complete and ready for you [22:05:11] !log aaron@deploy2002 aaron: Backport for [[gerrit:1242613|Add growthexperiments.v0 to $wgRestSandboxSpecs (T414470)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:05:25] FIRING: [16x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:05:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P89594 and previous config saved to /var/cache/conftool/dbconfig/20260302-220533-marostegui.json [22:05:54] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock [22:06:05] !log aaron@deploy2002 aaron: Continuing with sync [22:10:02] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242613|Add growthexperiments.v0 to $wgRestSandboxSpecs (T414470)]] (duration: 06m 39s) [22:10:05] T414470: Create the first extension MW REST API module - https://phabricator.wikimedia.org/T414470 [22:10:21] maryum: ok, done [22:10:25] FIRING: [15x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:10:42] AaronSchulz: thanks! 
[22:12:22] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1021 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[22:13:23] (PS1) Eevans: service: add linked-artifact service (k8s ingress) [puppet] - https://gerrit.wikimedia.org/r/1247175 (https://phabricator.wikimedia.org/T414112)
[22:14:17] RESOLVED: [2x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:14:18] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[22:14:38] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2013 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[22:15:25] FIRING: [12x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:15:54] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2022 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[22:15:59] running scap
[22:18:15] (CR) Ryan Kemper: [C:+1] wdqs: do not monitor the blazegraph auto-remediation systemd unit [puppet] - https://gerrit.wikimedia.org/r/1247151 (https://phabricator.wikimedia.org/T242453) (owner: Bking)
[22:18:55] (CR) Bking: [C:+2] wdqs: do not monitor the blazegraph auto-remediation systemd unit [puppet] - https://gerrit.wikimedia.org/r/1247151 (https://phabricator.wikimedia.org/T242453) (owner: Bking)
[22:19:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T418465)', diff saved to https://phabricator.wikimedia.org/P89595 and previous config saved to /var/cache/conftool/dbconfig/20260302-221925-marostegui.json
[22:19:30] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[22:19:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[22:19:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T418465)', diff saved to https://phabricator.wikimedia.org/P89596 and previous config saved to /var/cache/conftool/dbconfig/20260302-221938-marostegui.json
[22:20:23] (PS1) Ryan Kemper: wdqs: Add retry logic and lag-based restart to deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1247177 (https://phabricator.wikimedia.org/T242453)
[22:20:25] FIRING: [9x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:20:26] (PS1) Ryan Kemper: wdqs: Reduce deadlock remediation cooldown to 15 minutes [puppet] - https://gerrit.wikimedia.org/r/1247178 (https://phabricator.wikimedia.org/T242453)
[22:20:35] (PS1) Andrew Bogott: Revert "Add site/preseed for cloudcephosd2008-dev" [puppet] - https://gerrit.wikimedia.org/r/1247179 (https://phabricator.wikimedia.org/T416396)
[22:20:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P89597 and previous config saved to /var/cache/conftool/dbconfig/20260302-222041-marostegui.json
[22:21:49] !log Deployed security fix for T418179
[22:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:54] security deploy finished
[22:22:22] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs1021 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[22:22:53] (PS2) Bking: wdqs: Add retry logic and lag-based restart to deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1247177 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[22:22:56] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1247177 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[22:24:18] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2015 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[22:24:23] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11665913 (Andrew)
[22:25:25] FIRING: [7x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:25:29] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11665921 (Andrew) >>! In T416396#11665781, @Jhancock.wm wrote: > yes, that's the one Ok, then this host needs to be named cloudcephmon2007-dev. I've renamed the task a...
[22:27:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[22:30:22] (CR) Kamila Součková: [C:+1] rest-gateway: no limits for wmcs for now [deployment-charts] - https://gerrit.wikimedia.org/r/1243807 (owner: Daniel Kinzler)
[22:30:25] RESOLVED: [5x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:30:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[22:31:30] (CR) Andrew Bogott: [C:+2] Revert "Add site/preseed for cloudcephosd2008-dev" [puppet] - https://gerrit.wikimedia.org/r/1247179 (https://phabricator.wikimedia.org/T416396) (owner: Andrew Bogott)
[22:31:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T418465)', diff saved to https://phabricator.wikimedia.org/P89598 and previous config saved to /var/cache/conftool/dbconfig/20260302-223135-marostegui.json
[22:31:39] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[22:31:41] (CR) RLazarus: [C:+2] "It's just an issue with the puppet-compiler infra. There's no real way for this patch (template-only with only static text changes) to hav" [puppet] - https://gerrit.wikimedia.org/r/1238369 (https://phabricator.wikimedia.org/T417020) (owner: A smart kitten)
[22:32:25] PROBLEM - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is CRITICAL: CRITICAL: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[22:32:43] andrewbogott: do merge mine too please, if you happened to catch it :) otherwise I'll just go after you
[22:33:03] aha, after you it is :D disregard
[22:33:21] ok :)
[22:34:36] (PS3) Ryan Kemper: wdqs: Add retry logic and lag-based restart to deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1247177 (https://phabricator.wikimedia.org/T242453)
[22:34:37] (PS2) Ryan Kemper: wdqs: Reduce deadlock remediation cooldown to 15 minutes [puppet] - https://gerrit.wikimedia.org/r/1247178 (https://phabricator.wikimedia.org/T242453)
[22:35:40] FIRING: [5x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T418465)', diff saved to https://phabricator.wikimedia.org/P89599 and previous config saved to /var/cache/conftool/dbconfig/20260302-223548-marostegui.json
[22:36:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2188.codfw.wmnet with reason: Maintenance
[22:36:11] (CR) Bking: [C:+1] wdqs: Add retry logic and lag-based restart to deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1247177 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[22:36:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T418465)', diff saved to https://phabricator.wikimedia.org/P89600 and previous config saved to /var/cache/conftool/dbconfig/20260302-223612-marostegui.json
[22:38:53] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:38:59] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[22:39:29] (CR) Bking: [C:+2] wdqs: Add retry logic and lag-based restart to deadlock remediation [puppet] - https://gerrit.wikimedia.org/r/1247177 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[22:42:25] RECOVERY - Check unit status of wdqs-blazegraph-deadlock-check on wdqs2012 is OK: OK: Status of the systemd unit wdqs-blazegraph-deadlock-check https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook%23Blazegraph_deadlock
[22:42:47] (CR) Herron: [C:+2] mwlog[12]003: apply role [puppet] - https://gerrit.wikimedia.org/r/1247106 (https://phabricator.wikimedia.org/T417002) (owner: Herron)
[22:42:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[22:45:40] FIRING: [2x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:46:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P89601 and previous config saved to /var/cache/conftool/dbconfig/20260302-224643-marostegui.json
[22:48:53] RESOLVED: [3x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:49:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T418465)', diff saved to https://phabricator.wikimedia.org/P89602 and previous config saved to /var/cache/conftool/dbconfig/20260302-224954-marostegui.json
[22:49:58] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[22:50:40] FIRING: [4x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[22:53:17] (CR) Bking: [C:+1] wdqs: Reduce deadlock remediation cooldown to 15 minutes [puppet] - https://gerrit.wikimedia.org/r/1247178 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[22:53:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[22:53:53] RESOLVED: [4x] SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:55:58] (CR) Kamila Součková: [C:+1] rest-gateway: use rlc claim from cookie with bearer token [deployment-charts] - https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: Daniel Kinzler)
[22:57:05] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog1003.eqiad.wmnet with OS trixie
[22:58:56] (CR) Kamila Součková: [C:+1] rest-gateway: assign ratelimit class by network range [deployment-charts] - https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler)
[22:59:20] (PS4) Fabfur: varnish: add headers to x-analytics [puppet] - https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864)
[23:01:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P89603 and previous config saved to /var/cache/conftool/dbconfig/20260302-230151-marostegui.json
[23:05:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P89604 and previous config saved to /var/cache/conftool/dbconfig/20260302-230502-marostegui.json
[23:06:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:11:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:16:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T418465)', diff saved to https://phabricator.wikimedia.org/P89605 and previous config saved to /var/cache/conftool/dbconfig/20260302-231658-marostegui.json
[23:17:02] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[23:17:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[23:17:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T418465)', diff saved to https://phabricator.wikimedia.org/P89606 and previous config saved to /var/cache/conftool/dbconfig/20260302-231723-marostegui.json
[23:18:43] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mwlog1003.eqiad.wmnet with reason: host reimage
[23:19:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:20:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P89607 and previous config saved to /var/cache/conftool/dbconfig/20260302-232009-marostegui.json
[23:22:47] (CR) Dwisehaupt: [C:+2] Move fundrasing read db handle to frdb1004 [dns] - https://gerrit.wikimedia.org/r/1247147 (owner: Dwisehaupt)
[23:23:03] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwlog1003.eqiad.wmnet with reason: host reimage
[23:24:35] !log dwisehaupt@dns1006 START - running authdns-update
[23:25:46] !log dwisehaupt@dns1006 END - running authdns-update
[23:29:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T418465)', diff saved to https://phabricator.wikimedia.org/P89608 and previous config saved to /var/cache/conftool/dbconfig/20260302-232918-marostegui.json
[23:29:28] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[23:35:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T418465)', diff saved to https://phabricator.wikimedia.org/P89609 and previous config saved to /var/cache/conftool/dbconfig/20260302-233517-marostegui.json
[23:35:20] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[23:35:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2202.codfw.wmnet with reason: Maintenance
[23:41:30] jouncebot: nowandnext
[23:41:31] For the next 0 hour(s) and 18 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260302T2200)
[23:41:31] In 0 hour(s) and 18 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T0000)
[23:42:36] (CR) Zabe: [C:+2] multiversion: Stop setting MW_USE_CONFIG_SCHEMA [mediawiki-config] - https://gerrit.wikimedia.org/r/1246880 (https://phabricator.wikimedia.org/T304460) (owner: Zabe)
[23:43:29] (Merged) jenkins-bot: multiversion: Stop setting MW_USE_CONFIG_SCHEMA [mediawiki-config] - https://gerrit.wikimedia.org/r/1246880 (https://phabricator.wikimedia.org/T304460) (owner: Zabe)
[23:43:36] (PS2) MGChecker: dewiki: Add abusefilter group [mediawiki-config] - https://gerrit.wikimedia.org/r/1247186 (https://phabricator.wikimedia.org/T418815)
[23:43:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2203.codfw.wmnet with reason: Maintenance
[23:43:48] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1246880|multiversion: Stop setting MW_USE_CONFIG_SCHEMA (T304460)]]
[23:43:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2203 (T418465)', diff saved to https://phabricator.wikimedia.org/P89610 and previous config saved to /var/cache/conftool/dbconfig/20260302-234350-marostegui.json
[23:43:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[23:43:51] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460
[23:43:54] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[23:44:12] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog2003.codfw.wmnet with OS trixie
[23:44:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P89611 and previous config saved to /var/cache/conftool/dbconfig/20260302-234425-marostegui.json
[23:45:38] !log zabe@deploy2002 zabe: Backport for [[gerrit:1246880|multiversion: Stop setting MW_USE_CONFIG_SCHEMA (T304460)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:46:32] (PS3) MGChecker: dewiki: Add abusefilter group [mediawiki-config] - https://gerrit.wikimedia.org/r/1247186 (https://phabricator.wikimedia.org/T418815)
[23:46:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[23:47:02] !log zabe@deploy2002 zabe: Continuing with sync
[23:49:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:49:51] (CR) Zabe: [C:+2] Stop writing to il_to on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1240320 (https://phabricator.wikimedia.org/T415787) (owner: Zabe)
[23:50:43] (Merged) jenkins-bot: Stop writing to il_to on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1240320 (https://phabricator.wikimedia.org/T415787) (owner: Zabe)
[23:50:58] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1246880|multiversion: Stop setting MW_USE_CONFIG_SCHEMA (T304460)]] (duration: 07m 10s)
[23:51:01] T304460: Roll out loading of default settings via SettingsBuilder - https://phabricator.wikimedia.org/T304460
[23:51:44] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp2058.codfw.wmnet with reason: dcops troubleshooting for T418527
[23:51:47] T418527: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527
[23:52:02] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1240320|Stop writing to il_to on testwiki (T415787)]]
[23:52:05] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787
[23:53:51] !log zabe@deploy2002 zabe: Backport for [[gerrit:1240320|Stop writing to il_to on testwiki (T415787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:54:09] !log zabe@deploy2002 zabe: Continuing with sync
[23:54:44] (CR) Zabe: [C:+2] ImageListPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1247068 (https://phabricator.wikimedia.org/T418327) (owner: Zabe)
[23:55:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T418465)', diff saved to https://phabricator.wikimedia.org/P89612 and previous config saved to /var/cache/conftool/dbconfig/20260302-235511-marostegui.json
[23:55:14] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[23:58:04] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240320|Stop writing to il_to on testwiki (T415787)]] (duration: 06m 02s)
[23:58:08] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787
[23:59:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P89613 and previous config saved to /var/cache/conftool/dbconfig/20260302-235933-marostegui.json