[00:01:08] !log catrope@deploy2002 catrope, kharlan: Continuing with sync [00:03:38] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface pfw1-codfw:reth1 () - https://phabricator.wikimedia.org/T419150#11679873 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm limited to maintenance window for frack mgmt vlan migration. no new alerts since. [00:05:03] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248628|Re-enable AllowUserJs (T419137)]] (duration: 08m 08s) [00:13:53] (03PS2) 10Jasmine: wmnet: add sophroid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) [00:13:59] (03PS1) 10Dzahn: aptrepo: add jenkins for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) [00:17:35] Dreamy_Jazz: are you folks done with deploying? [00:19:59] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11679905 (10VRiley-WMF) Re-labeled moss-fe1002 to contint1003 Racked it in B1 U36 CableID: 3720 Port: 28 This unit... 
[00:25:04] zabe: yeah [00:25:20] Alright:) [00:25:52] (03CR) 10Zabe: [C:03+2] Stop writing to il_to on all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248493 (https://phabricator.wikimedia.org/T415787) (owner: 10Zabe) [00:26:44] (03Merged) 10jenkins-bot: Stop writing to il_to on all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248493 (https://phabricator.wikimedia.org/T415787) (owner: 10Zabe) [00:27:07] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248493|Stop writing to il_to on all wikis except commons (T415787)]] [00:27:11] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787 [00:28:58] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248493|Stop writing to il_to on all wikis except commons (T415787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:29:36] !log zabe@deploy2002 zabe: Continuing with sync [00:33:29] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248493|Stop writing to il_to on all wikis except commons (T415787)]] (duration: 06m 22s) [00:33:33] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787 [00:39:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1248646 [00:39:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1248646 (owner: 10TrainBranchBot) [00:44:14] FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:53:03] (03Merged) 
10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1248646 (owner: 10TrainBranchBot) [00:59:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:06:03] (03PS1) 10Zabe: Prepare kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248647 (https://phabricator.wikimedia.org/T414234) [01:08:16] (03CR) 10Zabe: [C:03+2] Prepare kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248647 (https://phabricator.wikimedia.org/T414234) (owner: 10Zabe) [01:09:05] (03Merged) 10jenkins-bot: Prepare kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248647 (https://phabricator.wikimedia.org/T414234) (owner: 10Zabe) [01:09:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1248648 [01:09:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1248648 (owner: 10TrainBranchBot) [01:09:42] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248647|Prepare kaiwiki (T414234)]] [01:09:46] T414234: Create Wikipedia Karai-Karai - https://phabricator.wikimedia.org/T414234 [01:11:49] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248647|Prepare kaiwiki (T414234)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:13:13] !log zabe@deploy2002 zabe: Continuing with sync [01:13:46] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [01:15:11] FIRING: PfwCoreBGPDown: ... 
[01:15:17] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [01:17:07] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248647|Prepare kaiwiki (T414234)]] (duration: 07m 25s) [01:17:11] T414234: Create Wikipedia Karai-Karai - https://phabricator.wikimedia.org/T414234 [01:18:54] FIRING: [5x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:18:59] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:20:26] (03PS1) 10Zabe: Activate kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248650 (https://phabricator.wikimedia.org/T414234) [01:21:08] (03CR) 10Zabe: [C:03+2] Activate kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248650 (https://phabricator.wikimedia.org/T414234) (owner: 10Zabe) [01:22:06] (03Merged) 10jenkins-bot: Activate kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248650 (https://phabricator.wikimedia.org/T414234) (owner: 10Zabe) [01:22:24] (03CR) 10Eevans: "There was a suggestion to use the python-webapp chart, but this uses aqs-http-gateway because a) all of the other Cassandra-connected serv" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248148 (https://phabricator.wikimedia.org/T414112) (owner: 
10Eevans) [01:22:26] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248650|Activate kaiwiki (T414234)]] [01:22:30] T414234: Create Wikipedia Karai-Karai - https://phabricator.wikimedia.org/T414234 [01:23:46] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [01:23:54] FIRING: [5x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:23:59] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:24:19] (03CR) 10Scardenasmolinar: [C:03+1] Enable rr-ml AutoModerator CC Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [01:24:22] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248650|Activate kaiwiki (T414234)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:24:46] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [01:25:11] RESOLVED: PfwCoreBGPDown: ... 
[01:25:17] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [01:25:31] !log zabe@deploy2002 zabe: Continuing with sync [01:27:39] (03PS1) 10Zabe: Prepare urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248652 (https://phabricator.wikimedia.org/T415960) [01:28:06] (03PS1) 10Zabe: Activate urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248653 (https://phabricator.wikimedia.org/T415960) [01:29:24] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248650|Activate kaiwiki (T414234)]] (duration: 06m 57s) [01:29:27] T414234: Create Wikipedia Karai-Karai - https://phabricator.wikimedia.org/T414234 [01:29:46] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [01:30:05] (03CR) 10Zabe: [C:03+2] Prepare urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248652 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe) [01:30:25] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1248648 (owner: 10TrainBranchBot) [01:30:59] (03Merged) 10jenkins-bot: Prepare urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248652 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe) [01:32:18] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248652|Prepare urwikisource (T415960)]] [01:32:22] T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960 [01:34:20] !log zabe@deploy2002 zabe: Backport for 
[[gerrit:1248652|Prepare urwikisource (T415960)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:34:41] !log zabe@deploy2002 zabe: Continuing with sync [01:37:50] (03CR) 10Zabe: [C:03+2] Activate urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248653 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe) [01:38:36] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248652|Prepare urwikisource (T415960)]] (duration: 06m 18s) [01:38:40] (03Merged) 10jenkins-bot: Activate urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248653 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe) [01:38:40] T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960 [01:40:17] (03CR) 10RLazarus: [C:03+1] "LGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [01:42:01] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248653|Activate urwikisource (T415960)]] [01:43:57] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248653|Activate urwikisource (T415960)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:44:01] T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960 [01:45:30] !log zabe@deploy2002 Sync cancelled. 
[01:47:38] (03PS1) 10Zabe: Set urwikisource to rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248656 (https://phabricator.wikimedia.org/T415960) [01:48:05] (03CR) 10Zabe: [C:03+2] Set urwikisource to rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248656 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe) [01:48:46] (03CR) 10CI reject: [V:04-1] Set urwikisource to rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248656 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe) [01:52:13] ^ CI fails due to a self-resolving conflict. It tests if the live rtl wikis match the dblist, but now it complains that I am adding a wiki which is not live (obviously). [01:52:18] (03CR) 10Zabe: [V:03+2 C:03+2] Set urwikisource to rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248656 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe) [01:52:59] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248656|Set urwikisource to rtl (T415960)]] [01:53:03] T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960 [01:54:54] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248656|Set urwikisource to rtl (T415960)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:55:44] !log zabe@deploy2002 zabe: Continuing with sync [01:56:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:59:38] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248656|Set urwikisource to rtl (T415960)]] (duration: 06m 39s) [01:59:41] T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960 [02:00:36] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248658 [02:00:36] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248658 (owner: 10Zabe) [02:00:50] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:01:26] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248658 (owner: 10Zabe) [02:08:54] FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:13] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 23s) [02:09:27] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248658|Update interwiki cache]] [02:11:21] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248658|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[02:12:07] !log zabe@deploy2002 zabe: Continuing with sync [02:12:30] !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https # T415978, T414241 [02:12:34] T415978: Add Wikidata support for urwikisource - https://phabricator.wikimedia.org/T415978 [02:12:35] T414241: Add Wikidata support for kaiwiki - https://phabricator.wikimedia.org/T414241 [02:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:15:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:16:06] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248658|Update interwiki cache]] (duration: 06m 38s) [02:20:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:21:44] !log zabe@deploy2002:/srv/mediawiki-staging$ foreachwiki extensions/TimedMediaHandler/maintenance/migrateTranscodeStates.php --force # T415064 [02:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:47] T415064: Backfill new status and touched columns - https://phabricator.wikimedia.org/T415064 [02:33:54] FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:53:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:56:03] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [02:59:40] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change frack mgmt vlan interface - pt1979@cumin2002" [02:59:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change frack mgmt vlan interface - pt1979@cumin2002" [02:59:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:15:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:20:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:23:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:37:25] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and 
lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [03:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:53:58] (03PS1) 10Catrope: Drop $wgOATHUserHandlesTable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248663 (https://phabricator.wikimedia.org/T416544) [03:56:50] (03PS1) 10Catrope: beta: Enable passwordless login on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248664 (https://phabricator.wikimedia.org/T419198) [03:56:51] (03PS1) 10Catrope: Enable passwordless login in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248665 (https://phabricator.wikimedia.org/T419198) [03:57:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248664 (https://phabricator.wikimedia.org/T419198) (owner: 10Catrope) [03:58:08] (03Merged) 10jenkins-bot: beta: Enable passwordless login on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248664 (https://phabricator.wikimedia.org/T419198) (owner: 10Catrope) [04:15:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:20:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:36:51] 06SRE, 06Infrastructure-Foundations, 10netops, 
06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11680410 (10cmooney) Thinking it through, I think this process could be used to "drain" //ssw1-d1// if we wanted to attempt this without any impact to hosts. * Set //c... [05:23:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:38:31] (03PS2) 10Ryan Kemper: wdqs: cleanup code related to query-legacy-full.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1247933 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [05:39:32] (03PS2) 10Ryan Kemper: wdqs: remove query-legacy-full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [05:39:43] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247933 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [05:43:26] (03CR) 10Ryan Kemper: [C:03+2] wdqs: cleanup code related to query-legacy-full.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1247933 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [05:56:03] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove query-legacy-full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [05:56:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:58:07] (03Merged) 10jenkins-bot: wdqs: remove query-legacy-full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [06:14:14] FIRING: [3x] 
HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:22:21] !log ryankemper@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [06:22:36] !log ryankemper@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [06:23:12] !log ryankemper@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [06:23:20] !log ryankemper@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [06:23:35] !log ryankemper@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [06:23:38] !log ryankemper@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [06:26:42] (03PS1) 10Ryan Kemper: wdqs: remove stale legacy-full-gui release entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248694 (https://phabricator.wikimedia.org/T415073) [06:28:27] (03PS1) 10Clare Ming: Remove Metrics Platform config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248695 (https://phabricator.wikimedia.org/T417568) [06:34:14] FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:41:53] (03CR) 10Arnaudb: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:43:04] (03CR) 10Arnaudb: [C:03+1] ci::website: support 2 different websites, integration vs zuul-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1248118 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:49:28] FIRING: [2x] NodeTextfileStale: Stale textfile for 
wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:56:52] (03CR) 10Muehlenhoff: "Yeah, sure. I'll roll this out next week with Puppet disabled on C:profile::bird::anycast" [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T0700) [07:00:51] (03PS1) 10Muehlenhoff: Create component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1248701 (https://phabricator.wikimedia.org/T419058) [07:23:54] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:26:38] (03PS1) 10Muehlenhoff: Fix file existence check for Promtheus exporter [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/1248702 (https://phabricator.wikimedia.org/T419166) [07:31:18] (03CR) 10Slyngshede: [C:03+1] "Weird, you'd think systemd would complain.... 
Looks good" [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/1248702 (https://phabricator.wikimedia.org/T419166) (owner: 10Muehlenhoff) [07:36:51] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Fix file existence check for Promtheus exporter [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/1248702 (https://phabricator.wikimedia.org/T419166) (owner: 10Muehlenhoff) [07:37:25] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [07:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:45:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T0800) [08:00:30] (03CR) 10Elukey: [C:03+2] elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [08:10:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:11:34] !log imported prometheus-ganeti-exporter 0.3+deb12u2 for bookworm-wikimedia T419166 [08:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:38] T419166: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166 [08:13:32] 06SRE, 13Patch-For-Review, 07Security: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166#11680569 (10MoritzMuehlenhoff) Thanks for the report, the impact for the Prometheus exporter is harmless, the incorrect test only prevented some log spam on Ganeti servers which are... 
[08:15:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:17:24] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11680574 (10MoritzMuehlenhoff) [08:20:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:52] (03PS1) 10Elukey: profile::pyrra: rework wdqs availability SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) [08:21:54] (03PS1) 10Elukey: profile::pyrra: remove old wdqs SLO configs [puppet] - 10https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966) [08:25:03] !log uploaded openjdk-8 8u482-ga-1~deb12u1 to component/jdk8 of bookworm-wikimedia [08:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:10] (03CR) 10Elukey: "Hi Ryan! I reworked a little your patches in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1248760, lemme know your thoughts. The b" [puppet] - 10https://gerrit.wikimedia.org/r/1235891 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [08:42:12] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680597 (10elukey) The UsbPorts setting was added by Riccardo in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/748741 in 202... 
[08:42:54] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-fe1013.eqiad.wmnet [08:45:25] FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:50:25] FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:54:25] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680622 (10MatthewVernon) FWIW, I have no objection to your doing so. [08:54:35] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-fe1013.eqiad.wmnet [08:55:25] FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:25] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-fe1013.eqiad.wmnet [08:56:27] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680630 (10elukey) Of course the BIOS differs too :D ` >>> r = spicerack.redfish('ms-fe1013') Management Password: >>> r.bios_versi... 
[08:57:03] !log elukey@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1013.eqiad.wmnet [09:05:57] (03PS1) 10Tiziano Fogli: thanos/rec_rules: adjust tsdb_head_series:zscore to use nested rec rules [puppet] - 10https://gerrit.wikimedia.org/r/1248766 (https://phabricator.wikimedia.org/T415317) [09:08:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1013.eqiad.wmnet [09:08:39] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-fe1013.eqiad.wmnet [09:09:33] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:10:24] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:11:04] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: ms-fe1013 reports a backplane error - https://phabricator.wikimedia.org/T419010#11680651 (10elukey) 05Open→03Resolved Thanks! [09:11:52] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680654 (10elukey) Applied bios and idrac update to ms-fe1013, then re-run provisioning: ` Skipped set of attribute BIOS.Setup.1-1 ->... 
[09:18:25] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680679 (10elukey) [09:21:32] (03CR) 10Gehel: [C:03+1] wdqs: remove stale legacy-full-gui release entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248694 (https://phabricator.wikimedia.org/T415073) (owner: 10Ryan Kemper) [09:23:22] !log derick@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=pmswiki --logwiki=metawiki Wikilimes Limes.pink # T419184 [09:23:26] T419184: Unblock stuck global rename of Limes.pink - https://phabricator.wikimedia.org/T419184 [09:24:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:24:26] (03Abandoned) 10Itamar Givon: Add new key generated with a security key [puppet] - 10https://gerrit.wikimedia.org/r/921356 (https://phabricator.wikimedia.org/T337037) (owner: 10Itamar Givon) [09:26:47] (03CR) 10Dpogorzelski: [C:03+2] ml-services: bump llm limitranges to enable embeddings isvc deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248531 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [09:27:47] (03CR) 10Muehlenhoff: aptrepo: add jenkins for trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [09:28:03] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Create component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1248701 (https://phabricator.wikimedia.org/T419058) (owner: 10Muehlenhoff) [09:28:05] (03CR) 10Tiziano Fogli: [C:03+2] "Manually tested with promtool, I'm self-merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1248766 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [09:28:52] tappof: shall I merge your patch along? [09:29:05] yes thx [09:29:11] moritzm: [09:30:55] (03PS1) 10Muehlenhoff: mcrounter: Run spec tests on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1248768 [09:31:42] (03PS1) 10Muehlenhoff: Remove pointless spec test [puppet] - 10https://gerrit.wikimedia.org/r/1248769 [09:31:56] tappof: merged [09:34:20] thx moritzm [09:34:55] (03CR) 10Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [09:35:18] (03Merged) 10jenkins-bot: ml-services: bump llm limitranges to enable embeddings isvc deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248531 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [09:39:55] !log repool ms-fe1013 after PXE work T401966 [09:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:59] T401966: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966 [09:41:50] (03PS1) 10Btullis: Grant members of analytics-product-users access to an-coord hosts [puppet] - 10https://gerrit.wikimedia.org/r/1248771 (https://phabricator.wikimedia.org/T419167) [09:42:33] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248771 (https://phabricator.wikimedia.org/T419167) (owner: 10Btullis) [09:49:51] 06SRE, 10ServiceOps-Mediawiki, 06ServiceOps new (Next quarter): Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200#11680749 (10MLechvien-WMF) [09:53:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:53:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:56:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:59:44] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:00:58] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248539 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [10:02:53] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248539 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [10:03:48] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2332-2356].codfw.wmnet [10:04:37] (03CR) 10JMeybohm: [C:04-1] aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [10:06:07] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206 (10BTullis) 03NEW [10:06:11] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11680776 (10BTullis) [10:09:00] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:09:06] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1199.eqiad.wmnet [10:09:09] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:09:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11680788 (10ops-monitoring-bot) Host an-worker1199.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[10:13:43] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1194.eqiad.wmnet [10:14:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11680818 (10ops-monitoring-bot) Host an-worker1194.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:15:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:15:55] FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:16:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680838 (10elukey) Me and Matthew will coordinate next week to upgrade the ms-be and thanos-be hosts one at the time next week :) [10:16:46] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2332-2356].codfw.wmnet [10:19:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:21:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1194.eqiad.wmnet [10:23:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1199.eqiad.wmnet [10:23:21] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2332.codfw.wmnet with OS trixie [10:24:37] 06SRE, 06Infrastructure-Foundations, 10ServiceOps-Upgrades-Hardware, 07Epic, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212 (10MLechvien-WMF) 03NEW [10:24:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:01] 06SRE, 06Infrastructure-Foundations, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11680929 (10MLechvien-WMF) [10:32:27] 06SRE, 06Infrastructure-Foundations, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11680932 (10MLechvien-WMF) [10:34:14] FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers 
for Mediawiki mw-web releases routed via main at codfw: 16.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:34:23] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11680951 (10MoritzMuehlenhoff) [10:36:45] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2332.codfw.wmnet with reason: host reimage [10:43:09] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2332.codfw.wmnet with reason: host reimage [10:44:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:49:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:53:47] 06SRE, 06Infrastructure-Foundations, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11681015 (10MLechvien-WMF) @JMeybohm as discussed today: - Redis hosts upgrade should be done at the same time a... 
[10:56:19] 06SRE, 06Infrastructure-Foundations, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11681042 (10JMeybohm) >>! In T419212#11681013, @MLechvien-WMF wrote: > - deploy1003 > - deploy2002 I think the d... [10:58:59] 06SRE, 06Infrastructure-Foundations, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11681045 (10JMeybohm) [11:02:01] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2332.codfw.wmnet with OS trixie [11:02:20] 06SRE, 06Infrastructure-Foundations, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11681049 (10MLechvien-WMF) [11:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 15.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:05:59] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2332.codfw.wmnet [11:06:00] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2332.codfw.wmnet [11:08:19] (03CR) 10Elukey: [C:03+1] Remove pointless spec test [puppet] - 10https://gerrit.wikimedia.org/r/1248769 (owner: 10Muehlenhoff) [11:08:31] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2333.codfw.wmnet with OS trixie [11:08:58] (03PS14) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 
(https://phabricator.wikimedia.org/T417035) [11:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:09:33] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2335.codfw.wmnet with OS trixie [11:09:36] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2334.codfw.wmnet with OS trixie [11:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 13.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:16:47] (03CR) 10Btullis: [C:03+2] Update hadoop namenode JVM memory settings [puppet] - 10https://gerrit.wikimedia.org/r/1247643 (https://phabricator.wikimedia.org/T418551) (owner: 10Joal) [11:17:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 13.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:21:29] 06SRE, 10envoy, 06ServiceOps new, 10ServiceOps-Services-Oids: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11681081 (10MLechvien-WMF) @RLazarus can we close this in Q3? if not, how much effort should we factor in Q4 plan? 
[11:21:54] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2333.codfw.wmnet with reason: host reimage [11:22:40] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2334.codfw.wmnet with reason: host reimage [11:23:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:23:18] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2335.codfw.wmnet with reason: host reimage [11:26:22] 10ops-eqiad, 06DC-Ops: hw troubleshooting: CPU voltage error for an-worker1231.eqiad.wmnet - https://phabricator.wikimedia.org/T419222 (10BTullis) 03NEW [11:27:05] (03PS1) 10Majavah: P:toolforge::prometheus: Add Prometheus scrapes for Istio [puppet] - 10https://gerrit.wikimedia.org/r/1248791 (https://phabricator.wikimedia.org/T418274) [11:27:26] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2333.codfw.wmnet with reason: host reimage [11:28:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:30:39] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2334.codfw.wmnet with reason: host reimage [11:34:23] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 2:00:00 on wikikube-worker2335.codfw.wmnet with reason: host reimage [11:36:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11681212 (10BTullis) Thanks @Jclark-ctr - so this is an interesting one, as it is one of the SSDs that holds the operating system that has failed. ` Physic... [11:37:25] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [11:39:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1205.eqiad.wmnet [11:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:39:21] !log elukey@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. [11:39:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681230 (10ops-monitoring-bot) Host an-worker1205.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[11:45:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248075 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [11:45:45] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2333.codfw.wmnet with OS trixie [11:46:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247651 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [11:48:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1205.eqiad.wmnet [11:48:04] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1205 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 0 OK : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T419224 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:48:13] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419224 (10ops-monitoring-bot) 03NEW [11:49:28] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1207.eqiad.wmnet [11:49:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681259 (10ops-monitoring-bot) Host an-worker1207.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[11:50:48] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2334.codfw.wmnet with OS trixie [11:53:22] !log uploaded icu 72.1-3+deb12u1~wmf11u1 to component/php83-icu72 T419058 (backport of ICU 72 from Bookworm to Bullseye, built to be co-installable with the native ICU from Bullseye) [11:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:25] T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058 [11:54:15] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2335.codfw.wmnet with OS trixie [11:54:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1207.eqiad.wmnet [11:55:19] !log elukey@cumin1003 END (FAIL) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=99) Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. 
[11:55:58] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2333-2335].codfw.wmnet [11:56:01] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2333-2335].codfw.wmnet [11:59:00] (03PS15) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [11:59:31] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2336.codfw.wmnet with OS trixie [11:59:34] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2337.codfw.wmnet with OS trixie [11:59:46] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2338.codfw.wmnet with OS trixie [11:59:49] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2339.codfw.wmnet with OS trixie [12:00:03] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2340.codfw.wmnet with OS trixie [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T0800) [12:00:05] jelto, arnoldokoth, mutante, and arnaudb: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T1200). Please do the needful. 
[12:00:07] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2341.codfw.wmnet with OS trixie [12:02:57] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1248791 (https://phabricator.wikimedia.org/T418274) (owner: 10Majavah) [12:06:27] (03PS16) 10Elukey: Add the sre.kafka.change-confluent-distro-version cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [12:12:39] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2339.codfw.wmnet with reason: host reimage [12:12:47] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2336.codfw.wmnet with reason: host reimage [12:12:49] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2338.codfw.wmnet with reason: host reimage [12:12:59] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2337.codfw.wmnet with reason: host reimage [12:13:05] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2341.codfw.wmnet with reason: host reimage [12:13:18] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2340.codfw.wmnet with reason: host reimage [12:15:21] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Add Prometheus scrapes for Istio [puppet] - 10https://gerrit.wikimedia.org/r/1248791 (https://phabricator.wikimedia.org/T418274) (owner: 10Majavah) [12:16:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681366 (10BTullis) I've applied the BIOS settings to all hadoop workers and re-added any data disks that wer... 
[12:17:21] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065#11681371 (10BTullis) 05Open→03Resolved This has been done as part of the investigation into T415002 [12:17:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11681376 (10BTullis) 05Open→03Resolved This is fixed now. [12:18:16] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2339.codfw.wmnet with reason: host reimage [12:22:23] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2336.codfw.wmnet with reason: host reimage [12:24:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681404 (10BTullis) 05Open→03Resolved It's also worth noting that we haven't seen any rise in power c... 
[12:26:42] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2338.codfw.wmnet with reason: host reimage [12:28:29] (03PS1) 10Muehlenhoff: Add a pbuilder hook to build against the ICU72 backport [puppet] - 10https://gerrit.wikimedia.org/r/1248797 (https://phabricator.wikimedia.org/T419058) [12:30:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248797 (https://phabricator.wikimedia.org/T419058) (owner: 10Muehlenhoff) [12:31:02] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2337.codfw.wmnet with reason: host reimage [12:34:27] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2340.codfw.wmnet with reason: host reimage [12:35:32] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2339.codfw.wmnet with OS trixie [12:40:25] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2341.codfw.wmnet with reason: host reimage [12:43:15] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2336.codfw.wmnet with OS trixie [12:45:21] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2338.codfw.wmnet with OS trixie [12:48:11] (03PS1) 10Mszwarc: Add a script to send mandatory 2FA Echo notification [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248806 (https://phabricator.wikimedia.org/T419111) [12:49:54] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2337.codfw.wmnet with OS trixie [12:50:31] (03PS1) 10Joal: Update hadoop namenode JVM configuration [puppet] - 10https://gerrit.wikimedia.org/r/1248807 (https://phabricator.wikimedia.org/T418551) [12:50:36] (03Abandoned) 10Hashar: Revert^2 
"Add icons for wikibase changes. WIP" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1196681 (owner: 10Neslihan Turan) [12:51:10] (03CR) 10Btullis: [C:03+2] Update hadoop namenode JVM configuration [puppet] - 10https://gerrit.wikimedia.org/r/1248807 (https://phabricator.wikimedia.org/T418551) (owner: 10Joal) [12:51:20] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248808 (https://phabricator.wikimedia.org/T408918) [12:54:53] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2340.codfw.wmnet with OS trixie [12:56:21] (03PS2) 10Muehlenhoff: dist-upgrade: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1248463 [12:56:26] (03CR) 10Muehlenhoff: dist-upgrade: Remove support for Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248463 (owner: 10Muehlenhoff) [12:58:14] (03CR) 10Dzahn: aptrepo: add jenkins for trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [13:01:12] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2341.codfw.wmnet with OS trixie [13:05:34] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2336-2341].codfw.wmnet [13:05:38] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2336-2341].codfw.wmnet [13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:47] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2342.codfw.wmnet with OS trixie [13:06:58] !log blake@cumin1003 
START - Cookbook sre.hosts.reimage for host wikikube-worker2343.codfw.wmnet with OS trixie [13:07:11] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2344.codfw.wmnet with OS trixie [13:07:22] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2345.codfw.wmnet with OS trixie [13:07:36] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2346.codfw.wmnet with OS trixie [13:07:47] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2347.codfw.wmnet with OS trixie [13:08:53] (03PS1) 10JMeybohm: kubernetes: Don't re-define default admission_plugins [puppet] - 10https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507) [13:09:48] (03CR) 10Muehlenhoff: [C:03+2] dist-upgrade: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1248463 (owner: 10Muehlenhoff) [13:10:20] (03CR) 10Muehlenhoff: [C:03+2] Remove pointless spec test [puppet] - 10https://gerrit.wikimedia.org/r/1248769 (owner: 10Muehlenhoff) [13:10:26] (03PS2) 10JMeybohm: kubernetes: Don't re-define default admission_plugins [puppet] - 10https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507) [13:10:41] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [13:12:34] (03PS1) 10Muehlenhoff: Remove mostly obsolete spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1248814 [13:13:13] (03PS3) 10JMeybohm: kubernetes: Don't re-define default admission_plugins [puppet] - 10https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507) [13:17:10] (03CR) 10Muehlenhoff: aptrepo: add jenkins for trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [13:17:12] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1205 - 
https://phabricator.wikimedia.org/T419224#11681580 (10Jclark-ctr) a:03Jclark-ctr [13:17:50] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11681581 (10Jclark-ctr) a:03Jclark-ctr [13:19:54] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2342.codfw.wmnet with reason: host reimage [13:20:13] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2343.codfw.wmnet with reason: host reimage [13:20:38] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2344.codfw.wmnet with reason: host reimage [13:20:56] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2345.codfw.wmnet with reason: host reimage [13:20:59] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2346.codfw.wmnet with reason: host reimage [13:21:09] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2347.codfw.wmnet with reason: host reimage [13:21:49] !log Running foreachwikiindblist checkuser-suggested-investigations.dblist ~/PopulateSiuInfo.php --batch-size=1000 for T411118 [13:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:24:28] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2342.codfw.wmnet with reason: host reimage [13:25:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199) - https://phabricator.wikimedia.org/T416166#11681611 (10Jclark-ctr) 
05Open→03Resolved [13:26:33] (03PS2) 10Dzahn: aptrepo: add jenkins for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) [13:27:08] (03CR) 10Dzahn: "oh, ok! done" [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [13:28:14] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2344.codfw.wmnet with reason: host reimage [13:28:18] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419224#11681622 (10Jclark-ctr) 05Open→03Resolved duplicate to T419000 [13:29:20] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: CPU voltage error for an-worker1231.eqiad.wmnet - https://phabricator.wikimedia.org/T419222#11681629 (10Jclark-ctr) a:03Jclark-ctr [13:31:30] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2345.codfw.wmnet with reason: host reimage [13:35:01] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2343.codfw.wmnet with reason: host reimage [13:35:49] (03PS1) 10Dpogorzelski: pyrra(ML): fix updated revertrisk metric name [puppet] - 10https://gerrit.wikimedia.org/r/1248818 (https://phabricator.wikimedia.org/T419235) [13:36:06] (03CR) 10Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [13:36:20] (03PS5) 10Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) [13:37:00] (03CR) 10Brouberol: [C:03+2] deployment_server: add the kafka-mirrormaker kubeconfigs in the aux clusters [puppet] - 10https://gerrit.wikimedia.org/r/1248401 
(https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [13:38:16] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2346.codfw.wmnet with reason: host reimage [13:38:21] (03CR) 10CI reject: [V:04-1] aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [13:42:20] (03PS1) 10Vgutierrez: varnish: Set custom glb_requests_limit for thumbs [puppet] - 10https://gerrit.wikimedia.org/r/1248819 [13:42:54] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2347.codfw.wmnet with reason: host reimage [13:42:59] (03PS2) 10Vgutierrez: varnish: Set custom glb_requests_limit for thumbs [puppet] - 10https://gerrit.wikimedia.org/r/1248819 [13:43:20] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2342.codfw.wmnet with OS trixie [13:44:08] (03CR) 10Brouberol: Add the sre.kafka.change-confluent-distro-version cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [13:44:26] (03PS6) 10Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) [13:45:10] !log dreamyjazz@deploy2002 mwscript-k8s job started: foreachwikiindblist checkuser-suggested-investigations CheckUser:queueAutoCloseSICases.php # T418591 [13:45:13] T418591: Suggested Investigations: Run queueAutoCloseSICases.php on WMF production - https://phabricator.wikimedia.org/T418591 [13:46:20] (03CR) 10Vgutierrez: [V:03+1] "VTCs are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1248819 (owner: 10Vgutierrez) [13:46:28] (03CR) 10CI reject: [V:04-1] aux-k8s: define the 
kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [13:48:06] (03PS7) 10Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) [13:48:41] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2344.codfw.wmnet with OS trixie [13:49:23] (03CR) 10Muehlenhoff: aptrepo: add jenkins for trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [13:50:12] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2345.codfw.wmnet with OS trixie [13:51:22] (03PS1) 10Mszwarc: Set $wgOATH2FARequiredGroupRemovalPages for interface-admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248821 (https://phabricator.wikimedia.org/T417880) [13:52:15] (03PS1) 10JMeybohm: Remove istio 1.15 wikikube config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248822 (https://phabricator.wikimedia.org/T341984) [13:52:18] (03PS1) 10JMeybohm: Remove PSP related code from admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) [13:53:26] RECOVERY - Host an-worker1231 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [13:53:39] (03CR) 10Aqu: [C:03+1] stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248808 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [13:53:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:54:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:55:44] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2343.codfw.wmnet with OS trixie [13:57:21] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2346.codfw.wmnet with OS trixie [13:57:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248806 (https://phabricator.wikimedia.org/T419111) (owner: 10Mszwarc) [13:57:58] PROBLEM - Host an-worker1231 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [14:00:30] 10ops-eqiad, 06SRE, 06DC-Ops: Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11681841 (10Jclark-ctr) Parts have shipped SR 223498416 Should arrive today [14:00:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681847 (10Gehel) [14:01:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 
- 2026-03-27): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11681854 (10Gehel) [14:01:39] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2347.codfw.wmnet with OS trixie [14:01:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11681880 (10Gehel) [14:02:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11681886 (10Gehel) [14:02:16] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2342-2347].codfw.wmnet [14:02:20] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2342-2347].codfw.wmnet [14:03:22] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2348.codfw.wmnet with OS trixie [14:03:32] !log blake@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2348.codfw.wmnet with OS trixie [14:03:32] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2349.codfw.wmnet with OS trixie [14:03:42] !log blake@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2349.codfw.wmnet with OS trixie [14:03:45] 10SRE-SLO, 10observability, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11681912 (10Gehel) [14:04:01] 06SRE, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11681925 (10Gehel) [14:05:12] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Socket leaking on 
some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11681951 (10Gehel) [14:05:26] RECOVERY - Host an-worker1231 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [14:05:42] 07sre-alert-triage, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413#11681979 (10Gehel) [14:05:48] 07sre-alert-triage, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970#11681981 (10Gehel) [14:06:07] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2348.codfw.wmnet with OS trixie [14:06:20] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11681995 (10Gehel) [14:06:26] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2349.codfw.wmnet with OS trixie [14:06:32] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11682083 (10Gehel) [14:06:44] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2350.codfw.wmnet with OS trixie [14:06:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11682085 (10Gehel) [14:06:53] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2351.codfw.wmnet with OS trixie [14:07:11] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2352.codfw.wmnet with OS trixie [14:07:22] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2353.codfw.wmnet with OS trixie [14:07:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Follow-up: Degraded Disk Not Yet Added to RAID 
(an-worker1175, an-worker1199) - https://phabricator.wikimedia.org/T416166#11682100 (10Gehel) [14:08:57] (03CR) 10Vgutierrez: [C:03+1] "this is a NOOP on LVS, no need to restart pybal" [puppet] - 10https://gerrit.wikimedia.org/r/1247608 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [14:10:24] (03CR) 10Andrew Bogott: "The json output doesn't include the member status. That plus the weird decimal ID leaves me to believe that no one is testing/fixing/maint" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [14:10:47] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sun 22 Mar 2026 02:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [14:10:48] (03CR) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [14:13:57] (03CR) 10Muehlenhoff: [C:03+2] Remove mostly obsolete spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1248814 (owner: 10Muehlenhoff) [14:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:16:57] #wikimedia-tech [14:17:56] 06SRE, 06Infrastructure-Foundations: Create nodejs 24 production 
images - https://phabricator.wikimedia.org/T418440#11682153 (10MoritzMuehlenhoff) The images will be available after the forthcoming rebuild of the weekly images on Monday morning. [14:19:13] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2348.codfw.wmnet with reason: host reimage [14:19:28] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2349.codfw.wmnet with reason: host reimage [14:19:34] (03PS1) 10CDanis: docker-registry: lowercase path claim [puppet] - 10https://gerrit.wikimedia.org/r/1248829 [14:20:09] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2350.codfw.wmnet with reason: host reimage [14:20:30] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2353.codfw.wmnet with reason: host reimage [14:20:40] FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:40] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2352.codfw.wmnet with reason: host reimage [14:21:12] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2351.codfw.wmnet with reason: host reimage [14:21:54] (03PS1) 10Ebernhardson: semantic: Add egress policy to relforge [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248834 [14:22:52] (03PS2) 10Ebernhardson: semantic: Add egress policy to relforge [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248834 [14:23:19] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2348.codfw.wmnet with reason: host reimage [14:24:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.012s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:25:54] (03CR) 10Ebernhardson: [C:03+2] semantic: Add egress policy to relforge [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248834 (owner: 10Ebernhardson) [14:26:17] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2350.codfw.wmnet with reason: host reimage [14:26:38] (03PS1) 10Hashar: wm-checks-api: add tooltip to the CheckRun Run action [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1248835 [14:27:49] (03Merged) 10jenkins-bot: semantic: Add egress policy to relforge [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248834 (owner: 10Ebernhardson) [14:28:37] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11682215 (10TheDJ) [14:28:55] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:29:02] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11682219 (10TheDJ) [14:29:49] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [14:30:03] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2352.codfw.wmnet with reason: host reimage [14:31:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 2.172% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:31:36] (03CR) 10Andrew Bogott: "Nope, even if I run an additional 'endpoint status' call it doesn't actually include the status of the endpoint. And the json output for '" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [14:33:51] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2351.codfw.wmnet with reason: host reimage [14:34:14] FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:34:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.705s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:36:27] (03PS17) 10Elukey: Add the sre.kafka.change-confluent-distro-version cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [14:37:07] (03CR) 10Elukey: Add the sre.kafka.change-confluent-distro-version cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [14:37:14] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2349.codfw.wmnet with 
reason: host reimage [14:37:32] (03PS5) 10Daniel Kinzler: rest-gateway: add CORS support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) [14:38:39] (03PS10) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) [14:38:40] (03PS14) 10Andrew Bogott: toolforge etcdctl: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) [14:38:40] (03PS1) 10Andrew Bogott: toolforge etcdctl: remove get_cluster_health and associated rigging [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248842 [14:38:57] (03PS2) 10JMeybohm: Remove PSP related code from admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) [14:39:30] (03CR) 10Andrew Bogott: toolforge etcdctl: update cert flag names (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [14:41:03] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2353.codfw.wmnet with reason: host reimage [14:41:14] (03CR) 10Elukey: [C:03+1] "Paired with Riccardo over meet!" [software/cumin] - 10https://gerrit.wikimedia.org/r/1224035 (owner: 10Volans) [14:41:32] (03CR) 10TChin: [C:03+2] [eventgate] bump to v1.28.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248507 (https://phabricator.wikimedia.org/T409106) (owner: 10TChin) [14:41:42] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11682308 (10MatthewVernon) @Ladsgroup they're only a tiny number of files, but XCF will probably likewise need addressing? 
[14:43:03] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2348.codfw.wmnet with OS trixie [14:43:08] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [14:43:20] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [14:44:24] (03Merged) 10jenkins-bot: [eventgate] bump to v1.28.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248507 (https://phabricator.wikimedia.org/T409106) (owner: 10TChin) [14:44:56] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [14:45:02] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2350.codfw.wmnet with OS trixie [14:45:02] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [14:45:22] (03CR) 10CI reject: [V:04-1] toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [14:47:13] (03CR) 10Dillon: [C:03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [14:47:45] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [14:48:11] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [14:48:27] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:48:36] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:48:45] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2352.codfw.wmnet with OS trixie [14:49:12] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[14:49:42] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:49:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:52:19] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2351.codfw.wmnet with OS trixie [14:52:24] (03PS5) 10Arnaudb: gerrit: remove mod_qos [puppet] - 10https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) [14:52:54] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [14:53:39] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [14:56:45] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [14:57:15] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [14:57:33] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2349.codfw.wmnet with OS trixie [14:58:37] (03CR) 10JavierMonton: "I created a new patch for it: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1236258" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [14:59:22] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2353.codfw.wmnet with OS trixie [15:02:38] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2348-2353].codfw.wmnet [15:02:40] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:02:41] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool 
for host wikikube-worker[2348-2353].codfw.wmnet [15:02:54] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:03:26] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2354.codfw.wmnet with OS trixie [15:03:38] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2355.codfw.wmnet with OS trixie [15:03:50] !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2356.codfw.wmnet with OS trixie [15:05:19] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [15:05:24] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [15:05:36] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [15:06:16] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [15:08:48] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [15:08:58] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [15:09:17] (03PS1) 10Kevin Bazira: ml-services: revert embeddings isvc image to one that doesn't use AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248855 (https://phabricator.wikimedia.org/T418976) [15:09:20] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [15:10:03] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [15:11:53] (03CR) 10Dpogorzelski: [C:03+1] ml-services: revert embeddings isvc image to one that doesn't use AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248855 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [15:12:33] (03CR) 10Ozge: [C:03+2] ml-services: revert embeddings isvc image to one that doesn't use AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248855 
(https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [15:14:38] (03Merged) 10jenkins-bot: ml-services: revert embeddings isvc image to one that doesn't use AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248855 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [15:15:40] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:15:42] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [15:15:55] FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:21] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [15:16:58] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2355.codfw.wmnet with reason: host reimage [15:17:02] (03PS2) 10Andrew Bogott: toolforge etcdctl: remove get_cluster_health and associated rigging [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248842 [15:17:02] (03PS11) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) [15:17:02] (03PS15) 10Andrew Bogott: toolforge etcdctl: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) [15:17:07] (03PS3) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235827 
(https://phabricator.wikimedia.org/T360794) [15:17:18] !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2356.codfw.wmnet with reason: host reimage [15:17:23] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:17:50] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:19:19] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [15:19:57] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [15:22:05] (03PS1) 10Ebernhardson: cirrus: Use https for semanticsearch-test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248858 [15:23:21] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage [15:23:44] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [15:24:30] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [15:25:02] (03PS1) 10Kevin Bazira: ml-services: rollback embeddings isvc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248859 (https://phabricator.wikimedia.org/T418976) [15:25:58] (03CR) 10Kevin Bazira: [C:03+2] ml-services: rollback embeddings isvc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248859 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [15:26:35] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2355.codfw.wmnet with reason: host reimage [15:26:35] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [15:26:49] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [15:27:55] (03Merged) 10jenkins-bot: ml-services: rollback embeddings 
isvc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248859 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [15:28:08] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [15:28:32] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [15:28:47] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [15:30:43] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [15:31:19] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [15:31:26] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2356.codfw.wmnet with reason: host reimage [15:32:10] (03PS1) 10Muehlenhoff: puppetserver: Use the hooks from Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1248862 (https://phabricator.wikimedia.org/T365798) [15:37:25] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [15:38:40] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new, 07Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11682582 (10elukey) I finally found a way to see a decent access log from the Istio gateway. It is... 
[15:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:42:12] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2354.codfw.wmnet with OS trixie [15:42:34] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11682595 (10Aklapper) [15:44:01] (03CR) 10JHathaway: [C:03+1] puppetserver: Use the hooks from Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1248862 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:46:24] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2355.codfw.wmnet with OS trixie [15:51:50] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2356.codfw.wmnet with OS trixie [15:52:28] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2354-2356].codfw.wmnet [15:52:30] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2354-2356].codfw.wmnet [15:52:38] (03CR) 10Scott French: [C:03+1] "Thanks, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1248797 (https://phabricator.wikimedia.org/T419058) (owner: 10Muehlenhoff) [15:53:06] (03CR) 10Hashar: [C:03+1] "I am happy to help for the deployment!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248858 (owner: 10Ebernhardson) [15:54:03] (03PS1) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [15:55:04] (03CR) 10Brouberol: [C:03+2] aux-k8s: define the kafka-mirrormaker namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248404 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:56:31] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [15:56:56] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:57:38] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [15:57:52] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:58:13] (03PS2) 10JMeybohm: Remove istio 1.15 wikikube config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248822 (https://phabricator.wikimedia.org/T341984) [15:58:13] (03PS3) 10JMeybohm: Remove PSP related code from admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) [15:58:59] (03PS1) 10Tiziano Fogli: prometheus: add cardinality explosion alerts [alerts] - 10https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) [16:00:24] (03CR) 10Scott French: [C:03+1] "Thank you, Moritz! Two optional suggestions that would probably make sense to pick up at the same time, but feel free to also not." 
[puppet] - 10https://gerrit.wikimedia.org/r/1247620 (owner: 10Muehlenhoff) [16:08:54] FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:39] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new, 07Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11682705 (10elukey) For the orchestrator po, there is a decrease in traffic during the past couple... [16:13:23] (03CR) 10JMeybohm: Remove PSP related code from admin_ng (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [16:17:06] (03PS3) 10Dzahn: aptrepo: add jenkins for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) [16:23:36] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:23:39] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:30:59] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11682744 (10Volans) This host is still marked as Active in Netbox but disappeared from PuppetDB, if it's still broken please fix its status in Netbox. 
[16:33:54] FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:58] FIRING: [3x] CertAlmostExpired: Certificate for service titan2001:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:35:00] (03CR) 10Herron: "Thanks for this looks great overall! added a few optional suggestions inline." [alerts] - 10https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [16:36:28] (03CR) 10Scott French: [V:03+2] "Built and tested locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [16:37:04] (03CR) 10Muehlenhoff: [C:03+1] aptrepo: add jenkins for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:37:22] (03CR) 10Scott French: [V:03+2] "Thanks for the review!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [16:37:35] (03CR) 10Scott French: [V:03+2 C:03+2] envoy: Support using envoy-drain-tool [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [16:42:20] (03PS2) 10Tiziano Fogli: prometheus: add cardinality explosion alerts [alerts] - 10https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) [16:44:20] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248808 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [16:44:23] (03CR) 10Dzahn: [C:03+1] "a lot at once - but as far as I an tell it looks good - https://puppet-compiler.wmflabs.org/output/1248843/8233/gerrit2002.wikimedia.org/i" [puppet] - 10https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: 10Arnaudb) [16:46:44] (03PS3) 10Daniel Kinzler: rest-gateway: per-path overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) [16:46:46] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248808 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [16:46:47] (03CR) 10Dzahn: [C:03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:46:55] (03CR) 10Tiziano Fogli: "Thank you for the hint. I think having a task is a really good idea, given that the alerts eventually triggered by these rules are tempora" [alerts] - 10https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [16:46:55] (03CR) 10Arnaudb: "It is a large change indeed, sorry about that. 
to avoid unintended side effect, I'll merge this with puppet disabled on primary and replic" [puppet] - 10https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: 10Arnaudb) [16:47:45] (03CR) 10Daniel Kinzler: rest-gateway: per-path overrides (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [16:48:19] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [16:48:28] (03CR) 10Dzahn: [C:03+2] aptrepo: add jenkins for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:48:30] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [16:51:35] (03CR) 10Dzahn: [C:03+1] "the plan sounds good :)" [puppet] - 10https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: 10Arnaudb) [16:54:56] (03CR) 10Dzahn: [C:03+2] site: apply jenkins stub role on contint2003 [puppet] - 10https://gerrit.wikimedia.org/r/1248635 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:04:19] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-backup-datanode1033.eqiad.wmnet [17:04:35] (03CR) 10Hashar: [C:03+2] wm-checks-api: add tooltip to the CheckRun Run action [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1248835 (owner: 10Hashar) [17:05:24] (03Merged) 10jenkins-bot: wm-checks-api: add tooltip to the CheckRun Run action [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1248835 (owner: 10Hashar) [17:05:44] !log hashar@deploy2002 Started deploy [gerrit/gerrit@b8183ba]: wm-checks-api: add tooltip to the CheckRun Run action [17:05:57] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@b8183ba]: wm-checks-api: add 
tooltip to the CheckRun Run action (duration: 00m 13s) [17:10:07] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [17:10:07] (03PS2) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [17:11:51] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:11:55] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:15:47] btullis@cumin1003 decommission (PID 1447184) is awaiting input [17:24:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:29:29] (03PS1) 10Dzahn: jenkins: set the CI manager host in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1248892 (https://phabricator.wikimedia.org/T418521) [17:29:55] (03CR) 10Dzahn: [C:03+2] jenkins: set the CI manager host in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1248892 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:33:05] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11682990 (10RobH) Set to failed. Ordering of replacement part will take place today or tomorrow, sorting the terms. 
[17:40:23] (03PS3) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [17:40:51] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:40:55] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:41:56] (03PS4) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [17:42:13] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:42:17] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:42:53] (03PS1) 10Medelius: Suggestion mode: update link for suggestion feedback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248894 [17:46:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248858 (owner: 10Ebernhardson) [17:47:19] (03Merged) 10jenkins-bot: cirrus: Use https for semanticsearch-test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248858 (owner: 10Ebernhardson) [17:47:43] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1248858|cirrus: Use https for semanticsearch-test cluster]] [17:50:23] (03PS1) 10JHathaway: mailman: disable web posting [puppet] - 10https://gerrit.wikimedia.org/r/1248895 (https://phabricator.wikimedia.org/T386559) [17:50:42] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683041 (10jhathaway) @bd808 based on the User-Agent, `User-Agent: HyperKitty on https://lists.wikime... 
[17:50:59] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248895 (https://phabricator.wikimedia.org/T386559) (owner: 10JHathaway) [17:51:03] (03PS5) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [17:51:28] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:51:32] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:51:46] (03PS1) 10Scott French: mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) [17:52:25] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1248858|cirrus: Use https for semanticsearch-test cluster]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:53:35] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [17:54:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:54:14] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:55:10] (03CR) 10Majavah: [C:03+1] mailman: disable web posting [puppet] - 10https://gerrit.wikimedia.org/r/1248895 (https://phabricator.wikimedia.org/T386559) (owner: 10JHathaway) [17:59:03] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248858|cirrus: Use https for semanticsearch-test cluster]] (duration: 11m 20s) [18:11:15] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683076 (10A_smart_kitten) >>! In T386559#11683041, @jhathaway wrote: > As briefly discussed, I think... 
[18:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:23:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:28:20] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-backup-datanode1033.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [18:28:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:29:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-backup-datanode1033.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [18:29:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:29:31] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-backup-datanode1033.eqiad.wmnet [18:29:45] 10ops-eqiad, 06SRE, 06DC-Ops: Decommission the 46 hadoop workers and 2 namenode servers that were planned for the hadoop-backup cluster - https://phabricator.wikimedia.org/T404970#11683137 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by 
btullis@cumin1003 for hosts: `an-backup-datanode... [18:41:09] (03CR) 10Ebernhardson: [C:03+1] "I don't know about the full implementation, but we are having an issue with opensearch being very strict about how an x-request-id is form" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) (owner: 10Btullis) [18:44:43] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp2043.* [18:45:22] (03PS1) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1248906 (https://phabricator.wikimedia.org/T419265) [18:49:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:56:22] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2043.codfw.wmnet with reason: troubleshooting for network drops [18:59:17] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1248906 (https://phabricator.wikimedia.org/T419265) (owner: 10SBassett) [19:00:12] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: CPU voltage error for an-worker1231.eqiad.wmnet - https://phabricator.wikimedia.org/T419222#11683236 (10Jclark-ctr) 05Open→03Resolved performed flea power drain updated bios and did 2 reboots with out the issue returning [19:02:42] (03PS1) 10Zabe: Use Hadoop for Mostcategories on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248909 (https://phabricator.wikimedia.org/T413362) [19:10:41] (03CR) 10Scott French: [C:03+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1248906 (https://phabricator.wikimedia.org/T419265) (owner: 10SBassett) [19:11:41] (03CR) 10Dzahn: 
[C:03+2] "[apt1002:/srv/wikimedia] $ sudo -i reprepro -C thirdparty/jenkins checkupdate trixie-wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:12:05] (03PS1) 10Zabe: Stop writing to il_to on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248911 (https://phabricator.wikimedia.org/T415787) [19:17:43] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on wdqs2009.codfw.wmnet with reason: NFS might be hung, about to reboot [19:20:40] FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:23:28] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2009.codfw.wmnet [19:29:16] (03CR) 10Dduvall: [C:03+1] docker-registry: lowercase path claim [puppet] - 10https://gerrit.wikimedia.org/r/1248829 (owner: 10CDanis) [19:29:25] (03CR) 10CDanis: [C:03+2] docker-registry: lowercase path claim [puppet] - 10https://gerrit.wikimedia.org/r/1248829 (owner: 10CDanis) [19:35:59] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683365 (10jhathaway) >>! In T386559#11683076, @A_smart_kitten wrote: >>>! In T386559#11683041, @jhat... 
[19:36:37] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host wdqs2009.codfw.wmnet [19:37:25] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [19:38:55] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683374 (10A_smart_kitten) (To be clear, personally, I could probably try and configure my email setu... [19:40:43] (03CR) 10Scott French: "Thanks in advance for the review, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [19:41:38] 06SRE, 06serviceops-deprecated, 10WMDE-TechWish-Maintenance, 10Maps (Geoshapes), 07Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388#11683390 (10Jdforrester-WMF) 05Open→03Declined Codebase is being archived, see T418372. 
[19:46:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:47:04] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:47:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:48:24] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:04:05] (03CR) 10Michael Große: [C:03+1] [Growth] Enable on every new Wikipedia by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239954 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [20:11:44] (03CR) 10RLazarus: [C:03+1] mw-debug: Pilot new drain configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [20:29:35] (03CR) 10Scott French: "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [20:34:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:34:14] FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:36:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11683594 (10Jgreen) a:05Jgreen→03None [20:48:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:42:49] 10ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298 (10phaultfinder) 03NEW [21:47:50] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683842 (10bd808) >>! In T386559#11683374, @A_smart_kitten wrote: > (To be clear, personally, I could... [21:54:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:54:14] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:58:39] (03Abandoned) 10Clare Ming: Remove Metrics Platform config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248695 (https://phabricator.wikimedia.org/T417568) (owner: 10Clare Ming) [22:00:47] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11683855 (10herron) 
[22:01:58] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove stale legacy-full-gui release entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248694 (https://phabricator.wikimedia.org/T415073) (owner: 10Ryan Kemper) [22:04:01] (03Merged) 10jenkins-bot: wdqs: remove stale legacy-full-gui release entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248694 (https://phabricator.wikimedia.org/T415073) (owner: 10Ryan Kemper) [22:08:35] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683877 (10A_smart_kitten) >>! In T386559#11683842, @bd808 wrote: > Are you using the web UI because... [22:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:23:51] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683912 (10jhathaway) if `/message/new` is the correct route, here is the count of usage from 03-05:... [22:26:13] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11683914 (10bd808) [22:32:47] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11683933 (10A_smart_kitten) >>! In T386559#11683912, @jhathaway wrote: > if `/me... 
[22:39:49] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2009.codfw.wmnet with OS bullseye [22:40:20] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2009 [22:40:31] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11683946 (10jhathaway) >>! In T386559#11683933, @A_smart_kitten wrote: >>>! In T... [22:41:27] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [22:45:51] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2009 - ryankemper@cumin2002" [22:45:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2009 - ryankemper@cumin2002" [22:45:57] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:45:57] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs2009.codfw.wmnet 141.0.192.10.in-addr.arpa 1.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:46:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2009.codfw.wmnet 141.0.192.10.in-addr.arpa 1.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:46:02] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2009 [22:46:14] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2009 [22:46:15] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2009 [23:07:42] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 
2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage [23:13:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage [23:20:40] FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:29:43] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2009.codfw.wmnet with OS bullseye [23:37:25] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown