[00:01:08] <logmsgbot>	 !log catrope@deploy2002 catrope, kharlan: Continuing with sync
[00:03:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface pfw1-codfw:reth1 () - https://phabricator.wikimedia.org/T419150#11679873 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm limited to maintenance window for frack mgmt vlan migration. no new alerts since.
[00:05:03] <logmsgbot>	 !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248628|Re-enable AllowUserJs (T419137)]] (duration: 08m 08s)
[00:13:53] <wikibugs>	 (03PS2) 10Jasmine: wmnet: add sophroid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748)
[00:13:59] <wikibugs>	 (03PS1) 10Dzahn: aptrepo: add jenkins for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521)
[00:17:35] <zabe>	 Dreamy_Jazz: are you folks done with deploying?
[00:19:59] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11679905 (10VRiley-WMF) Re-labeled moss-fe1002 to contint1003 Racked it in B1 U36 CableID: 3720 Port: 28  This unit...
[00:25:04] <cdanis>	 zabe: yeah
[00:25:20] <zabe>	 Alright:)
[00:25:52] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Stop writing to il_to on all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248493 (https://phabricator.wikimedia.org/T415787) (owner: 10Zabe)
[00:26:44] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to il_to on all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248493 (https://phabricator.wikimedia.org/T415787) (owner: 10Zabe)
[00:27:07] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248493|Stop writing to il_to on all wikis except commons (T415787)]]
[00:27:11] <stashbot>	 T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787
[00:28:58] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1248493|Stop writing to il_to on all wikis except commons (T415787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:29:36] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[00:33:29] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248493|Stop writing to il_to on all wikis except commons (T415787)]] (duration: 06m 22s)
[00:33:33] <stashbot>	 T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787
[00:39:22] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1248646
[00:39:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1248646 (owner: 10TrainBranchBot)
[00:44:14] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:53:03] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1248646 (owner: 10TrainBranchBot)
[00:59:14] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:06:03] <wikibugs>	 (03PS1) 10Zabe: Prepare kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248647 (https://phabricator.wikimedia.org/T414234)
[01:08:16] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Prepare kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248647 (https://phabricator.wikimedia.org/T414234) (owner: 10Zabe)
[01:09:05] <wikibugs>	 (03Merged) 10jenkins-bot: Prepare kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248647 (https://phabricator.wikimedia.org/T414234) (owner: 10Zabe)
[01:09:21] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1248648
[01:09:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1248648 (owner: 10TrainBranchBot)
[01:09:42] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248647|Prepare kaiwiki (T414234)]]
[01:09:46] <stashbot>	 T414234: Create Wikipedia Karai-Karai - https://phabricator.wikimedia.org/T414234
[01:11:49] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1248647|Prepare kaiwiki (T414234)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:13:13] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[01:13:46] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[01:15:11] <jinxer-wm>	 FIRING: PfwCoreBGPDown: ...
[01:15:17] <jinxer-wm>	 Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[01:17:07] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248647|Prepare kaiwiki (T414234)]] (duration: 07m 25s)
[01:17:11] <stashbot>	 T414234: Create Wikipedia Karai-Karai - https://phabricator.wikimedia.org/T414234
[01:18:54] <jinxer-wm>	 FIRING: [5x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:18:59] <jinxer-wm>	 FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:20:26] <wikibugs>	 (03PS1) 10Zabe: Activate kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248650 (https://phabricator.wikimedia.org/T414234)
[01:21:08] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Activate kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248650 (https://phabricator.wikimedia.org/T414234) (owner: 10Zabe)
[01:22:06] <wikibugs>	 (03Merged) 10jenkins-bot: Activate kaiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248650 (https://phabricator.wikimedia.org/T414234) (owner: 10Zabe)
[01:22:24] <wikibugs>	 (03CR) 10Eevans: "There was a suggestion to use the python-webapp chart, but this uses aqs-http-gateway because a) all of the other Cassandra-connected serv" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248148 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[01:22:26] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248650|Activate kaiwiki (T414234)]]
[01:22:30] <stashbot>	 T414234: Create Wikipedia Karai-Karai - https://phabricator.wikimedia.org/T414234
[01:23:46] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[01:23:54] <jinxer-wm>	 FIRING: [5x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:23:59] <jinxer-wm>	 FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:24:19] <wikibugs>	 (03CR) 10Scardenasmolinar: [C:03+1] Enable rr-ml AutoModerator CC Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[01:24:22] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1248650|Activate kaiwiki (T414234)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:24:46] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[01:25:11] <jinxer-wm>	 RESOLVED: PfwCoreBGPDown: ...
[01:25:17] <jinxer-wm>	 Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.202) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown
[01:25:31] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[01:27:39] <wikibugs>	 (03PS1) 10Zabe: Prepare urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248652 (https://phabricator.wikimedia.org/T415960)
[01:28:06] <wikibugs>	 (03PS1) 10Zabe: Activate urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248653 (https://phabricator.wikimedia.org/T415960)
[01:29:24] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248650|Activate kaiwiki (T414234)]] (duration: 06m 57s)
[01:29:27] <stashbot>	 T414234: Create Wikipedia Karai-Karai - https://phabricator.wikimedia.org/T414234
[01:29:46] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[01:30:05] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Prepare urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248652 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe)
[01:30:25] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1248648 (owner: 10TrainBranchBot)
[01:30:59] <wikibugs>	 (03Merged) 10jenkins-bot: Prepare urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248652 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe)
[01:32:18] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248652|Prepare urwikisource (T415960)]]
[01:32:22] <stashbot>	 T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960
[01:34:20] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1248652|Prepare urwikisource (T415960)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:34:41] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[01:37:50] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Activate urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248653 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe)
[01:38:36] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248652|Prepare urwikisource (T415960)]] (duration: 06m 18s)
[01:38:40] <wikibugs>	 (03Merged) 10jenkins-bot: Activate urwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248653 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe)
[01:38:40] <stashbot>	 T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960
[01:40:17] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] "LGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French)
[01:42:01] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248653|Activate urwikisource (T415960)]]
[01:43:57] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1248653|Activate urwikisource (T415960)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:44:01] <stashbot>	 T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960
[01:45:30] <logmsgbot>	 !log zabe@deploy2002 Sync cancelled.
[01:47:38] <wikibugs>	 (03PS1) 10Zabe: Set urwikisource to rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248656 (https://phabricator.wikimedia.org/T415960)
[01:48:05] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Set urwikisource to rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248656 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe)
[01:48:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Set urwikisource to rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248656 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe)
[01:52:13] <zabe>	 ^ CI fails due to a self-resolving conflict. It tests if the live rtl wikis match the dblist but now it complains that I am adding a wiki which is not live (obviously).
[01:52:18] <wikibugs>	 (03CR) 10Zabe: [V:03+2 C:03+2] Set urwikisource to rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248656 (https://phabricator.wikimedia.org/T415960) (owner: 10Zabe)
[01:52:59] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248656|Set urwikisource to rtl (T415960)]]
[01:53:03] <stashbot>	 T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960
[01:54:54] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1248656|Set urwikisource to rtl (T415960)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:55:44] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[01:56:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:59:38] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248656|Set urwikisource to rtl (T415960)]] (duration: 06m 39s)
[01:59:41] <stashbot>	 T415960: Create Wikisource Urdu - https://phabricator.wikimedia.org/T415960
[02:00:36] <wikibugs>	 (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248658
[02:00:36] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248658 (owner: 10Zabe)
[02:00:50] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:01:26] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248658 (owner: 10Zabe)
[02:08:54] <jinxer-wm>	 FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:13] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 23s)
[02:09:27] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248658|Update interwiki cache]]
[02:11:21] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1248658|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:12:07] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[02:12:30] <logmsgbot>	 !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https  # T415978, T414241
[02:12:34] <stashbot>	 T415978: Add Wikidata support for urwikisource - https://phabricator.wikimedia.org/T415978
[02:12:35] <stashbot>	 T414241: Add Wikidata support for kaiwiki - https://phabricator.wikimedia.org/T414241
[02:14:14] <jinxer-wm>	 FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:15:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:16:06] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248658|Update interwiki cache]] (duration: 06m 38s)
[02:20:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:44] <zabe>	 !log zabe@deploy2002:/srv/mediawiki-staging$ foreachwiki extensions/TimedMediaHandler/maintenance/migrateTranscodeStates.php --force # T415064
[02:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:47] <stashbot>	 T415064: Backfill new status and touched columns - https://phabricator.wikimedia.org/T415064
[02:33:54] <jinxer-wm>	 FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:54] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:56:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[02:59:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change frack mgmt vlan interface - pt1979@cumin2002"
[02:59:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change frack mgmt vlan interface - pt1979@cumin2002"
[02:59:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:15:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:20:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:23:54] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:37:25] <jinxer-wm>	 FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown  - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[03:39:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:53:58] <wikibugs>	 (03PS1) 10Catrope: Drop $wgOATHUserHandlesTable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248663 (https://phabricator.wikimedia.org/T416544)
[03:56:50] <wikibugs>	 (03PS1) 10Catrope: beta: Enable passwordless login on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248664 (https://phabricator.wikimedia.org/T419198)
[03:56:51] <wikibugs>	 (03PS1) 10Catrope: Enable passwordless login in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248665 (https://phabricator.wikimedia.org/T419198)
[03:57:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248664 (https://phabricator.wikimedia.org/T419198) (owner: 10Catrope)
[03:58:08] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Enable passwordless login on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248664 (https://phabricator.wikimedia.org/T419198) (owner: 10Catrope)
[04:15:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:20:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:36:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11680410 (10cmooney) Thinking it through, I think this process could be used to "drain" //ssw1-d1// if we wanted to attempt this without any impact to hosts.  * Set //c...
[05:23:54] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:38:31] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: cleanup code related to query-legacy-full.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1247933 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel)
[05:39:32] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: remove query-legacy-full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel)
[05:39:43] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247933 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel)
[05:43:26] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: cleanup code related to query-legacy-full.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1247933 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel)
[05:56:03] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove query-legacy-full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel)
[05:56:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:58:07] <wikibugs>	 (03Merged) 10jenkins-bot: wdqs: remove query-legacy-full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel)
[06:14:14] <jinxer-wm>	 FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:22:21] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[06:22:36] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[06:23:12] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[06:23:20] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[06:23:35] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[06:23:38] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[06:26:42] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: remove stale legacy-full-gui release entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248694 (https://phabricator.wikimedia.org/T415073)
[06:28:27] <wikibugs>	 (03PS1) 10Clare Ming: Remove Metrics Platform config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248695 (https://phabricator.wikimedia.org/T417568)
[06:34:14] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:41:53] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1248127 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[06:43:04] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] ci::website: support 2 different websites, integration vs zuul-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1248118 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[06:49:28] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:56:52] <wikibugs>	 (03CR) 10Muehlenhoff: "Yeah, sure. I'll roll this out next week with Puppet disabled on C:profile::bird::anycast" [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T0700)
[07:00:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Create component/php83-icu72 [puppet] - 10https://gerrit.wikimedia.org/r/1248701 (https://phabricator.wikimedia.org/T419058)
[07:23:54] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:26:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix file existence check for Promtheus exporter [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/1248702 (https://phabricator.wikimedia.org/T419166)
[07:31:18] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Weird, you'd think systemd would complain.... Looks good" [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/1248702 (https://phabricator.wikimedia.org/T419166) (owner: 10Muehlenhoff)
[07:36:51] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Fix file existence check for Promtheus exporter [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/1248702 (https://phabricator.wikimedia.org/T419166) (owner: 10Muehlenhoff)
[07:37:25] <jinxer-wm>	 FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown  - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[07:39:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:45:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T0800)
[08:00:30] <wikibugs>	 (03CR) 10Elukey: [C:03+2] elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper)
[08:10:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:34] <moritzm>	 !log imported prometheus-ganeti-exporter 0.3+deb12u2 for bookworm-wikimedia T419166
[08:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:38] <stashbot>	 T419166: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166
[08:13:32] <wikibugs>	 06SRE, 13Patch-For-Review, 07Security: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166#11680569 (10MoritzMuehlenhoff) Thanks for the report, the impact for the Prometheus exporter is harmless, the incorrect test only prevented some log spam on Ganeti servers which are...
[08:15:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:24] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11680574 (10MoritzMuehlenhoff)
[08:20:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:52] <wikibugs>	 (PS1) Elukey: profile::pyrra: rework wdqs availability SLOs [puppet] - https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966)
[08:21:54] <wikibugs>	 (PS1) Elukey: profile::pyrra: remove old wdqs SLO configs [puppet] - https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966)
[08:25:03] <moritzm>	 !log uploaded openjdk-8 8u482-ga-1~deb12u1 to component/jdk8 of bookworm-wikimedia
[08:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:10] <wikibugs>	 (CR) Elukey: "Hi Ryan! I reworked a little your patches in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1248760, lemme know your thoughts. The b" [puppet] - https://gerrit.wikimedia.org/r/1235891 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[08:42:12] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680597 (elukey) The UsbPorts setting was added by Riccardo in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/748741 in 202...
[08:42:54] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-fe1013.eqiad.wmnet
[08:45:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:25] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680622 (MatthewVernon) FWIW, I have no objection to your doing so.
[08:54:35] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-fe1013.eqiad.wmnet
[08:55:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:25] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-fe1013.eqiad.wmnet
[08:56:27] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680630 (elukey) Of course the BIOS differs too :D   ` >>> r = spicerack.redfish('ms-fe1013') Management Password:  >>> r.bios_versi...
[08:57:03] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1013.eqiad.wmnet
[09:05:57] <wikibugs>	 (PS1) Tiziano Fogli: thanos/rec_rules: adjust tsdb_head_series:zscore to use nested rec rules [puppet] - https://gerrit.wikimedia.org/r/1248766 (https://phabricator.wikimedia.org/T415317)
[09:08:37] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1013.eqiad.wmnet
[09:08:39] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-fe1013.eqiad.wmnet
[09:09:33] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:10:24] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:11:04] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: ms-fe1013 reports a backplane error - https://phabricator.wikimedia.org/T419010#11680651 (elukey) Open→Resolved Thanks!
[09:11:52] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680654 (elukey) Applied bios and idrac update to ms-fe1013, then re-run provisioning:  ` Skipped set of attribute BIOS.Setup.1-1 ->...
[09:18:25] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680679 (elukey)
[09:21:32] <wikibugs>	 (CR) Gehel: [C:+1] wdqs: remove stale legacy-full-gui release entry [deployment-charts] - https://gerrit.wikimedia.org/r/1248694 (https://phabricator.wikimedia.org/T415073) (owner: Ryan Kemper)
[09:23:22] <logmsgbot>	 !log derick@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=pmswiki --logwiki=metawiki Wikilimes Limes.pink  # T419184
[09:23:26] <stashbot>	 T419184: Unblock stuck global rename of Limes.pink - https://phabricator.wikimedia.org/T419184
[09:24:14] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:24:26] <wikibugs>	 (Abandoned) Itamar Givon: Add new key generated with a security key [puppet] - https://gerrit.wikimedia.org/r/921356 (https://phabricator.wikimedia.org/T337037) (owner: Itamar Givon)
[09:26:47] <wikibugs>	 (CR) Dpogorzelski: [C:+2] ml-services: bump llm limitranges to enable embeddings isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1248531 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[09:27:47] <wikibugs>	 (CR) Muehlenhoff: aptrepo: add jenkins for trixie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[09:28:03] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Create component/php83-icu72 [puppet] - https://gerrit.wikimedia.org/r/1248701 (https://phabricator.wikimedia.org/T419058) (owner: Muehlenhoff)
[09:28:05] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] "Manually tested with promtool, I'm self-merging." [puppet] - https://gerrit.wikimedia.org/r/1248766 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[09:28:52] <moritzm>	 tappof: shall I merge your patch along?
[09:29:05] <tappof>	 yes thx
[09:29:11] <tappof>	 moritzm: 
[09:30:55] <wikibugs>	 (PS1) Muehlenhoff: mcrounter: Run spec tests on bookworm [puppet] - https://gerrit.wikimedia.org/r/1248768
[09:31:42] <wikibugs>	 (PS1) Muehlenhoff: Remove pointless spec test [puppet] - https://gerrit.wikimedia.org/r/1248769
[09:31:56] <moritzm>	 tappof: merged
[09:34:20] <tappof>	 thx moritzm 
[09:34:55] <wikibugs>	 (CR) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[09:35:18] <wikibugs>	 (Merged) jenkins-bot: ml-services: bump llm limitranges to enable embeddings isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1248531 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[09:39:55] <Emperor>	 !log repool ms-fe1013 after PXE work T401966
[09:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:59] <stashbot>	 T401966: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966
[09:41:50] <wikibugs>	 (PS1) Btullis: Grant members of analytics-product-users access to an-coord hosts [puppet] - https://gerrit.wikimedia.org/r/1248771 (https://phabricator.wikimedia.org/T419167)
[09:42:33] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248771 (https://phabricator.wikimedia.org/T419167) (owner: Btullis)
[09:49:51] <wikibugs>	 SRE, ServiceOps-Mediawiki, ServiceOps new (Next quarter): Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200#11680749 (MLechvien-WMF)
[09:53:51] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[09:53:54] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:56:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:26] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:59:44] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:00:58] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change [deployment-charts] - https://gerrit.wikimedia.org/r/1248539 (https://phabricator.wikimedia.org/T360794) (owner: JavierMonton)
[10:02:53] <wikibugs>	 (Merged) jenkins-bot: stream: mw-page-html-content-change [deployment-charts] - https://gerrit.wikimedia.org/r/1248539 (https://phabricator.wikimedia.org/T360794) (owner: JavierMonton)
[10:03:48] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2332-2356].codfw.wmnet
[10:04:37] <wikibugs>	 (CR) JMeybohm: [C:-1] aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[10:06:07] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206 (BTullis) NEW
[10:06:11] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11680776 (BTullis)
[10:09:00] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:09:06] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1199.eqiad.wmnet
[10:09:09] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[10:09:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11680788 (ops-monitoring-bot) Host an-worker1199.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[10:13:43] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1194.eqiad.wmnet
[10:14:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11680818 (ops-monitoring-bot) Host an-worker1194.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[10:14:14] <jinxer-wm>	 FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:15:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:15:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:16:28] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11680838 (elukey) Me and Matthew will coordinate next week to upgrade the ms-be and thanos-be hosts one at the time next week :)
[10:16:46] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2332-2356].codfw.wmnet
[10:19:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:21:09] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1194.eqiad.wmnet
[10:23:06] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1199.eqiad.wmnet
[10:23:21] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2332.codfw.wmnet with OS trixie
[10:24:37] <wikibugs>	 SRE, Infrastructure-Foundations, ServiceOps-Upgrades-Hardware, Epic, ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212 (MLechvien-WMF) NEW
[10:24:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:32:01] <wikibugs>	 SRE, Infrastructure-Foundations, ServiceOps-Upgrades-Hardware, ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11680929 (MLechvien-WMF)
[10:32:27] <wikibugs>	 SRE, Infrastructure-Foundations, ServiceOps-Upgrades-Hardware, ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11680932 (MLechvien-WMF)
[10:34:14] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:34:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 16.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:34:23] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11680951 (MoritzMuehlenhoff)
[10:36:45] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2332.codfw.wmnet with reason: host reimage
[10:43:09] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2332.codfw.wmnet with reason: host reimage
[10:44:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:49:43] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:53:47] <wikibugs>	 SRE, Infrastructure-Foundations, ServiceOps-Upgrades-Hardware, ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11681015 (MLechvien-WMF) @JMeybohm as discussed today:  - Redis hosts upgrade should be done at the same time a...
[10:56:19] <wikibugs>	 SRE, Infrastructure-Foundations, ServiceOps-Upgrades-Hardware, ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11681042 (JMeybohm) >>! In T419212#11681013, @MLechvien-WMF wrote: > - deploy1003 > - deploy2002  I think the d...
[10:58:59] <wikibugs>	 SRE, Infrastructure-Foundations, ServiceOps-Upgrades-Hardware, ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11681045 (JMeybohm)
[11:02:01] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2332.codfw.wmnet with OS trixie
[11:02:20] <wikibugs>	 SRE, Infrastructure-Foundations, ServiceOps-Upgrades-Hardware, ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11681049 (MLechvien-WMF)
[11:04:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 15.78% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:05:59] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2332.codfw.wmnet
[11:06:00] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2332.codfw.wmnet
[11:08:19] <wikibugs>	 (CR) Elukey: [C:+1] Remove pointless spec test [puppet] - https://gerrit.wikimedia.org/r/1248769 (owner: Muehlenhoff)
[11:08:31] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2333.codfw.wmnet with OS trixie
[11:08:58] <wikibugs>	 (PS14) Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035)
[11:09:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:09:33] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2335.codfw.wmnet with OS trixie
[11:09:36] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2334.codfw.wmnet with OS trixie
[11:12:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 13.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:16:47] <wikibugs>	 (CR) Btullis: [C:+2] Update hadoop namenode JVM memory settings [puppet] - https://gerrit.wikimedia.org/r/1247643 (https://phabricator.wikimedia.org/T418551) (owner: Joal)
[11:17:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 13.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:21:29] <wikibugs>	 SRE, envoy, ServiceOps new, ServiceOps-Services-Oids: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11681081 (MLechvien-WMF) @RLazarus can we close this in Q3? if not, how much effort should we factor in Q4 plan?
[11:21:54] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2333.codfw.wmnet with reason: host reimage
[11:22:40] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2334.codfw.wmnet with reason: host reimage
[11:23:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:23:18] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2335.codfw.wmnet with reason: host reimage
[11:26:22] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: CPU voltage error for an-worker1231.eqiad.wmnet - https://phabricator.wikimedia.org/T419222 (BTullis) NEW
[11:27:05] <wikibugs>	 (PS1) Majavah: P:toolforge::prometheus: Add Prometheus scrapes for Istio [puppet] - https://gerrit.wikimedia.org/r/1248791 (https://phabricator.wikimedia.org/T418274)
[11:27:26] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2333.codfw.wmnet with reason: host reimage
[11:28:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:30:39] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2334.codfw.wmnet with reason: host reimage
[11:34:23] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2335.codfw.wmnet with reason: host reimage
[11:36:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11681212 (BTullis) Thanks @Jclark-ctr - so this is an interesting one, as it is one of the SSDs that holds the operating system that has failed. ` Physic...
[11:37:25] <jinxer-wm>	 FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown  - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[11:39:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1205.eqiad.wmnet
[11:39:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:39:21] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution.
[11:39:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681230 (ops-monitoring-bot) Host an-worker1205.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[11:45:18] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1248075 (https://phabricator.wikimedia.org/T415902) (owner: Mmartorana)
[11:45:45] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2333.codfw.wmnet with OS trixie
[11:46:10] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247651 (https://phabricator.wikimedia.org/T415902) (owner: Mmartorana)
[11:48:03] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1205.eqiad.wmnet
[11:48:04] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1205 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 0 OK : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T419224 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[11:48:13] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419224 (ops-monitoring-bot) NEW
[11:49:28] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1207.eqiad.wmnet
[11:49:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681259 (ops-monitoring-bot) Host an-worker1207.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo...
[11:50:48] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2334.codfw.wmnet with OS trixie
[11:53:22] <moritzm>	 !log uploaded icu 72.1-3+deb12u1~wmf11u1 to component/php83-icu72 T419058 (backport of ICU 72 from Bookworm to Bullseye, built to be co-installable with the native ICU from Bullseye)
[11:53:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:25] <stashbot>	 T419058: Prepare packages and production images for ICU 72 upgrade - https://phabricator.wikimedia.org/T419058
[11:54:15] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2335.codfw.wmnet with OS trixie
[11:54:50] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1207.eqiad.wmnet
[11:55:19] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=99) Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution.
[11:55:58] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2333-2335].codfw.wmnet
[11:56:01] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2333-2335].codfw.wmnet
[11:59:00] <wikibugs>	 (PS15) Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035)
[11:59:31] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2336.codfw.wmnet with OS trixie
[11:59:34] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2337.codfw.wmnet with OS trixie
[11:59:46] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2338.codfw.wmnet with OS trixie
[11:59:49] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2339.codfw.wmnet with OS trixie
[12:00:03] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2340.codfw.wmnet with OS trixie
[12:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T0800)
[12:00:05] <jouncebot>	 jelto, arnoldokoth, mutante, and arnaudb: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T1200). Please do the needful.
[12:00:07] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2341.codfw.wmnet with OS trixie
[12:02:57] <wikibugs>	 (CR) FNegri: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1248791 (https://phabricator.wikimedia.org/T418274) (owner: Majavah)
[12:06:27] <wikibugs>	 (PS16) Elukey: Add the sre.kafka.change-confluent-distro-version cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035)
[12:12:39] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2339.codfw.wmnet with reason: host reimage
[12:12:47] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2336.codfw.wmnet with reason: host reimage
[12:12:49] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2338.codfw.wmnet with reason: host reimage
[12:12:59] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2337.codfw.wmnet with reason: host reimage
[12:13:05] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2341.codfw.wmnet with reason: host reimage
[12:13:18] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2340.codfw.wmnet with reason: host reimage
[12:15:21] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge::prometheus: Add Prometheus scrapes for Istio [puppet] - https://gerrit.wikimedia.org/r/1248791 (https://phabricator.wikimedia.org/T418274) (owner: Majavah)
[12:16:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681366 (BTullis) I've applied the BIOS settings to all hadoop workers and re-added any data disks that wer...
[12:17:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065#11681371 (BTullis) Open→Resolved This has been done as part of the investigation into T415002
[12:17:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11681376 (BTullis) Open→Resolved This is fixed now.
[12:18:16] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2339.codfw.wmnet with reason: host reimage
[12:22:23] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2336.codfw.wmnet with reason: host reimage
[12:24:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681404 (BTullis) Open→Resolved It's also worth noting that we haven't seen any rise in power c...
[12:26:42] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2338.codfw.wmnet with reason: host reimage
[12:28:29] <wikibugs>	 (PS1) Muehlenhoff: Add a pbuilder hook to build against the ICU72 backport [puppet] - https://gerrit.wikimedia.org/r/1248797 (https://phabricator.wikimedia.org/T419058)
[12:30:31] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248797 (https://phabricator.wikimedia.org/T419058) (owner: Muehlenhoff)
[12:31:02] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2337.codfw.wmnet with reason: host reimage
[12:34:27] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2340.codfw.wmnet with reason: host reimage
[12:35:32] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2339.codfw.wmnet with OS trixie
[12:40:25] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2341.codfw.wmnet with reason: host reimage
[12:43:15] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2336.codfw.wmnet with OS trixie
[12:45:21] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2338.codfw.wmnet with OS trixie
[12:48:11] <wikibugs>	 (PS1) Mszwarc: Add a script to send mandatory 2FA Echo notification [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1248806 (https://phabricator.wikimedia.org/T419111)
[12:49:54] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2337.codfw.wmnet with OS trixie
[12:50:31] <wikibugs>	 (PS1) Joal: Update hadoop namenode JVM configuration [puppet] - https://gerrit.wikimedia.org/r/1248807 (https://phabricator.wikimedia.org/T418551)
[12:50:36] <wikibugs>	 (Abandoned) Hashar: Revert^2 "Add icons for wikibase changes. WIP" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - https://gerrit.wikimedia.org/r/1196681 (owner: Neslihan Turan)
[12:51:10] <wikibugs>	 (CR) Btullis: [C:+2] Update hadoop namenode JVM configuration [puppet] - https://gerrit.wikimedia.org/r/1248807 (https://phabricator.wikimedia.org/T418551) (owner: Joal)
[12:51:20] <wikibugs>	 (PS1) JavierMonton: stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1248808 (https://phabricator.wikimedia.org/T408918)
[12:54:53] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2340.codfw.wmnet with OS trixie
[12:56:21] <wikibugs>	 (PS2) Muehlenhoff: dist-upgrade: Remove support for Buster [puppet] - https://gerrit.wikimedia.org/r/1248463
[12:56:26] <wikibugs>	 (CR) Muehlenhoff: dist-upgrade: Remove support for Buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1248463 (owner: Muehlenhoff)
[12:58:14] <wikibugs>	 (CR) Dzahn: aptrepo: add jenkins for trixie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[13:01:12] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2341.codfw.wmnet with OS trixie
[13:05:34] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2336-2341].codfw.wmnet
[13:05:38] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2336-2341].codfw.wmnet
[13:06:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:47] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2342.codfw.wmnet with OS trixie
[13:06:58] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2343.codfw.wmnet with OS trixie
[13:07:11] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2344.codfw.wmnet with OS trixie
[13:07:22] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2345.codfw.wmnet with OS trixie
[13:07:36] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2346.codfw.wmnet with OS trixie
[13:07:47] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2347.codfw.wmnet with OS trixie
[13:08:53] <wikibugs>	 (PS1) JMeybohm: kubernetes: Don't re-define default admission_plugins [puppet] - https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507)
[13:09:48] <wikibugs>	 (CR) Muehlenhoff: [C:+2] dist-upgrade: Remove support for Buster [puppet] - https://gerrit.wikimedia.org/r/1248463 (owner: Muehlenhoff)
[13:10:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove pointless spec test [puppet] - https://gerrit.wikimedia.org/r/1248769 (owner: Muehlenhoff)
[13:10:26] <wikibugs>	 (PS2) JMeybohm: kubernetes: Don't re-define default admission_plugins [puppet] - https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507)
[13:10:41] <wikibugs>	 (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[13:12:34] <wikibugs>	 (PS1) Muehlenhoff: Remove mostly obsolete spec tests [puppet] - https://gerrit.wikimedia.org/r/1248814
[13:13:13] <wikibugs>	 (PS3) JMeybohm: kubernetes: Don't re-define default admission_plugins [puppet] - https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507)
[13:17:10] <wikibugs>	 (CR) Muehlenhoff: aptrepo: add jenkins for trixie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[13:17:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419224#11681580 (Jclark-ctr) a:Jclark-ctr
[13:17:50] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11681581 (Jclark-ctr) a:Jclark-ctr
[13:19:54] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2342.codfw.wmnet with reason: host reimage
[13:20:13] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2343.codfw.wmnet with reason: host reimage
[13:20:38] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2344.codfw.wmnet with reason: host reimage
[13:20:56] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2345.codfw.wmnet with reason: host reimage
[13:20:59] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2346.codfw.wmnet with reason: host reimage
[13:21:09] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2347.codfw.wmnet with reason: host reimage
[13:21:49] <Dreamy_Jazz>	 !log Running foreachwikiindblist checkuser-suggested-investigations.dblist ~/PopulateSiuInfo.php --batch-size=1000 for T411118
[13:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:14] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:24:28] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2342.codfw.wmnet with reason: host reimage
[13:25:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199) - https://phabricator.wikimedia.org/T416166#11681611 (Jclark-ctr) Open→Resolved
[13:26:33] <wikibugs>	 (PS2) Dzahn: aptrepo: add jenkins for trixie [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521)
[13:27:08] <wikibugs>	 (CR) Dzahn: "oh, ok! done" [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[13:28:14] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2344.codfw.wmnet with reason: host reimage
[13:28:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419224#11681622 (Jclark-ctr) Open→Resolved duplicate to T419000
[13:29:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: CPU voltage error for an-worker1231.eqiad.wmnet - https://phabricator.wikimedia.org/T419222#11681629 (Jclark-ctr) a:Jclark-ctr
[13:31:30] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2345.codfw.wmnet with reason: host reimage
[13:35:01] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2343.codfw.wmnet with reason: host reimage
[13:35:49] <wikibugs>	 (PS1) Dpogorzelski: pyrra(ML): fix updated revertrisk metric name [puppet] - https://gerrit.wikimedia.org/r/1248818 (https://phabricator.wikimedia.org/T419235)
[13:36:06] <wikibugs>	 (CR) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[13:36:20] <wikibugs>	 (PS5) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407)
[13:37:00] <wikibugs>	 (CR) Brouberol: [C:+2] deployment_server: add the kafka-mirrormaker kubeconfigs in the aux clusters [puppet] - https://gerrit.wikimedia.org/r/1248401 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[13:38:16] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2346.codfw.wmnet with reason: host reimage
[13:38:21] <wikibugs>	 (CR) CI reject: [V:-1] aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[13:42:20] <wikibugs>	 (PS1) Vgutierrez: varnish: Set custom glb_requests_limit for thumbs [puppet] - https://gerrit.wikimedia.org/r/1248819
[13:42:54] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2347.codfw.wmnet with reason: host reimage
[13:42:59] <wikibugs>	 (PS2) Vgutierrez: varnish: Set custom glb_requests_limit for thumbs [puppet] - https://gerrit.wikimedia.org/r/1248819
[13:43:20] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2342.codfw.wmnet with OS trixie
[13:44:08] <wikibugs>	 (CR) Brouberol: Add the sre.kafka.change-confluent-distro-version cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) (owner: Elukey)
[13:44:26] <wikibugs>	 (PS6) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407)
[13:45:10] <logmsgbot>	 !log dreamyjazz@deploy2002 mwscript-k8s job started: foreachwikiindblist checkuser-suggested-investigations CheckUser:queueAutoCloseSICases.php  # T418591
[13:45:13] <stashbot>	 T418591: Suggested Investigations: Run queueAutoCloseSICases.php on WMF production - https://phabricator.wikimedia.org/T418591
[13:46:20] <wikibugs>	 (CR) Vgutierrez: [V:+1] "VTCs are happy" [puppet] - https://gerrit.wikimedia.org/r/1248819 (owner: Vgutierrez)
[13:46:28] <wikibugs>	 (CR) CI reject: [V:-1] aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[13:48:06] <wikibugs>	 (PS7) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407)
[13:48:41] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2344.codfw.wmnet with OS trixie
[13:49:23] <wikibugs>	 (CR) Muehlenhoff: aptrepo: add jenkins for trixie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[13:50:12] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2345.codfw.wmnet with OS trixie
[13:51:22] <wikibugs>	 (PS1) Mszwarc: Set $wgOATH2FARequiredGroupRemovalPages for interface-admins [mediawiki-config] - https://gerrit.wikimedia.org/r/1248821 (https://phabricator.wikimedia.org/T417880)
[13:52:15] <wikibugs>	 (PS1) JMeybohm: Remove istio 1.15 wikikube config [deployment-charts] - https://gerrit.wikimedia.org/r/1248822 (https://phabricator.wikimedia.org/T341984)
[13:52:18] <wikibugs>	 (PS1) JMeybohm: Remove PSP related code from admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507)
[13:53:26] <icinga-wm>	 RECOVERY - Host an-worker1231 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms
[13:53:39] <wikibugs>	 (CR) Aqu: [C:+1] stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1248808 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[13:53:54] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:54:06] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[13:55:44] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2343.codfw.wmnet with OS trixie
[13:57:21] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2346.codfw.wmnet with OS trixie
[13:57:49] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1248806 (https://phabricator.wikimedia.org/T419111) (owner: Mszwarc)
[13:57:58] <icinga-wm>	 PROBLEM - Host an-worker1231 is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:17] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) (owner: Kgraessle)
[14:00:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11681841 (Jclark-ctr) Parts have shipped SR 223498416 Should arrive today
[14:00:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11681847 (Gehel)
[14:01:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11681854 (Gehel)
[14:01:39] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2347.codfw.wmnet with OS trixie
[14:01:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11681880 (Gehel)
[14:02:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11681886 (Gehel)
[14:02:16] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2342-2347].codfw.wmnet
[14:02:20] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2342-2347].codfw.wmnet
[14:03:22] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2348.codfw.wmnet with OS trixie
[14:03:32] <logmsgbot>	 !log blake@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2348.codfw.wmnet with OS trixie
[14:03:32] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2349.codfw.wmnet with OS trixie
[14:03:42] <logmsgbot>	 !log blake@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2349.codfw.wmnet with OS trixie
[14:03:45] <wikibugs>	 SRE-SLO, observability, Wikidata, Wikidata-Query-Service, and 3 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11681912 (Gehel)
[14:04:01] <wikibugs>	 SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27): October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11681925 (Gehel)
[14:05:12] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11681951 (Gehel)
[14:05:26] <icinga-wm>	 RECOVERY - Host an-worker1231 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[14:05:42] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2026-03-06 - 2026-03-27): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413#11681979 (Gehel)
[14:05:48] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2026-03-06 - 2026-03-27): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970#11681981 (Gehel)
[14:06:07] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2348.codfw.wmnet with OS trixie
[14:06:20] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-06 - 2026-03-27): Grant Access to ops for ebernhardson - https://phabricator.wikimedia.org/T419029#11681995 (Gehel)
[14:06:26] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2349.codfw.wmnet with OS trixie
[14:06:32] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11682083 (Gehel)
[14:06:44] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2350.codfw.wmnet with OS trixie
[14:06:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11682085 (Gehel)
[14:06:53] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2351.codfw.wmnet with OS trixie
[14:07:11] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2352.codfw.wmnet with OS trixie
[14:07:22] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2353.codfw.wmnet with OS trixie
[14:07:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199) - https://phabricator.wikimedia.org/T416166#11682100 (Gehel)
[14:08:57] <wikibugs>	 (CR) Vgutierrez: [C:+1] "this is a NOOP on LVS, no need to restart pybal" [puppet] - https://gerrit.wikimedia.org/r/1247608 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[14:10:24] <wikibugs>	 (CR) Andrew Bogott: "The json output doesn't include the member status. That plus the weird decimal ID leaves me to believe that no one is testing/fixing/maint" [software/spicerack] - https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: Andrew Bogott)
[14:10:47] <icinga-wm>	 PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sun 22 Mar 2026 02:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[14:10:48] <wikibugs>	 (CR) Andrew Bogott: toolforge etcd: update handling of 'member list' output (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: Andrew Bogott)
[14:13:57] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove mostly obsolete spec tests [puppet] - https://gerrit.wikimedia.org/r/1248814 (owner: Muehlenhoff)
[14:14:14] <jinxer-wm>	 FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:16:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:16:57] <cptk3vn>	 #wikimedia-tech
[14:17:56] <wikibugs>	 SRE, Infrastructure-Foundations: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440#11682153 (MoritzMuehlenhoff) The images will be available after the forthcoming rebuild of the weekly images on Monday morning.
[14:19:13] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2348.codfw.wmnet with reason: host reimage
[14:19:28] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2349.codfw.wmnet with reason: host reimage
[14:19:34] <wikibugs>	 (PS1) CDanis: docker-registry: lowercase path claim [puppet] - https://gerrit.wikimedia.org/r/1248829
[14:20:09] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2350.codfw.wmnet with reason: host reimage
[14:20:30] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2353.codfw.wmnet with reason: host reimage
[14:20:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:40] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2352.codfw.wmnet with reason: host reimage
[14:21:12] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2351.codfw.wmnet with reason: host reimage
[14:21:54] <wikibugs>	 (PS1) Ebernhardson: semantic: Add egress policy to relforge [deployment-charts] - https://gerrit.wikimedia.org/r/1248834
[14:22:52] <wikibugs>	 (PS2) Ebernhardson: semantic: Add egress policy to relforge [deployment-charts] - https://gerrit.wikimedia.org/r/1248834
[14:23:19] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2348.codfw.wmnet with reason: host reimage
[14:24:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.012s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:25:54] <wikibugs>	 (CR) Ebernhardson: [C:+2] semantic: Add egress policy to relforge [deployment-charts] - https://gerrit.wikimedia.org/r/1248834 (owner: Ebernhardson)
[14:26:17] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2350.codfw.wmnet with reason: host reimage
[14:26:38] <wikibugs>	 (PS1) Hashar: wm-checks-api: add tooltip to the CheckRun Run action [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1248835
[14:27:49] <wikibugs>	 (Merged) jenkins-bot: semantic: Add egress policy to relforge [deployment-charts] - https://gerrit.wikimedia.org/r/1248834 (owner: Ebernhardson)
[14:28:37] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11682215 (TheDJ)
[14:28:55] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[14:29:02] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11682219 (TheDJ)
[14:29:49] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[14:30:03] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2352.codfw.wmnet with reason: host reimage
[14:31:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 2.172% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:31:36] <wikibugs>	 (CR) Andrew Bogott: "Nope, even if I run an additional 'endpoint status' call it doesn't actually include the status of the endpoint. And the json output for '" [software/spicerack] - https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: Andrew Bogott)
[14:33:51] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2351.codfw.wmnet with reason: host reimage
[14:34:14] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:34:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.705s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:36:27] <wikibugs>	 (PS17) Elukey: Add the sre.kafka.change-confluent-distro-version cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035)
[14:37:07] <wikibugs>	 (CR) Elukey: Add the sre.kafka.change-confluent-distro-version cookbook (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) (owner: Elukey)
[14:37:14] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2349.codfw.wmnet with reason: host reimage
[14:37:32] <wikibugs>	 (PS5) Daniel Kinzler: rest-gateway: add CORS support [deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969)
[14:38:39] <wikibugs>	 (PS10) Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237)
[14:38:40] <wikibugs>	 (PS14) Andrew Bogott: toolforge etcdctl: update handling of 'member list' output [software/spicerack] - https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237)
[14:38:40] <wikibugs>	 (PS1) Andrew Bogott: toolforge etcdctl: remove get_cluster_health and associated rigging [software/spicerack] - https://gerrit.wikimedia.org/r/1248842
[14:38:57] <wikibugs>	 (PS2) JMeybohm: Remove PSP related code from admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507)
[14:39:30] <wikibugs>	 (CR) Andrew Bogott: toolforge etcdctl: update cert flag names (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) (owner: Andrew Bogott)
[14:41:03] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2353.codfw.wmnet with reason: host reimage
[14:41:14] <wikibugs>	 (CR) Elukey: [C:+1] "Paired with Riccardo over meet!" [software/cumin] - https://gerrit.wikimedia.org/r/1224035 (owner: Volans)
[14:41:32] <wikibugs>	 (CR) TChin: [C:+2] [eventgate] bump to v1.28.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1248507 (https://phabricator.wikimedia.org/T409106) (owner: TChin)
[14:41:42] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11682308 (MatthewVernon) @Ladsgroup they're only a tiny number of files, but XCF will probably likewise need addressing?
[14:43:03] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2348.codfw.wmnet with OS trixie
[14:43:08] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[14:43:20] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[14:44:24] <wikibugs>	 (Merged) jenkins-bot: [eventgate] bump to v1.28.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1248507 (https://phabricator.wikimedia.org/T409106) (owner: TChin)
[14:44:56] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[14:45:02] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2350.codfw.wmnet with OS trixie
[14:45:02] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[14:45:22] <wikibugs>	 (CR) CI reject: [V:-1] toolforge etcdctl: update cert flag names [software/spicerack] - https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) (owner: Andrew Bogott)
[14:47:13] <wikibugs>	 (CR) Dillon: [C:+1] "Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) (owner: Kgraessle)
[14:47:45] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[14:48:11] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[14:48:27] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[14:48:36] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[14:48:45] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2352.codfw.wmnet with OS trixie
[14:49:12] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[14:49:42] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[14:49:43] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:52:19] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2351.codfw.wmnet with OS trixie
[14:52:24] <wikibugs>	 (PS5) Arnaudb: gerrit: remove mod_qos [puppet] - https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615)
[14:52:54] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[14:53:39] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[14:56:45] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[14:57:15] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[14:57:33] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2349.codfw.wmnet with OS trixie
[14:58:37] <wikibugs>	 (CR) JavierMonton: "I created a new patch for it: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1236258" [deployment-charts] - https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794) (owner: JavierMonton)
[14:59:22] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2353.codfw.wmnet with OS trixie
[15:02:38] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2348-2353].codfw.wmnet
[15:02:40] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[15:02:41] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2348-2353].codfw.wmnet
[15:02:54] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[15:03:26] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2354.codfw.wmnet with OS trixie
[15:03:38] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2355.codfw.wmnet with OS trixie
[15:03:50] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2356.codfw.wmnet with OS trixie
[15:05:19] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[15:05:24] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[15:05:36] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[15:06:16] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[15:08:48] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[15:08:58] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[15:09:17] <wikibugs>	 (PS1) Kevin Bazira: ml-services: revert embeddings isvc image to one that doesn't use AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1248855 (https://phabricator.wikimedia.org/T418976)
[15:09:20] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[15:10:03] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[15:11:53] <wikibugs>	 (CR) Dpogorzelski: [C:+1] ml-services: revert embeddings isvc image to one that doesn't use AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1248855 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[15:12:33] <wikibugs>	 (CR) Ozge: [C:+2] ml-services: revert embeddings isvc image to one that doesn't use AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1248855 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[15:14:38] <wikibugs>	 (Merged) jenkins-bot: ml-services: revert embeddings isvc image to one that doesn't use AITER [deployment-charts] - https://gerrit.wikimedia.org/r/1248855 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[15:15:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:42] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[15:15:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:21] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage
[15:16:58] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2355.codfw.wmnet with reason: host reimage
[15:17:02] <wikibugs>	 (PS2) Andrew Bogott: toolforge etcdctl: remove get_cluster_health and associated rigging [software/spicerack] - https://gerrit.wikimedia.org/r/1248842
[15:17:02] <wikibugs>	 (PS11) Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237)
[15:17:02] <wikibugs>	 (PS15) Andrew Bogott: toolforge etcdctl: update handling of 'member list' output [software/spicerack] - https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237)
[15:17:07] <wikibugs>	 (PS3) JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794)
[15:17:18] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2356.codfw.wmnet with reason: host reimage
[15:17:23] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[15:17:50] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[15:19:19] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[15:19:57] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[15:22:05] <wikibugs>	 (PS1) Ebernhardson: cirrus: Use https for semanticsearch-test cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1248858
[15:23:21] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2354.codfw.wmnet with reason: host reimage
[15:23:44] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[15:24:30] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[15:25:02] <wikibugs>	 (PS1) Kevin Bazira: ml-services: rollback embeddings isvc image [deployment-charts] - https://gerrit.wikimedia.org/r/1248859 (https://phabricator.wikimedia.org/T418976)
[15:25:58] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: rollback embeddings isvc image [deployment-charts] - https://gerrit.wikimedia.org/r/1248859 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[15:26:35] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2355.codfw.wmnet with reason: host reimage
[15:26:35] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[15:26:49] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[15:27:55] <wikibugs>	 (Merged) jenkins-bot: ml-services: rollback embeddings isvc image [deployment-charts] - https://gerrit.wikimedia.org/r/1248859 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[15:28:08] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[15:28:32] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[15:28:47] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[15:30:43] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[15:31:19] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[15:31:26] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2356.codfw.wmnet with reason: host reimage
[15:32:10] <wikibugs>	 (PS1) Muehlenhoff: puppetserver: Use the hooks from Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1248862 (https://phabricator.wikimedia.org/T365798)
[15:37:25] <jinxer-wm>	 FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown  - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[15:38:40] <wikibugs>	 SRE-SLO, Abstract Wikipedia team, ServiceOps new, Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11682582 (elukey) I finally found a way to see a decent access log from the Istio gateway. It is...
[15:39:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:42:12] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2354.codfw.wmnet with OS trixie
[15:42:34] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11682595 (Aklapper)
[15:44:01] <wikibugs>	 (CR) JHathaway: [C:+1] puppetserver: Use the hooks from Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1248862 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[15:46:24] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2355.codfw.wmnet with OS trixie
[15:51:50] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2356.codfw.wmnet with OS trixie
[15:52:28] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2354-2356].codfw.wmnet
[15:52:30] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2354-2356].codfw.wmnet
[15:52:38] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Moritz!" [puppet] - https://gerrit.wikimedia.org/r/1248797 (https://phabricator.wikimedia.org/T419058) (owner: Muehlenhoff)
[15:53:06] <wikibugs>	 (CR) Hashar: [C:+1] "I am happy to help for the deployment!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248858 (owner: Ebernhardson)
[15:54:03] <wikibugs>	 (PS1) Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175)
[15:55:04] <wikibugs>	 (CR) Brouberol: [C:+2] aux-k8s: define the kafka-mirrormaker namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1248404 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[15:56:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:56:56] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:57:38] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[15:57:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:58:13] <wikibugs>	 (PS2) JMeybohm: Remove istio 1.15 wikikube config [deployment-charts] - https://gerrit.wikimedia.org/r/1248822 (https://phabricator.wikimedia.org/T341984)
[15:58:13] <wikibugs>	 (PS3) JMeybohm: Remove PSP related code from admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507)
[15:58:59] <wikibugs>	 (PS1) Tiziano Fogli: prometheus: add cardinality explosion alerts [alerts] - https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317)
[16:00:24] <wikibugs>	 (CR) Scott French: [C:+1] "Thank you, Moritz! Two optional suggestions that would probably make sense to pick up at the same time, but feel free to also not." [puppet] - https://gerrit.wikimedia.org/r/1247620 (owner: Muehlenhoff)
[16:08:54] <jinxer-wm>	 FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:39] <wikibugs>	 SRE-SLO, Abstract Wikipedia team, ServiceOps new, Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11682705 (elukey) For the orchestrator po, there is a decrease in traffic during the past couple...
[16:13:23] <wikibugs>	 (CR) JMeybohm: Remove PSP related code from admin_ng (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[16:17:06] <wikibugs>	 (PS3) Dzahn: aptrepo: add jenkins for trixie [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521)
[16:23:36] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply
[16:23:39] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[16:30:59] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11682744 (Volans) This host is still marked as Active in Netbox but disappeared from PuppetDB, if it's still broken please fix its status in Netbox.
[16:33:54] <jinxer-wm>	 FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:33:58] <jinxer-wm>	 FIRING: [3x] CertAlmostExpired: Certificate for service titan2001:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:35:00] <wikibugs>	 (CR) Herron: "Thanks for this looks great overall!  added a few optional suggestions inline." [alerts] - https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[16:36:28] <wikibugs>	 (CR) Scott French: [V:+2] "Built and tested locally:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[16:37:04] <wikibugs>	 (CR) Muehlenhoff: [C:+1] aptrepo: add jenkins for trixie [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[16:37:22] <wikibugs>	 (CR) Scott French: [V:+2] "Thanks for the review!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[16:37:35] <wikibugs>	 (CR) Scott French: [V:+2 C:+2] envoy: Support using envoy-drain-tool [docker-images/production-images] - https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[16:42:20] <wikibugs>	 (PS2) Tiziano Fogli: prometheus: add cardinality explosion alerts [alerts] - https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317)
[16:44:20] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1248808 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[16:44:23] <wikibugs>	 (CR) Dzahn: [C:+1] "a lot at once - but as far as I can tell it looks good - https://puppet-compiler.wmflabs.org/output/1248843/8233/gerrit2002.wikimedia.org/i" [puppet] - https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: Arnaudb)
[16:46:44] <wikibugs>	 (PS3) Daniel Kinzler: rest-gateway: per-path overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130)
[16:46:46] <wikibugs>	 (Merged) jenkins-bot: stream: mw-page-html-content-change-enrich-next [deployment-charts] - https://gerrit.wikimedia.org/r/1248808 (https://phabricator.wikimedia.org/T408918) (owner: JavierMonton)
[16:46:47] <wikibugs>	 (CR) Dzahn: [C:+2] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[16:46:55] <wikibugs>	 (CR) Tiziano Fogli: "Thank you for the hint. I think having a task is a really good idea, given that the alerts eventually triggered by these rules are tempora" [alerts] - https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[16:46:55] <wikibugs>	 (CR) Arnaudb: "It is a large change indeed, sorry about that. to avoid unintended side effect, I'll merge this with puppet disabled on primary and replic" [puppet] - https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: Arnaudb)
[16:47:45] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: per-path overrides (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: Daniel Kinzler)
[16:48:19] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[16:48:28] <wikibugs>	 (CR) Dzahn: [C:+2] aptrepo: add jenkins for trixie [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[16:48:30] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[16:51:35] <wikibugs>	 (CR) Dzahn: [C:+1] "the plan sounds good :)" [puppet] - https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: Arnaudb)
[16:54:56] <wikibugs>	 (CR) Dzahn: [C:+2] site: apply jenkins stub role on contint2003 [puppet] - https://gerrit.wikimedia.org/r/1248635 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[17:04:19] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-backup-datanode1033.eqiad.wmnet
[17:04:35] <wikibugs>	 (CR) Hashar: [C:+2] wm-checks-api: add tooltip to the CheckRun Run action [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1248835 (owner: Hashar)
[17:05:24] <wikibugs>	 (Merged) jenkins-bot: wm-checks-api: add tooltip to the CheckRun Run action [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1248835 (owner: Hashar)
[17:05:44] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@b8183ba]: wm-checks-api: add tooltip to the CheckRun Run action
[17:05:57] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@b8183ba]: wm-checks-api: add tooltip to the CheckRun Run action (duration: 00m 13s)
[17:10:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[17:10:07] <wikibugs>	 (PS2) Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175)
[17:11:51] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply
[17:11:55] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[17:15:47] <logmsgbot>	 btullis@cumin1003 decommission (PID 1447184) is awaiting input
[17:24:14] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:29:29] <wikibugs>	 (PS1) Dzahn: jenkins: set the CI manager host in Hiera [puppet] - https://gerrit.wikimedia.org/r/1248892 (https://phabricator.wikimedia.org/T418521)
[17:29:55] <wikibugs>	 (CR) Dzahn: [C:+2] jenkins: set the CI manager host in Hiera [puppet] - https://gerrit.wikimedia.org/r/1248892 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[17:33:05] <wikibugs>	 ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11682990 (RobH) Set to failed.  Ordering of replacement part will take place today or tomorrow, sorting the terms.
[17:40:23] <wikibugs>	 (PS3) Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175)
[17:40:51] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply
[17:40:55] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[17:41:56] <wikibugs>	 (PS4) Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175)
[17:42:13] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply
[17:42:17] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[17:42:53] <wikibugs>	 (PS1) Medelius: Suggestion mode: update link for suggestion feedback [mediawiki-config] - https://gerrit.wikimedia.org/r/1248894
[17:46:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248858 (owner: Ebernhardson)
[17:47:19] <wikibugs>	 (Merged) jenkins-bot: cirrus: Use https for semanticsearch-test cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1248858 (owner: Ebernhardson)
[17:47:43] <logmsgbot>	 !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1248858|cirrus: Use https for semanticsearch-test cluster]]
[17:50:23] <wikibugs>	 (PS1) JHathaway: mailman: disable web posting [puppet] - https://gerrit.wikimedia.org/r/1248895 (https://phabricator.wikimedia.org/T386559)
[17:50:42] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683041 (jhathaway) @bd808 based on the User-Agent, `User-Agent: HyperKitty on https://lists.wikime...
[17:50:59] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248895 (https://phabricator.wikimedia.org/T386559) (owner: JHathaway)
[17:51:03] <wikibugs>	 (PS5) Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175)
[17:51:28] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply
[17:51:32] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[17:51:46] <wikibugs>	 (PS1) Scott French: mw-debug: Pilot new drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245)
[17:52:25] <logmsgbot>	 !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1248858|cirrus: Use https for semanticsearch-test cluster]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:53:35] <logmsgbot>	 !log ebernhardson@deploy2002 ebernhardson: Continuing with sync
[17:54:06] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[17:54:14] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:55:10] <wikibugs>	 (CR) Majavah: [C:+1] mailman: disable web posting [puppet] - https://gerrit.wikimedia.org/r/1248895 (https://phabricator.wikimedia.org/T386559) (owner: JHathaway)
[17:59:03] <logmsgbot>	 !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248858|cirrus: Use https for semanticsearch-test cluster]] (duration: 11m 20s)
[18:11:15] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683076 (A_smart_kitten) >>! In T386559#11683041, @jhathaway wrote: > As briefly discussed, I think...
[18:14:14] <jinxer-wm>	 FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:23:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:28:20] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-backup-datanode1033.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003"
[18:28:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:29:30] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-backup-datanode1033.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003"
[18:29:30] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:29:31] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-backup-datanode1033.eqiad.wmnet
[18:29:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Decommission the 46 hadoop workers and 2 namenode servers that were planned for the hadoop-backup cluster - https://phabricator.wikimedia.org/T404970#11683137 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1003 for hosts: `an-backup-datanode...
[18:41:09] <wikibugs>	 (CR) Ebernhardson: [C:+1] "I don't know about the full implementation, but we are having an issue with opensearch being very strict about how an x-request-id is form" [deployment-charts] - https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) (owner: Btullis)
[18:44:43] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp2043.*
[18:45:22] <wikibugs>	 (PS1) SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - https://gerrit.wikimedia.org/r/1248906 (https://phabricator.wikimedia.org/T419265)
[18:49:43] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:56:22] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2043.codfw.wmnet with reason: troubleshooting for network drops
[18:59:17] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1248906 (https://phabricator.wikimedia.org/T419265) (owner: SBassett)
[19:00:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: CPU voltage error for an-worker1231.eqiad.wmnet - https://phabricator.wikimedia.org/T419222#11683236 (Jclark-ctr) Open→Resolved performed flea power drain updated bios and did 2 reboots with out the issue returning
[19:02:42] <wikibugs>	 (PS1) Zabe: Use Hadoop for Mostcategories on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1248909 (https://phabricator.wikimedia.org/T413362)
[19:10:41] <wikibugs>	 (CR) Scott French: [C:+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - https://gerrit.wikimedia.org/r/1248906 (https://phabricator.wikimedia.org/T419265) (owner: SBassett)
[19:11:41] <wikibugs>	 (CR) Dzahn: [C:+2] "[apt1002:/srv/wikimedia] $ sudo -i reprepro -C thirdparty/jenkins checkupdate trixie-wikimedia" [puppet] - https://gerrit.wikimedia.org/r/1248641 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[19:12:05] <wikibugs>	 (PS1) Zabe: Stop writing to il_to on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1248911 (https://phabricator.wikimedia.org/T415787)
[19:17:43] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on wdqs2009.codfw.wmnet with reason: NFS might be hung, about to reboot
[19:20:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:23:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2009.codfw.wmnet
[19:29:16] <wikibugs>	 (CR) Dduvall: [C:+1] docker-registry: lowercase path claim [puppet] - https://gerrit.wikimedia.org/r/1248829 (owner: CDanis)
[19:29:25] <wikibugs>	 (CR) CDanis: [C:+2] docker-registry: lowercase path claim [puppet] - https://gerrit.wikimedia.org/r/1248829 (owner: CDanis)
[19:35:59] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683365 (jhathaway) >>! In T386559#11683076, @A_smart_kitten wrote: >>>! In T386559#11683041, @jhat...
[19:36:37] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host wdqs2009.codfw.wmnet
[19:37:25] <jinxer-wm>	 FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown  - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[19:38:55] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683374 (A_smart_kitten) (To be clear, personally, I could probably try and configure my email setu...
[19:40:43] <wikibugs>	 (CR) Scott French: "Thanks in advance for the review, Reuven!" [deployment-charts] - https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[19:41:38] <wikibugs>	 SRE, serviceops-deprecated, WMDE-TechWish-Maintenance, Maps (Geoshapes), Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388#11683390 (Jdforrester-WMF) Open→Declined Codebase is being archived, see T418372.
[19:46:39] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[19:47:04] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[19:47:40] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[19:48:24] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[20:04:05] <wikibugs>	 (CR) Michael Große: [C:+1] [Growth] Enable on every new Wikipedia by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1239954 (https://phabricator.wikimedia.org/T304052) (owner: Urbanecm)
[20:11:44] <wikibugs>	 (CR) RLazarus: [C:+1] mw-debug: Pilot new drain configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[20:29:35] <wikibugs>	 (CR) Scott French: "Thanks, Reuven!" [deployment-charts] - https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[20:34:13] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[20:34:14] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:36:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11683594 (Jgreen) a:Jgreen→None
[20:48:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:14] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:42:49] <wikibugs>	 ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298 (phaultfinder) NEW
[21:47:50] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683842 (bd808) >>! In T386559#11683374, @A_smart_kitten wrote: > (To be clear, personally, I could...
[21:54:06] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[21:54:14] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:58:39] <wikibugs>	 (Abandoned) Clare Ming: Remove Metrics Platform config [mediawiki-config] - https://gerrit.wikimedia.org/r/1248695 (https://phabricator.wikimedia.org/T417568) (owner: Clare Ming)
[22:00:47] <wikibugs>	 SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11683855 (herron)
[22:01:58] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: remove stale legacy-full-gui release entry [deployment-charts] - https://gerrit.wikimedia.org/r/1248694 (https://phabricator.wikimedia.org/T415073) (owner: Ryan Kemper)
[22:04:01] <wikibugs>	 (Merged) jenkins-bot: wdqs: remove stale legacy-full-gui release entry [deployment-charts] - https://gerrit.wikimedia.org/r/1248694 (https://phabricator.wikimedia.org/T415073) (owner: Ryan Kemper)
[22:08:35] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683877 (A_smart_kitten) >>! In T386559#11683842, @bd808 wrote: > Are you using the web UI because...
[22:14:14] <jinxer-wm>	 FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:23:51] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11683912 (jhathaway) if `/message/new` is the correct route, here is the count of usage from 03-05:...
[22:26:13] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11683914 (bd808)
[22:32:47] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11683933 (A_smart_kitten) >>! In T386559#11683912, @jhathaway wrote: > if `/me...
[22:39:49] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2009.codfw.wmnet with OS bullseye
[22:40:20] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2009
[22:40:31] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11683946 (jhathaway) >>! In T386559#11683933, @A_smart_kitten wrote: >>>! In T...
[22:41:27] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[22:45:51] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2009 - ryankemper@cumin2002"
[22:45:56] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2009 - ryankemper@cumin2002"
[22:45:57] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:45:57] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs2009.codfw.wmnet 141.0.192.10.in-addr.arpa 1.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[22:46:01] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2009.codfw.wmnet 141.0.192.10.in-addr.arpa 1.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[22:46:02] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2009
[22:46:14] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2009
[22:46:15] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2009
[23:07:42] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage
[23:13:28] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage
[23:20:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:29:43] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2009.codfw.wmnet with OS bullseye
[23:37:25] <jinxer-wm>	 FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown  - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown