[00:11:58] <wikibugs>	 (PS1) Ladsgroup: Close Italian Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283117 (https://phabricator.wikimedia.org/T421796)
[00:18:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283117 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[00:19:11] <wikibugs>	 (Merged) jenkins-bot: Close Italian Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283117 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[00:19:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:58] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283117|Close Italian Wikinews (T421796)]]
[00:20:01] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[00:21:55] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283117|Close Italian Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:23:07] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[00:24:53] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1001.eqiad.wmnet with OS trixie
[00:25:14] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.hosts.move-vlan for host kafka-logging1001
[00:25:14] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-logging1001
[00:27:25] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283117|Close Italian Wikinews (T421796)]] (duration: 07m 26s)
[00:27:28] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[00:27:42] <wikibugs>	 (PS1) Herron: kafka-logging1001: prep for trixie [puppet] - https://gerrit.wikimedia.org/r/1283139 (https://phabricator.wikimedia.org/T417001)
[00:29:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[00:34:55] <icinga-wm>	 PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:35:24] <papaul>	 that is me ^
[00:37:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[00:38:43] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:41:24] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: host reimage
[00:43:45] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 64.56 ms
[00:44:25] <wikibugs>	 (PS2) Ladsgroup: Close Dutch Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796)
[00:44:37] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:45:30] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: host reimage
[00:45:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[00:46:05] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11892483 (Papaul)
[00:46:16] <wikibugs>	 (CR) Neriah: [C:+1] Close Dutch Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[00:47:16] <wikibugs>	 (Merged) jenkins-bot: Close Dutch Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[00:47:40] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283061|Close Dutch Wikinews (T421796)]]
[00:47:44] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[00:49:36] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283061|Close Dutch Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:49:57] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[00:51:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[00:54:06] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283061|Close Dutch Wikinews (T421796)]] (duration: 06m 26s)
[00:54:09] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[01:04:36] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[01:05:20] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1001.eqiad.wmnet with OS trixie
[01:10:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1283160
[01:10:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1283160 (owner: TrainBranchBot)
[01:21:24] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1283160 (owner: TrainBranchBot)
[01:35:44] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:00:41] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:19] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 37s)
[02:07:56] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11892594 (EvenTwist41) I understand the need to stop generating arbitrary thumbnail sizes, but was it really necessary to break exi...
[02:09:21] <jinxer-wm>	 FIRING: [8x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:07] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[02:17:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[02:26:18] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11892598 (Papaul) All the servers in rack 23 are online and ready for re-image. I tested the re-image on cp4038 and completed with no issues after @ayounsi...
[02:34:21] <jinxer-wm>	 FIRING: [8x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:11] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 3989.41 ms
[02:51:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[02:53:42] <wikibugs>	 (PS1) Dzahn: microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991)
[02:54:01] <wikibugs>	 (CR) CI reject: [V:-1] microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991) (owner: Dzahn)
[02:54:11] <wikibugs>	 (PS2) Dzahn: microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991)
[02:54:20] <wikibugs>	 (CR) ArielGlenn: "Right, here's my official +1, ok by me to merge." [deployment-charts] - https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: Daniel Kinzler)
[02:54:25] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 309.00 ms
[02:54:42] <wikibugs>	 (CR) Dzahn: [C:+2] microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991) (owner: Dzahn)
[03:11:07] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:16:52] <jinxer-wm>	 RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:17:03] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:52] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:22:05] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 237.59 ms
[03:22:15] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: Danielyepezgarces)
[03:26:54] <logmsgbot>	 !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply
[03:27:02] <logmsgbot>	 !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply
[03:27:30] <logmsgbot>	 !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[03:28:02] <logmsgbot>	 !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[03:30:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[03:35:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[03:36:46] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[03:40:36] <wikibugs>	 (PS1) Andrew Bogott: magnum setup_capi.sh: export some vars [puppet] - https://gerrit.wikimedia.org/r/1283239
[03:41:40] <wikibugs>	 (CR) Andrew Bogott: [C:+2] magnum setup_capi.sh: export some vars [puppet] - https://gerrit.wikimedia.org/r/1283239 (owner: Andrew Bogott)
[04:06:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[04:06:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[04:19:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:21:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[04:21:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[04:33:47] <icinga-wm>	 PROBLEM - Host lvs4008 is DOWN: PING CRITICAL - Packet loss = 100%
[04:33:51] <icinga-wm>	 PROBLEM - Host lvs4010 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:07] <icinga-wm>	 PROBLEM - Host cp4040 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:07] <icinga-wm>	 PROBLEM - Host cp4042 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:07] <icinga-wm>	 PROBLEM - Host cp4046 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:07] <icinga-wm>	 PROBLEM - Host cp4044 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:07] <icinga-wm>	 PROBLEM - Host cp4050 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:07] <icinga-wm>	 PROBLEM - Host cp4052 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:07] <icinga-wm>	 PROBLEM - Host ganeti4006 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:08] <icinga-wm>	 PROBLEM - Host cp4048 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:08] <icinga-wm>	 PROBLEM - Host dns4004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:09] <icinga-wm>	 PROBLEM - Host ganeti4008 is DOWN: PING CRITICAL - Packet loss = 100%
[04:35:11] <jinxer-wm>	 FIRING: [4x] GanetiBGPDown: BGP session down between ganeti4006 and asw1-23-ulsfo - group ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown  - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[04:35:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[04:40:10] <wikibugs>	 (CR) Clare Ming: [C:+2] Test Kitchen UI: Deploy v1.3.2 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1282420 (https://phabricator.wikimedia.org/T419511) (owner: Santiago Faci)
[04:42:34] <wikibugs>	 (Merged) jenkins-bot: Test Kitchen UI: Deploy v1.3.2 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1282420 (https://phabricator.wikimedia.org/T419511) (owner: Santiago Faci)
[05:03:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1223 with weight 0 T425318', diff saved to https://phabricator.wikimedia.org/P92342 and previous config saved to /var/cache/conftool/dbconfig/20260506-050342-marostegui.json
[05:03:46] <stashbot>	 T425318: Switchover s3 master (db1189 -> db1223) - https://phabricator.wikimedia.org/T425318
[05:03:54] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T425318
[05:03:56] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1223 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1282279 (https://phabricator.wikimedia.org/T425318) (owner: Gerrit maintenance bot)
[05:04:36] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[05:06:54] <marostegui>	 !log Starting s3 eqiad failover from db1189 to db1223 - T425318
[05:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T425318', diff saved to https://phabricator.wikimedia.org/P92343 and previous config saved to /var/cache/conftool/dbconfig/20260506-050755-marostegui.json
[05:08:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1223 to s3 primary and set section read-write T425318', diff saved to https://phabricator.wikimedia.org/P92344 and previous config saved to /var/cache/conftool/dbconfig/20260506-050816-marostegui.json
[05:09:03] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1282280 (https://phabricator.wikimedia.org/T425318) (owner: Gerrit maintenance bot)
[05:09:09] <logmsgbot>	 !log marostegui@dns1004 START - running authdns-update
[05:09:49] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1189 T425318', diff saved to https://phabricator.wikimedia.org/P92345 and previous config saved to /var/cache/conftool/dbconfig/20260506-050948-marostegui.json
[05:09:52] <stashbot>	 T425318: Switchover s3 master (db1189 -> db1223) - https://phabricator.wikimedia.org/T425318
[05:11:01] <logmsgbot>	 !log marostegui@dns1004 END - running authdns-update
[05:12:29] <icinga-wm>	 PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:12:29] <wikibugs>	 (PS1) Marostegui: db1189: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1283306 (https://phabricator.wikimedia.org/T424792)
[05:13:52] <wikibugs>	 (CR) Marostegui: [C:+2] db1189: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1283306 (https://phabricator.wikimedia.org/T424792) (owner: Marostegui)
[05:14:21] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1189.eqiad.wmnet with reason: Reimage to Trixie
[05:14:26] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1189: Reimage to Trixie
[05:14:29] <icinga-wm>	 RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[05:14:33] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1189: Reimage to Trixie
[05:15:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[05:16:54] <wikibugs>	 (CR) ArielGlenn: [C:+1] "Let's see what the impact is." [deployment-charts] - https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) (owner: Daniel Kinzler)
[05:17:30] <icinga-wm>	 PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:18:37] <logmsgbot>	 marostegui@cumin1003 reimage (PID 455315) is awaiting input
[05:19:37] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1189.eqiad.wmnet with OS trixie
[05:21:12] <wikibugs>	 (PS1) Marostegui: db1191,db2208: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1283483 (https://phabricator.wikimedia.org/T425388)
[05:21:49] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1191.eqiad.wmnet with reason: Reimage to Trixie
[05:21:54] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1191: Reimage to Trixie
[05:22:22] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1191: Reimage to Trixie
[05:23:37] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS trixie
[05:24:22] <wikibugs>	 (CR) Marostegui: [C:+2] db1191,db2208: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1283483 (https://phabricator.wikimedia.org/T425388) (owner: Marostegui)
[05:24:30] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2208.codfw.wmnet with reason: Reimage to Trixie
[05:24:36] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2208: Reimage to Trixie
[05:25:05] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2208: Reimage to Trixie
[05:26:39] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2208.codfw.wmnet with reason: Reimage to Trixie
[05:26:45] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2208: Reimage to Trixie
[05:26:52] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2208: Reimage to Trixie
[05:33:52] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage
[05:35:59] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:35:59] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2208 PXE boot change not accessible - https://phabricator.wikimedia.org/T425506#11892785 (Marostegui) p:Triage→Medium I've reset the idrac but I still cannot reimage the host as I get that error - maybe this needs something else?
[05:37:41] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
[05:39:28] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage
[05:43:30] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage
[05:47:05] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2208.codfw.wmnet with reason: Idrac issues T425506
[05:47:09] <stashbot>	 T425506: db2208 PXE boot change not accessible - https://phabricator.wikimedia.org/T425506
[05:52:23] <wikibugs>	 (PS1) Ayounsi: mr1-ulsfo: remove device specific security_zones definition [homer/public] - https://gerrit.wikimedia.org/r/1283503 (https://phabricator.wikimedia.org/T421674)
[05:55:13] <wikibugs>	 (PS1) Marostegui: Revert "db1189: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1283507
[06:01:41] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1189.eqiad.wmnet with OS trixie
[06:06:30] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1189: after reimage to trixie
[06:06:52] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS trixie
[06:09:22] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1191: after reimage to trixie
[06:17:12] <wikibugs>	 (CR) Marostegui: [C:+2] db1191: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1283519 (https://phabricator.wikimedia.org/T425388) (owner: Marostegui)
[06:20:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4006.ulsfo.wmnet
[06:24:32] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] o11y/global: disable seasonality checks for small prom instances [alerts] - https://gerrit.wikimedia.org/r/1282935 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[06:24:42] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] o11y/global: adjust formatting [alerts] - https://gerrit.wikimedia.org/r/1282934 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[06:26:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[06:26:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[06:26:38] <wikibugs>	 (Merged) jenkins-bot: o11y/global: adjust formatting [alerts] - https://gerrit.wikimedia.org/r/1282934 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[06:26:52] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[06:26:57] <wikibugs>	 (Merged) jenkins-bot: o11y/global: disable seasonality checks for small prom instances [alerts] - https://gerrit.wikimedia.org/r/1282935 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli)
[06:27:08] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: DCausse)
[06:30:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4006.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[06:34:04] <logmsgbot>	 jmm@cumin2002 decommission (PID 2485932) is awaiting input
[06:34:37] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:40:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:41:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:45:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:45:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:47:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:48:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4006.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[06:48:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:48:06] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti4006.ulsfo.wmnet
[06:48:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4008.ulsfo.wmnet
[06:48:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[06:48:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[06:50:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:51:41] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1189: after reimage to trixie
[06:52:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:52:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:54:48] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1191: after reimage to trixie
[06:55:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[06:58:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[06:58:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[06:59:04] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T0700).
[07:00:05] <jouncebot>	 awight, WMDE-Fisch, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:12] <dcausse>	 o/
[07:00:15] <awight>	 I can deploy the first patches
[07:00:28] <dcausse>	 sure
[07:01:28] <logmsgbot>	 jmm@cumin2002 decommission (PID 2504587) is awaiting input
[07:01:35] <wikibugs>	 (03PS1) 10Slyngshede: Hieradata: Update IPs for ULSFO CP hosts in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283536 (https://phabricator.wikimedia.org/T424686)
[07:04:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:04:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:06:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:06:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:07:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283033 (https://phabricator.wikimedia.org/T425433) (owner: 10Svantje Lilienthal)
[07:11:09] <WMDE-Fisch>	 \o
[07:13:02] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2208 PXE boot change not accessible - https://phabricator.wikimedia.org/T425506#11892872 (10elukey) ` Traceback (most recent call last):   File "/usr/lib/python3/dist-packages/spicerack/redfish.py", line 382, in request     return self._api_client.request(method, uri,...
[07:13:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4008.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:14:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:14:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4008.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:14:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:14:41] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti4008.ulsfo.wmnet
[07:14:43] <wikibugs>	 (03PS1) 10Slyngshede: Hieradata: Update IPs for ULSFO lvs hosts in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283545 (https://phabricator.wikimedia.org/T424686)
[07:15:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:16:20] <awight>	 dcausse: This is slow.  Would you like me to include your patches in my next SpiderPig batch?  There's ~0% chance I would need to roll back the feature patch so the focus would be on your config changes, if you agree.
[07:16:22] <wikibugs>	 (03CR) 10Tiziano Fogli: CoreRouterInterfaceDropPercent: fix ping disable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi)
[07:17:28] <dcausse>	 awight: sure!
[07:17:37] <awight>	 :-)
[07:18:00] <dcausse>	 and I apologize in advance if something goes wrong with mine :)
[07:18:16] <wikibugs>	 (03PS1) 10Slyngshede: Hieradata: Update IPs for ULSFO dns host in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283548 (https://phabricator.wikimedia.org/T424686)
[07:18:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:18:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:18:34] <wikibugs>	 (03Merged) 10jenkins-bot: VE: Avoid counting all refs when listIndex is undefined [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283033 (https://phabricator.wikimedia.org/T425433) (owner: 10Svantje Lilienthal)
[07:18:44] <wikibugs>	 (03CR) 10Awight: [C:03+1] search: fix alt. completion indices to test keyword tokenizer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse)
[07:19:17] <logmsgbot>	 !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1283033|VE: Avoid counting all refs when listIndex is undefined (T425433)]]
[07:19:20] <stashbot>	 T425433: Wrong reuse message when creating a new reference - https://phabricator.wikimedia.org/T425433
[07:20:56] <wikibugs>	 (03CR) 10Awight: [C:03+1] search: enable Latin-to-Devanagari transliteration second-chance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: 10DCausse)
[07:20:57] <wikibugs>	 (03PS1) 10Elukey: Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664)
[07:21:17] <logmsgbot>	 !log awight@deploy1003 awight, lilients: Backport for [[gerrit:1283033|VE: Avoid counting all refs when listIndex is undefined (T425433)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:21:26] <wikibugs>	 (03PS2) 10Elukey: Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664)
[07:21:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:21:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:22:18] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[07:22:39] <logmsgbot>	 !log awight@deploy1003 awight, lilients: Continuing with deployment
[07:23:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[07:25:21] <wikibugs>	 (03CR) 10Awight: [C:03+1] VE: Avoid counting all refs when listIndex is undefined [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283101 (https://phabricator.wikimedia.org/T425433) (owner: 10WMDE-Fisch)
[07:26:05] <awight>	 dcausse: Just to confirm, is there anything to test as your patches are deployed?  I imagine that the search index takes some time to rebuild...
[07:26:45] <dcausse>	 awight: yes, only the hindi transliteration will need a bit of testing
[07:26:55] <logmsgbot>	 !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283033|VE: Avoid counting all refs when listIndex is undefined (T425433)]] (duration: 07m 37s)
[07:26:58] <stashbot>	 T425433: Wrong reuse message when creating a new reference - https://phabricator.wikimedia.org/T425433
[07:27:12] <awight>	 dcausse: okay I'll wait then, once we get to the test phase
[07:27:21] <dcausse>	 thanks!
[07:28:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283101 (https://phabricator.wikimedia.org/T425433) (owner: 10WMDE-Fisch)
[07:28:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse)
[07:28:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: 10DCausse)
[07:29:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:30:14] <wikibugs>	 (03Merged) 10jenkins-bot: search: fix alt. completion indices to test keyword tokenizer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse)
[07:30:22] <wikibugs>	 (03Merged) 10jenkins-bot: search: enable Latin-to-Devanagari transliteration second-chance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: 10DCausse)
[07:30:54] <wikibugs>	 (03Merged) 10jenkins-bot: VE: Avoid counting all refs when listIndex is undefined [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283101 (https://phabricator.wikimedia.org/T425433) (owner: 10WMDE-Fisch)
[07:31:24] <logmsgbot>	 !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1283101|VE: Avoid counting all refs when listIndex is undefined (T425433)]], [[gerrit:1283037|search: fix alt. completion indices to test keyword tokenizer (T420427)]], [[gerrit:1283041|search: enable Latin-to-Devanagari transliteration second-chance (T425018)]]
[07:31:31] <stashbot>	 T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427
[07:31:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:31:32] <stashbot>	 T425018: Enable Latin-to-Devanagari Transliteration second-try search on Hindi Wikis - https://phabricator.wikimedia.org/T425018
[07:32:35] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "LGTM, double checked with netbox" [puppet] - 10https://gerrit.wikimedia.org/r/1283536 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede)
[07:33:19] <logmsgbot>	 !log awight@deploy1003 wmde-fisch, awight, dcausse: Backport for [[gerrit:1283101|VE: Avoid counting all refs when listIndex is undefined (T425433)]], [[gerrit:1283037|search: fix alt. completion indices to test keyword tokenizer (T420427)]], [[gerrit:1283041|search: enable Latin-to-Devanagari transliteration second-chance (T425018)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:33:25] <stashbot>	 T425433: Wrong reuse message when creating a new reference - https://phabricator.wikimedia.org/T425433
[07:33:32] <dcausse>	 testing
[07:33:33] <wikibugs>	 (03CR) 10Tiziano Fogli: "I noticed the same alert is defined in team-data-platform/stat_host.yaml. To improve maintainability, you could leverage YAML anchors and " [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking)
[07:33:45] <awight>	 ty
[07:34:49] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1283545 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede)
[07:34:53] <wikibugs>	 (03PS3) 10Elukey: Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664)
[07:35:32] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "Ok for me!" [puppet] - 10https://gerrit.wikimedia.org/r/1283548 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede)
[07:35:59] <awight>	 Cite change lgtm
[07:36:07] <dcausse>	 awight: lgtm
[07:36:10] <awight>	 ack!
[07:36:14] <logmsgbot>	 !log awight@deploy1003 wmde-fisch, awight, dcausse: Continuing with deployment
[07:36:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:36:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:36:46] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[07:40:23] <logmsgbot>	 !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283101|VE: Avoid counting all refs when listIndex is undefined (T425433)]], [[gerrit:1283037|search: fix alt. completion indices to test keyword tokenizer (T420427)]], [[gerrit:1283041|search: enable Latin-to-Devanagari transliteration second-chance (T425018)]] (duration: 08m 58s)
[07:40:28] <stashbot>	 T425433: Wrong reuse message when creating a new reference - https://phabricator.wikimedia.org/T425433
[07:40:29] <stashbot>	 T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427
[07:40:29] <stashbot>	 T425018: Enable Latin-to-Devanagari Transliteration second-try search on Hindi Wikis - https://phabricator.wikimedia.org/T425018
[07:41:00] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE, 06ServiceOps new: Alert in need of triage: Kafka MirrorMaker main-codfw_to_main-eqiad dropped message count in last 30m (instance alert1002) - https://phabricator.wikimedia.org/T425339#11892950 (10JMeybohm) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282919...
[07:41:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:42:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:45:39] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:46:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:46:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:46:39] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1 C:03+2] Revert "envoyproxy: global_tlsparams" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:46:43] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Add STEK configuration support" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:46:48] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Provide support for UDS upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1282345 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:46:53] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Support alpn_protocols configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1282344 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:46:57] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Support TLS min/max version config" [puppet] - 10https://gerrit.wikimedia.org/r/1282343 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:47:03] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Allow configuring TLS handshake timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282342 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:47:07] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Allow setting http2 protocol options" [puppet] - 10https://gerrit.wikimedia.org/r/1282341 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:47:13] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2208.codfw.wmnet with OS trixie
[07:47:13] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Allow disabling x-request-id generation" [puppet] - 10https://gerrit.wikimedia.org/r/1282340 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:47:18] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoy: Allow disabling circuit breakers" [puppet] - 10https://gerrit.wikimedia.org/r/1282339 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:47:24] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "envoy: Allow configuring delayed_closed_timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282338 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm)
[07:48:19] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2208 PXE boot change not accessible - https://phabricator.wikimedia.org/T425506#11892986 (10Marostegui) 05Open→03Resolved After a cold restart it worked - seems that it needed more time after the cold reboot.
[07:48:48] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[07:49:00] <wikibugs>	 (03CR) 10Muehlenhoff: Set pki1001 to insetup to ease decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[07:50:55] <wikibugs>	 (03CR) 10Muehlenhoff: Set pki1001 to insetup to ease decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[07:51:46] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:52:53] <wikibugs>	 (03PS4) 10Elukey: Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664)
[07:53:18] <wikibugs>	 (03CR) 10Elukey: Set pki1001 to insetup to ease decom (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[07:53:38] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[07:55:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:55:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:57:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:58:52] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2144: Replacing HW T418979
[07:58:52] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[07:58:55] <stashbot>	 T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979
[07:59:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[07:59:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2144: Replacing HW T418979
[08:00:09] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Replacing hw
[08:00:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:01:35] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db2253 [puppet] - 10https://gerrit.wikimedia.org/r/1283619 (https://phabricator.wikimedia.org/T418979)
[08:02:19] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2253 [puppet] - 10https://gerrit.wikimedia.org/r/1283619 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui)
[08:02:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:02:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:02:50] <wikibugs>	 (03PS1) 10Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620
[08:03:41] <wikibugs>	 (03PS2) 10Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620
[08:06:27] <awight>	 !log EU morning deployment is done
[08:06:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:27] <wikibugs>	 (03PS1) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621
[08:08:30] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2208.codfw.wmnet with OS trixie
[08:09:09] <wikibugs>	 (03CR) 10Majavah: "nrpe::plugin places files into a directory that has `recurse => true, purge => true`, so having a plugin defined as `ensure => absent` and" [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff)
[08:09:13] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey)
[08:09:25] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2208.codfw.wmnet with OS trixie
[08:10:30] <icinga-wm>	 RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[08:14:29] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516 (10Marostegui) 03NEW
[08:15:45] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516#11893166 (10Marostegui) p:05Triage→03Medium
[08:16:12] <wikibugs>	 (03Abandoned) 10STran: Add exposure for experiment instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280387 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[08:16:20] <wikibugs>	 (03Abandoned) 10STran: Fix incorrect source in back instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280386 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[08:19:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:51] <wikibugs>	 (03PS1) 10Jelto: miscweb: add emptyDir to wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283627 (https://phabricator.wikimedia.org/T414405)
[08:24:05] <wikibugs>	 (03PS1) 10Marostegui: installserver: Add hosts for uefi /srv reusage [puppet] - 10https://gerrit.wikimedia.org/r/1283629
[08:25:16] <wikibugs>	 (03PS2) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621
[08:25:57] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey)
[08:26:59] <wikibugs>	 (03CR) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey)
[08:28:02] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] add discovery names for gitlab [dns] - 10https://gerrit.wikimedia.org/r/1282437 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn)
[08:29:02] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Add hosts for uefi /srv reusage [puppet] - 10https://gerrit.wikimedia.org/r/1283629 (owner: 10Marostegui)
[08:29:28] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2208.codfw.wmnet with OS trixie
[08:31:25] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "looks good to me, thanks! I'm wondering how we'll advertise the port migration that will have to be done by gitlab users." [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn)
[08:31:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:31:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:31:37] <zabe>	 jouncebot: nowandnext
[08:31:37] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 28 minute(s)
[08:31:37] <jouncebot>	 In 1 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1000)
[08:32:09] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Correctly support new file tables in RevisionDeleteUser [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe)
[08:32:13] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] add load balancer IPs for gitlab to geo DNS [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn)
[08:32:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:35:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:36:09] <wikibugs>	 (03PS2) 10Cathal Mooney: CoreRouterInterfaceDropPercent: fix ping disable [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi)
[08:36:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:36:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:38:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey)
[08:38:30] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add db2253 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1283638 (https://phabricator.wikimedia.org/T418979)
[08:38:34] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1039.eqiad.wmnet with reason: Maintenance
[08:38:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1039 (T419961)', diff saved to https://phabricator.wikimedia.org/P92357 and previous config saved to /var/cache/conftool/dbconfig/20260506-083841-fceratto.json
[08:39:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:39:50] <wikibugs>	 (03PS3) 10Cathal Mooney: CoreRouterInterfaceDropPercent: fix ping disable [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi)
[08:39:52] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2253 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1283638 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui)
[08:40:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch pki2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283640 (https://phabricator.wikimedia.org/T416664)
[08:40:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:41:59] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283640 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[08:42:21] <wikibugs>	 (03CR) 10Cathal Mooney: CoreRouterInterfaceDropPercent: fix ping disable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi)
[08:42:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:43:29] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] kafka-main: add eqiad-main cluster brokers to Confluent distro 77 (3.7) [puppet] - 10https://gerrit.wikimedia.org/r/1282999 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine)
[08:43:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2253 to ms2 T418973', diff saved to https://phabricator.wikimedia.org/P92358 and previous config saved to /var/cache/conftool/dbconfig/20260506-084337-marostegui.json
[08:43:41] <stashbot>	 T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973
[08:43:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Correctly support new file tables in RevisionDeleteUser [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe)
[08:44:21] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Hieradata: Update IPs for ULSFO CP hosts in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283536 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede)
[08:45:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:46:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:46:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:48:07] <wikibugs>	 (03CR) 10Zabe: [C:03+2] "..." [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe)
[08:49:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:50:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:51:06] <wikibugs>	 (03PS11) 10JMeybohm: tlsproxy::envoy: Support ratelimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert)
[08:51:23] <wikibugs>	 (03PS2) 10Tiziano Fogli: add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) (owner: 10Cwhite)
[08:51:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:52:12] <wikibugs>	 (03PS1) 10Marostegui: db2253: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283644 (https://phabricator.wikimedia.org/T418979)
[08:52:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:52:46] <wikibugs>	 (03PS1) 10Majavah: P:ssl: Renew toolsbeta Prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/1283645
[08:53:08] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2253: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283644 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui)
[08:53:16] <wikibugs>	 (03Merged) 10jenkins-bot: Correctly support new file tables in RevisionDeleteUser [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe)
[08:53:21] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1039 (T419961)', diff saved to https://phabricator.wikimedia.org/P92359 and previous config saved to /var/cache/conftool/dbconfig/20260506-085321-fceratto.json
[08:54:41] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1281894|Correctly support new file tables in RevisionDeleteUser (T424553)]]
[08:55:01] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:ssl: Renew toolsbeta Prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/1283645 (owner: 10Majavah)
[08:55:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:55:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:56:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:56:39] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1281894|Correctly support new file tables in RevisionDeleteUser (T424553)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:57:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:59:07] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with deployment
[08:59:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:00:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:01:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:01:43] <wikibugs>	 06SRE, 10Observability-Metrics, 10Prod-Kubernetes, 06ServiceOps new, and 2 others: write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11893353 (10MLechvien-WMF) @hnowlan can I confirm if we are targeting to complete this this quarter?
[09:03:25] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281894|Correctly support new file tables in RevisionDeleteUser (T424553)]] (duration: 08m 44s)
[09:03:30] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1039', diff saved to https://phabricator.wikimedia.org/P92360 and previous config saved to /var/cache/conftool/dbconfig/20260506-090329-fceratto.json
[09:04:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:04:36] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[09:06:34] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:07:07] <wikibugs>	 (03CR) 10JMeybohm: "Shouldn't this change remove/move `modules/confluent/files/kafka/kafka3.sh` (to the template)?" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey)
[09:07:24] <wikibugs>	 (03PS3) 10Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620
[09:07:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff)
[09:10:07] <wikibugs>	 (03PS4) 10Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620
[09:13:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1039', diff saved to https://phabricator.wikimedia.org/P92361 and previous config saved to /var/cache/conftool/dbconfig/20260506-091337-fceratto.json
[09:14:04] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2253: Replacing HW T418979
[09:14:04] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[09:14:04] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99)
[09:14:04] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2253: Replacing HW T418979
[09:14:08] <stashbot>	 T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979
[09:14:34] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[09:15:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms2 T418979', diff saved to https://phabricator.wikimedia.org/P92362 and previous config saved to /var/cache/conftool/dbconfig/20260506-091513-marostegui.json
[09:15:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:15:37] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2005.codfw.wmnet with reason: update
[09:16:34] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:16:53] <logmsgbot>	 !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[09:17:22] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS trixie
[09:17:25] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[09:23:46] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1039 (T419961)', diff saved to https://phabricator.wikimedia.org/P92363 and previous config saved to /var/cache/conftool/dbconfig/20260506-092345-fceratto.json
[09:23:53] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4006/8 mgmt - ayounsi@cumin1003"
[09:24:06] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1040.eqiad.wmnet with reason: Maintenance
[09:24:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1040 (T419961)', diff saved to https://phabricator.wikimedia.org/P92364 and previous config saved to /var/cache/conftool/dbconfig/20260506-092414-fceratto.json
[09:26:58] <logmsgbot>	 ayounsi@cumin1003 netbox (PID 501175) is awaiting input
[09:27:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:27:44] <wikibugs>	 (03PS1) 10Marostegui: db2253: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1283664
[09:27:58] <wikibugs>	 (03PS3) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621
[09:28:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:28:46] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2253: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1283664 (owner: 10Marostegui)
[09:29:13] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4006/8 mgmt - ayounsi@cumin1003"
[09:29:13] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:31:01] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff)
[09:31:07] <wikibugs>	 (03CR) 10Muehlenhoff: "This wasn't about /usr/local/lib/nagios/plugins, but the /etc/nagios/nrpe.d cfg still present after moving from nftables->ferm. For " [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff)
[09:31:47] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:32:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:32:46] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey)
[09:35:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:35:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:35:59] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:37:21] <wikibugs>	 (03CR) 10Elukey: "done!" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey)
[09:38:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4006.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:38:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:40:04] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage
[09:40:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:45:03] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage
[09:46:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:47:45] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040 (T419961)', diff saved to https://phabricator.wikimedia.org/P92365 and previous config saved to /var/cache/conftool/dbconfig/20260506-094744-fceratto.json
[09:48:49] <wikibugs>	 (03PS2) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390)
[09:49:00] <wikibugs>	 (03CR) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler)
[09:49:08] <wikibugs>	 (03PS2) 10Daniel Kinzler: rest-gateway: add anon-app ratelimit class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391)
[09:49:17] <kostajh>	 jouncebot: nowandnext
[09:49:17] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 10 minute(s)
[09:49:17] <jouncebot>	 In 0 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1000)
[09:52:53] <wikibugs>	 (03CR) 10JMeybohm: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey)
[09:55:18] <wikibugs>	 (03CR) 10Elukey: lvs: expose grpc port on ml-serve staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski)
[09:55:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4008.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:57:13] <wikibugs>	 (03CR) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey)
[09:57:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040', diff saved to https://phabricator.wikimedia.org/P92366 and previous config saved to /var/cache/conftool/dbconfig/20260506-095752-fceratto.json
[09:59:51] <logmsgbot>	 jmm@cumin2002 provision (PID 2625414) is awaiting input
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1000)
[10:01:05] <wikibugs>	 (03PS6) 10Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986)
[10:01:05] <wikibugs>	 (03PS9) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986)
[10:01:05] <wikibugs>	 (03PS1) 10Tiziano Fogli: logstash/ecs: import ecs 1.11.0-8 template file [puppet] - 10https://gerrit.wikimedia.org/r/1283683 (https://phabricator.wikimedia.org/T423986)
[10:02:45] <icinga-wm>	 PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[10:03:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli)
[10:03:50] <wikibugs>	 (03PS2) 10Federico Ceratto: common, site, ferm: Remove dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582)
[10:05:01] <wikibugs>	 (03CR) 10Federico Ceratto: "Updated removing the yaml file." [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto)
[10:05:58] <wikibugs>	 (03CR) 10Tiziano Fogli: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli)
[10:08:01] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040', diff saved to https://phabricator.wikimedia.org/P92367 and previous config saved to /var/cache/conftool/dbconfig/20260506-100800-fceratto.json
[10:08:45] <icinga-wm>	 RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[10:09:54] <wikibugs>	 06SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528 (10elukey) 03NEW
[10:10:03] <wikibugs>	 (03PS4) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (https://phabricator.wikimedia.org/T425528)
[10:10:55] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4040.ulsfo.wmnet with OS trixie
[10:11:53] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Switch pki2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283640 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff)
[10:14:55] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:15:42] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11893589 (10ABran-WMF) Small update after another round of Pontoon testing. A few things changed while testing the patch: * Rs...
[10:16:44] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS trixie
[10:16:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[10:17:39] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS trixie
[10:18:09] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040 (T419961)', diff saved to https://phabricator.wikimedia.org/P92368 and previous config saved to /var/cache/conftool/dbconfig/20260506-101808-fceratto.json
[10:18:29] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1048.eqiad.wmnet with reason: Maintenance
[10:18:29] <wikibugs>	 (03PS10) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986)
[10:18:37] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1048 (T419961)', diff saved to https://phabricator.wikimedia.org/P92369 and previous config saved to /var/cache/conftool/dbconfig/20260506-101836-fceratto.json
[10:18:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[10:18:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[10:20:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli)
[10:22:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4006.ulsfo.wmnet']
[10:22:36] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] common, site, ferm: Remove dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto)
[10:22:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4006.ulsfo.wmnet']
[10:23:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4006.ulsfo.wmnet']
[10:23:20] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] common, site, ferm: Remove dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto)
[10:26:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[10:27:07] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[10:27:19] <wikibugs>	 (03PS1) 10Elukey: wikifunctions: use mesh for the evaluator endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193)
[10:29:14] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ganeti4006.ulsfo.wmnet']
[10:30:51] <icinga-wm>	 PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100%
[10:32:34] <wikibugs>	 (CR) Elukey: wikifunctions: use mesh for the evaluator endpoints (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[10:33:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4006.ulsfo.wmnet with OS bookworm
[10:33:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[10:33:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[10:34:37] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:35:53] <icinga-wm>	 RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.28 ms
[10:39:41] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage
[10:40:45] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[10:44:16] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage
[10:45:46] <wikibugs>	 (PS1) Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377)
[10:46:38] <wikibugs>	 (CR) CI reject: [V:-1] translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[10:48:29] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[10:53:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage
[10:54:08] <wikibugs>	 (CR) Nikerabbit: translate: add opensearch-ttmserver-test (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[10:56:01] <wikibugs>	 (PS2) Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377)
[10:58:26] <wikibugs>	 (PS3) Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377)
[10:58:53] <wikibugs>	 (CR) Atsuko: "set writable to false" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[10:59:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage
[11:00:04] <jouncebot>	 mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1100).
[11:09:03] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1217.eqiad.wmnet with reason: Reboot
[11:09:56] <wikibugs>	 (PS2) Effie Mouzeli: site.pp: add role for rdb2011 and rdb2012 [puppet] - https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261)
[11:10:36] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4042.ulsfo.wmnet with OS trixie
[11:11:23] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:11:29] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:11:29] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1029 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:11:29] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:11:29] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:11:29] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:11:29] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:11:29] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:12:29] <wikibugs>	 (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) (owner: Effie Mouzeli)
[11:12:53] <marostegui>	 federico3: ^
[11:13:24] <federico3>	 looking
[11:14:36] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS trixie
[11:18:55] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048 (T419961)', diff saved to https://phabricator.wikimedia.org/P92370 and previous config saved to /var/cache/conftool/dbconfig/20260506-111854-fceratto.json
[11:19:42] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS trixie
[11:20:16] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS trixie
[11:20:44] <wikibugs>	 (CR) Blake: [C:+1] site.pp: add role for rdb2011 and rdb2012 [puppet] - https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) (owner: Effie Mouzeli)
[11:20:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[11:21:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[11:22:03] <wikibugs>	 (CR) JMeybohm: [C:+1] confluent::kafka: introduce the super-user-client.properties for Kafka 3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1283621 (https://phabricator.wikimedia.org/T425528) (owner: Elukey)
[11:22:04] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] site.pp: add role for rdb2011 and rdb2012 [puppet] - https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) (owner: Effie Mouzeli)
[11:24:16] <logmsgbot>	 jmm@cumin2002 reimage (PID 2656424) is awaiting input
[11:25:23] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:25:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:25:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1029 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:25:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:25:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:25:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:25:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:25:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:25:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[11:29:03] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048', diff saved to https://phabricator.wikimedia.org/P92371 and previous config saved to /var/cache/conftool/dbconfig/20260506-112903-fceratto.json
[11:30:27] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host rdb2011.codfw.wmnet with OS trixie
[11:35:29] <wikibugs>	 (CR) Jforrester: wikifunctions: use mesh for the evaluator endpoints (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[11:36:46] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[11:39:11] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048', diff saved to https://phabricator.wikimedia.org/P92372 and previous config saved to /var/cache/conftool/dbconfig/20260506-113910-fceratto.json
[11:41:36] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage
[11:42:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[11:42:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4006.ulsfo.wmnet with OS bookworm
[11:42:43] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage
[11:43:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11893753 (Jclark-ctr) ps1-b2-eqiad.mgmt.eqiad.wmnet #1: Sensor: Line, AA:L3, Current Value: 12.61 A (current) Thresholds: High: 12.5 #2: Sensor: Phase, AA:L2-L3,...
[11:44:12] <wikibugs>	 (PS1) Marostegui: db1194: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1283730 (https://phabricator.wikimedia.org/T425388)
[11:44:46] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1194.eqiad.wmnet with reason: Reimage to Trixie
[11:44:48] <wikibugs>	 (CR) Marostegui: [C:+2] db1194: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1283730 (https://phabricator.wikimedia.org/T425388) (owner: Marostegui)
[11:44:48] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2011.codfw.wmnet with reason: host reimage
[11:44:51] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1194: Reimage to Trixie
[11:45:00] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2160.codfw.wmnet with reason: Reboot
[11:45:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1194: Reimage to Trixie
[11:47:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet
[11:49:16] <logmsgbot>	 marostegui@cumin1003 reimage (PID 593480) is awaiting input
[11:49:19] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048 (T419961)', diff saved to https://phabricator.wikimedia.org/P92374 and previous config saved to /var/cache/conftool/dbconfig/20260506-114919-fceratto.json
[11:50:03] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage
[11:50:37] <moritzm>	 !log installing openjdk-17 security updates
[11:50:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:01] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:52:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11893777 (Jclark-ctr) Rebalanced again focusing more between AA and AB Cords. Will monitor alerts
[11:53:04] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage
[11:56:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet
[11:57:28] <wikibugs>	 (CR) Elukey: wikifunctions: use mesh for the evaluator endpoints (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[11:57:32] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2011.codfw.wmnet with reason: host reimage
[11:58:47] <wikibugs>	 (PS1) Effie Mouzeli: idp_test: switch to rdb2011 [puppet] - https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976)
[11:59:17] <wikibugs>	 (PS2) Effie Mouzeli: (DNM) idp_test: switch to rdb2011 [puppet] - https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976)
[12:01:25] <wikibugs>	 (PS1) Ladsgroup: Close Polish Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283735 (https://phabricator.wikimedia.org/T421796)
[12:01:59] <wikibugs>	 (CR) Jforrester: [C:+1] Remove unused 'writeapi' right [mediawiki-config] - https://gerrit.wikimedia.org/r/1283106 (owner: Bartosz Dziewoński)
[12:02:51] <Amir1>	 jouncebot: nowandnext
[12:02:51] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[12:02:51] <jouncebot>	 In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1300)
[12:02:53] <wikibugs>	 (PS2) Blake: (DNM) idp_test: switch to rdb2011 [puppet] - https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[12:03:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, observability, Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11893806 (Jclark-ctr) @elukey I am unable to connect to kafka-logging1007 looks like it might not be setting bmc to static
[12:03:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283735 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[12:05:00] <wikibugs>	 (Merged) jenkins-bot: Close Polish Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283735 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[12:05:15] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:05:22] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283735|Close Polish Wikinews (T421796)]]
[12:05:25] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[12:05:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 543090976 and 46 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:05:30] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host rdb2012.codfw.wmnet with OS trixie
[12:06:27] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:07:15] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283735|Close Polish Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:07:40] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[12:07:46] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:09:17] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m3 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3609.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:09:41] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[12:11:50] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283735|Close Polish Wikinews (T421796)]] (duration: 06m 28s)
[12:11:53] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[12:11:59] <wikibugs>	 (CR) Jelto: [C:+2] miscweb: add emptyDir to wmf-navigator [deployment-charts] - https://gerrit.wikimedia.org/r/1283627 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[12:14:31] <wikibugs>	 (Merged) jenkins-bot: miscweb: add emptyDir to wmf-navigator [deployment-charts] - https://gerrit.wikimedia.org/r/1283627 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[12:14:44] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.34 ms
[12:14:50] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb2011.codfw.wmnet with OS trixie
[12:15:50] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[12:16:20] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[12:16:39] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4050.ulsfo.wmnet with OS trixie
[12:19:21] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:19:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:20:20] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4048.ulsfo.wmnet with OS trixie
[12:20:52] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS trixie
[12:21:12] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2012.codfw.wmnet with reason: host reimage
[12:22:40] <wikibugs>	 (CR) Slyngshede: [C:+2] Hieradata: Update IPs for ULSFO lvs hosts in rack 23 [puppet] - https://gerrit.wikimedia.org/r/1283545 (https://phabricator.wikimedia.org/T424686) (owner: Slyngshede)
[12:24:46] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2012.codfw.wmnet with reason: host reimage
[12:26:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 790887976 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:27:23] <wikibugs>	 (CR) Daniel Kinzler: [C:-1] rest gateway: defined anon-mediawiki class (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: Daniel Kinzler)
[12:28:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3532624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:15] <wikibugs>	 (PS3) Effie Mouzeli: idp_test: switch to rdb2011 [puppet] - https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976)
[12:33:05] <wikibugs>	 (PS1) Jelto: miscweb: allow egress to text-lb for wmf-navigator [deployment-charts] - https://gerrit.wikimedia.org/r/1283743 (https://phabricator.wikimedia.org/T414405)
[12:34:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] d-i: Remove dhcpcd-base after installation completed [puppet] - https://gerrit.wikimedia.org/r/1280082 (https://phabricator.wikimedia.org/T414341) (owner: Muehlenhoff)
[12:35:05] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
[12:36:38] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: update
[12:37:33] <wikibugs>	 (CR) Jelto: [C:+2] miscweb: allow egress to text-lb for wmf-navigator [deployment-charts] - https://gerrit.wikimedia.org/r/1283743 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[12:38:41] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage
[12:39:06] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[12:40:02] <wikibugs>	 (Merged) jenkins-bot: miscweb: allow egress to text-lb for wmf-navigator [deployment-charts] - https://gerrit.wikimedia.org/r/1283743 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[12:42:06] <wikibugs>	 (PS1) Dpogorzelski: 1/3 ml-serve(grpc): etcd data for DNS Discovery [puppet] - https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049)
[12:42:08] <wikibugs>	 (PS1) Dpogorzelski: 2/3 ml-serve(grpc): add entry to service catalog [puppet] - https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049)
[12:42:10] <wikibugs>	 (PS1) Dpogorzelski: 3/3 ml-serve(grpc): add service to k8s pools [puppet] - https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049)
[12:42:31] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] CoreRouterInterfaceDropPercent: fix ping disable (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1282099 (owner: Ayounsi)
[12:42:37] <wikibugs>	 (CR) CI reject: [V:-1] 1/3 ml-serve(grpc): etcd data for DNS Discovery [puppet] - https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049) (owner: Dpogorzelski)
[12:42:42] <wikibugs>	 (Abandoned) Dpogorzelski: lvs: expose grpc port on ml-serve staging [puppet] - https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) (owner: Dpogorzelski)
[12:42:58] <wikibugs>	 (CR) CI reject: [V:-1] 2/3 ml-serve(grpc): add entry to service catalog [puppet] - https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: Dpogorzelski)
[12:43:18] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb2012.codfw.wmnet with OS trixie
[12:43:22] <wikibugs>	 (CR) CI reject: [V:-1] 3/3 ml-serve(grpc): add service to k8s pools [puppet] - https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) (owner: Dpogorzelski)
[12:43:42] <logmsgbot>	 jclark@cumin1003 provision (PID 639319) is awaiting input
[12:45:12] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[12:45:34] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m3 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3463.97 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:45:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11893942 (Jgreen)
[12:47:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11893955 (Jgreen)
[12:49:59] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS trixie
[12:50:09] <wikibugs>	 (PS2) Dpogorzelski: ml-serve(grpc): step 1, etcd data for DNS Discovery [puppet] - https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049)
[12:50:09] <wikibugs>	 (PS2) Dpogorzelski: ml-serve(grpc): step 2, add entry to service catalog [puppet] - https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049)
[12:50:09] <wikibugs>	 (PS2) Dpogorzelski: ml-serve(grpc): step 3, add service to k8s pools [puppet] - https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049)
[12:51:16] <wikibugs>	 (PS1) Marostegui: Revert "db1194: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1283748
[12:52:12] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1194: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1283748 (owner: Marostegui)
[12:53:34] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m3 on db2160 is OK: OK slave_sql_lag Replication lag: 53.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:57:39] <jinxer-wm>	 FIRING: [4x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[12:57:44] <wikibugs>	 (PS1) Ladsgroup: Close French Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283751 (https://phabricator.wikimedia.org/T421796)
[13:00:05] <jouncebot>	 Urbanecm and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1300).
[13:00:05] <jouncebot>	 alexsanford: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:47] <alexsanford>	 Hey! I'll go ahead with the backport for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1283028
[13:00:49] <wikibugs>	 (PS1) Jelto: miscweb: bump wmf-navigator image version [deployment-charts] - https://gerrit.wikimedia.org/r/1283753 (https://phabricator.wikimedia.org/T414405)
[13:01:16] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS trixie
[13:01:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by alexsanford@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1283028 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[13:03:22] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m3 on db1217 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:04:36] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[13:05:29] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[13:05:51] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[13:05:59] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1194: after reimage to trixie
[13:08:28] <wikibugs>	 (CR) Jelto: [C:+2] miscweb: bump wmf-navigator image version [deployment-charts] - https://gerrit.wikimedia.org/r/1283753 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[13:08:32] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[13:08:35] <wikibugs>	 (PS1) Gkyziridis: changeprop: Configure all wikis for revertrisk-multilingual events. [deployment-charts] - https://gerrit.wikimedia.org/r/1283758 (https://phabricator.wikimedia.org/T415892)
[13:10:53] <wikibugs>	 (Merged) jenkins-bot: miscweb: bump wmf-navigator image version [deployment-charts] - https://gerrit.wikimedia.org/r/1283753 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[13:11:27] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4008.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:12:40] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS trixie
[13:12:54] <wikibugs>	 (Merged) jenkins-bot: Add messages related to mandatory 2FA for more groups [extensions/WikimediaMessages] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1283028 (https://phabricator.wikimedia.org/T423119) (owner: Alex.sanford)
[13:13:20] <logmsgbot>	 !log alexsanford@deploy1003 Started scap sync-world: Backport for [[gerrit:1283028|Add messages related to mandatory 2FA for more groups (T423119)]]
[13:13:23] <stashbot>	 T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119
[13:14:14] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[13:14:15] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[13:15:14] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[13:18:53] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for ganeti4008 mgmt - cmooney@cumin1003"
[13:18:59] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for ganeti4008 mgmt - cmooney@cumin1003"
[13:18:59] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:19:16] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ganeti4008.mgmt.ulsfo.wmnet on all recursors
[13:19:24] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS trixie
[13:19:36] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) ganeti4008.mgmt.ulsfo.wmnet on all recursors
[13:20:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4008.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:21:57] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:24:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[13:24:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4008.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:25:45] <wikibugs>	 (CR) Bking: "Thanks for the tip! I'll take a look and see if that could help us out." [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: Bking)
[13:26:37] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[13:26:46] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:27:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:27:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet']
[13:27:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet']
[13:27:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet']
[13:28:09] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet']
[13:28:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet']
[13:28:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet']
[13:30:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[13:31:07] <logmsgbot>	 !log alexsanford@deploy1003 alexsanford: Backport for [[gerrit:1283028|Add messages related to mandatory 2FA for more groups (T423119)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:31:10] <stashbot>	 T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119
[13:31:22] <wikibugs>	 (CR) Bking: "Note that the new OpenSearch cluster is password-protected (unlike the current cluster hosting the ttmserver indices)." [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[13:31:32] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[13:32:00] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:32:49] <logmsgbot>	 !log alexsanford@deploy1003 alexsanford: Continuing with deployment
[13:34:21] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:34:41] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ganeti4008.ulsfo.wmnet on all recursors
[13:35:01] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) ganeti4008.ulsfo.wmnet on all recursors
[13:35:56] <wikibugs>	 (PS1) Jgreen: Add frdata-new-eqiad.wikimedia.org and PTR for 208.80.155.12 [dns] - https://gerrit.wikimedia.org/r/1283769 (https://phabricator.wikimedia.org/T425539)
[13:35:59] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:36:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4008.ulsfo.wmnet with OS bookworm
[13:37:29] <wikibugs>	 (PS3) Daniel Kinzler: rest gateway: defined anon-mediawiki class [deployment-charts] - https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390)
[13:37:38] <wikibugs>	 (CR) Daniel Kinzler: rest gateway: defined anon-mediawiki class (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: Daniel Kinzler)
[13:38:02] <wikibugs>	 (Abandoned) Jgreen: nsca_frack_cfg.erb remove frqueue2002 and add frqueue2004 [puppet] - https://gerrit.wikimedia.org/r/1243894 (https://phabricator.wikimedia.org/T418393) (owner: Jgreen)
[13:38:31] <wikibugs>	 (Abandoned) Jgreen: nsca_frack.cfg.erb create hostgroup fundraising-minio adding check-minio [puppet] - https://gerrit.wikimedia.org/r/1186566 (https://phabricator.wikimedia.org/T386259) (owner: Jgreen)
[13:39:25] <wikibugs>	 (CR) Jgreen: [C:+2] Add frdata-new-eqiad.wikimedia.org and PTR for 208.80.155.12 [dns] - https://gerrit.wikimedia.org/r/1283769 (https://phabricator.wikimedia.org/T425539) (owner: Jgreen)
[13:39:31] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[13:41:21] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage
[13:42:39] <jinxer-wm>	 FIRING: [4x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:44:02] <logmsgbot>	 !log jgreen@dns1004 START - running authdns-update
[13:44:14] <logmsgbot>	 !log alexsanford@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283028|Add messages related to mandatory 2FA for more groups (T423119)]] (duration: 30m 53s)
[13:44:16] <stashbot>	 T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119
[13:45:15] <alexsanford>	 Done (backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1283028)
[13:45:25] <logmsgbot>	 !log jgreen@dns1004 END - running authdns-update
[13:45:29] <wikibugs>	 (PS1) Hashar: zuul: rename web_port > finger_port and set it in conf [puppet] - https://gerrit.wikimedia.org/r/1283768
[13:45:29] <wikibugs>	 (CR) Hashar: "https://puppet-compiler.wmflabs.org/output/1283768/6674/zuul2002.codfw.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[13:45:32] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage
[13:47:45] <wikibugs>	 (CR) Danielyepezgarces: [C:+1] Enabling RSS extension for cowikimedia chapter [mediawiki-config] - https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: Danielyepezgarces)
[13:49:18] <wikibugs>	 (CR) Tchanders: [C:+1] Add user_groups to editAttemptStep schema [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) (owner: Kosta Harlan)
[13:50:01] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283751 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[13:50:55] <wikibugs>	 (Merged) jenkins-bot: Close French Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283751 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[13:51:22] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283751|Close French Wikinews (T421796)]]
[13:51:23] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1194: after reimage to trixie
[13:51:25] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[13:53:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: restart to test fixes from T425301 - bking@cumin2002
[13:53:09] <stashbot>	 T425301: The cloudelastic chi cluster is red - https://phabricator.wikimedia.org/T425301
[13:53:46] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:55:11] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS trixie
[13:55:43] <kostajh>	 jouncebot: nowandnext
[13:55:43] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1300)
[13:55:43] <jouncebot>	 In 0 hour(s) and 4 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1400)
[13:55:44] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283751|Close French Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:55:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) (owner: Kosta Harlan)
[13:56:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage
[13:56:13] <Amir1>	 kostajh: I'm closing two wikis
[13:56:23] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[13:56:27] <kostajh>	 Amir1: cool, can you let me know when you’re done?
[13:56:32] <Amir1>	 sure
[13:58:16] <wikibugs>	 (PS1) Ladsgroup: Close German Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283783 (https://phabricator.wikimedia.org/T421796)
[13:59:21] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:00:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1400)
[14:01:12] <wikibugs>	 (PS1) Jelto: miscweb: make sure wmf-navigator entrypoint executes sync script [deployment-charts] - https://gerrit.wikimedia.org/r/1283784 (https://phabricator.wikimedia.org/T414405)
[14:02:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum5001.eqsin.wmnet
[14:02:50] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283751|Close French Wikinews (T421796)]] (duration: 11m 28s)
[14:02:53] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[14:03:34] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516#11894362 (Papaul) @Jhancock.wm when you have time can you please look and see if there are any bad disks on this server? Thanks
[14:04:43] <wikibugs>	 (CR) Jelto: [C:+2] miscweb: make sure wmf-navigator entrypoint executes sync script [deployment-charts] - https://gerrit.wikimedia.org/r/1283784 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[14:04:47] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2026-04-29-001940 to 2026-05-05-223522 [deployment-charts] - https://gerrit.wikimedia.org/r/1283785 (https://phabricator.wikimedia.org/T414062)
[14:06:13] <wikibugs>	 (CR) David Martin: [C:+2] wikifunctions: Upgrade orchestrator from 2026-04-29-001940 to 2026-05-05-223522 [deployment-charts] - https://gerrit.wikimedia.org/r/1283785 (https://phabricator.wikimedia.org/T414062) (owner: Jforrester)
[14:06:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283783 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[14:06:58] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:06:58] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:06:58] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:06:58] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:06:58] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:06:58] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:07:11] <wikibugs>	 (Merged) jenkins-bot: miscweb: make sure wmf-navigator entrypoint executes sync script [deployment-charts] - https://gerrit.wikimedia.org/r/1283784 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[14:07:58] <wikibugs>	 (Merged) jenkins-bot: Close German Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283783 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[14:08:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:08:23] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283783|Close German Wikinews (T421796)]]
[14:08:26] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[14:08:32] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-04-29-001940 to 2026-05-05-223522 [deployment-charts] - https://gerrit.wikimedia.org/r/1283785 (https://phabricator.wikimedia.org/T414062) (owner: Jforrester)
[14:08:33] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[14:09:06] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2026-04-21-184122 to 2026-05-05-223640 [deployment-charts] - https://gerrit.wikimedia.org/r/1283786 (https://phabricator.wikimedia.org/T414062)
[14:09:16] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[14:09:59] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:10:17] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283783|Close German Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:10:38] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:10:40] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:10:53] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[14:11:24] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4046.ulsfo.wmnet with OS trixie
[14:11:49] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:12:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:12:45] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:13:07] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:13:53] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:14:49] <wikibugs>	 (CR) Slyngshede: [C:+1] idp_test: switch to rdb2011 [puppet] - https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976) (owner: Effie Mouzeli)
[14:15:04] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283783|Close German Wikinews (T421796)]] (duration: 06m 40s)
[14:15:07] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[14:15:11] <Amir1>	 kostajh: over to you
[14:15:18] <wikibugs>	 (CR) David Martin: [C:+2] wikifunctions: Upgrade evaluators from 2026-04-21-184122 to 2026-05-05-223640 [deployment-charts] - https://gerrit.wikimedia.org/r/1283786 (https://phabricator.wikimedia.org/T414062) (owner: Jforrester)
[14:15:18] <kostajh>	 Amir1: thanks
[14:15:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:15:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:15:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum5001.eqsin.wmnet
[14:15:36] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11894493 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum5001.eqsin.wmnet` - durum5001.eqsin.wmnet (**PASS**)...
[14:15:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) (owner: Kosta Harlan)
[14:16:22] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] rest-gateway: generalize class overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) (owner: Daniel Kinzler)
[14:16:51] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] rest gateway: remove redundant bearerPayload case [deployment-charts] - https://gerrit.wikimedia.org/r/1277703 (owner: Daniel Kinzler)
[14:16:54] <wikibugs>	 (PS1) Joal: [WIP] Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791
[14:17:25] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2026-04-21-184122 to 2026-05-05-223640 [deployment-charts] - https://gerrit.wikimedia.org/r/1283786 (https://phabricator.wikimedia.org/T414062) (owner: Jforrester)
[14:17:30] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] rest gateway: defined anon-mediawiki class [deployment-charts] - https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: Daniel Kinzler)
[14:18:32] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:18:48] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] Add auth_proxy.httpd_cas module [deployment-charts] - https://gerrit.wikimedia.org/r/1283791 (owner: Joal)
[14:19:08] <wikibugs>	 (Merged) jenkins-bot: Add user_groups to editAttemptStep schema [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) (owner: Kosta Harlan)
[14:19:29] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:19:35] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1283050|Add user_groups to editAttemptStep schema (T424010)]]
[14:19:38] <stashbot>	 T424010: Collect performer implicit groups in editattemptstep for hCaptcha rollout - https://phabricator.wikimedia.org/T424010
[14:19:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[14:20:13] <wikibugs>	 (CR) Ayounsi: [C:+1] "Those rules are getting out of hands but let's give it a try !" [alerts] - https://gerrit.wikimedia.org/r/1282099 (owner: Ayounsi)
[14:20:19] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:20:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[14:20:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[14:20:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4008.ulsfo.wmnet with OS bookworm
[14:21:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum5002.eqsin.wmnet
[14:21:27] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1283050|Add user_groups to editAttemptStep schema (T424010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:21:30] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:21:54] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:22:43] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:23:08] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[14:23:11] <wikibugs>	 (CR) Ssingh: [C:+1] Hieradata: Update IPs for ULSFO dns host in rack 23 [puppet] - https://gerrit.wikimedia.org/r/1283548 (https://phabricator.wikimedia.org/T424686) (owner: Slyngshede)
[14:24:29] <logmsgbot>	 jmm@cumin2002 decommission (PID 2811622) is awaiting input
[14:24:53] <wikibugs>	 (PS1) Jelto: miscweb: make sure custom sidecar also has config.private environment [deployment-charts] - https://gerrit.wikimedia.org/r/1283794 (https://phabricator.wikimedia.org/T414405)
[14:25:28] <sukhe>	 !log sudo cumin "C:bird" "disable-puppet 'merging CR 1282958'"
[14:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:26] <icinga-wm>	 PROBLEM - Host cp4052 is DOWN: PING CRITICAL - Packet loss = 100%
[14:26:44] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with deployment
[14:27:07] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[14:28:26] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[14:29:26] <wikibugs>	 (PS3) Cwhite: add query object [software/ecs] - https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986)
[14:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1400)
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1430)
[14:30:49] <wikibugs>	 (CR) Jelto: [C:+2] miscweb: make sure custom sidecar also has config.private environment [deployment-charts] - https://gerrit.wikimedia.org/r/1283794 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[14:30:51] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283050|Add user_groups to editAttemptStep schema (T424010)]] (duration: 11m 16s)
[14:30:54] <stashbot>	 T424010: Collect performer implicit groups in editattemptstep for hCaptcha rollout - https://phabricator.wikimedia.org/T424010
[14:31:01] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: testing bird change]
[14:31:09] <wikibugs>	 (CR) Ssingh: [C:+2] Bird: use the GUA v6 gateway instead of link-local [puppet] - https://gerrit.wikimedia.org/r/1282958 (owner: Ayounsi)
[14:31:28] <icinga-wm>	 RECOVERY - Host cp4052 is UP: PING OK - Packet loss = 0%, RTA = 71.15 ms
[14:31:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11894622 (jijiki) >>! In T418916#11859808, @jijiki wrote: >>>! In T418916#11859537, @MLechvien-WMF wrote: >>> @Clement_Goubert i am having issues with these failing t...
[14:31:34] <wikibugs>	 SRE, Data-Engineering, Observability-Logging, Wikimedia-Logstash, and 2 others: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#11894623 (BTullis) a:BTullis Assigning this to myself, s...
[14:32:09] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11894633 (jijiki) >>! In T418922#11869335, @Jclark-ctr wrote: > @jhancock.wm eqiad servers failed install also.  @jijiki when you make change can...
[14:33:19] <wikibugs>	 (Merged) jenkins-bot: miscweb: make sure custom sidecar also has config.private environment [deployment-charts] - https://gerrit.wikimedia.org/r/1283794 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[14:33:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:34:21] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:34:47] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: testing bird change]
[14:35:37] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:36:13] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[14:36:49] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[14:37:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:37:36] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11894685 (MoritzMuehlenhoff)
[14:40:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:40:40] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:41:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:41:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:42:54] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11894701 (MoritzMuehlenhoff)
[14:43:14] <wikibugs>	 (PS1) Jelto: misweb: also mount secrets in wmf-navigator data-sync sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/1283804 (https://phabricator.wikimedia.org/T414405)
[14:45:12] <wikibugs>	 (CR) Slyngshede: [C:+2] Hieradata: Update IPs for ULSFO dns host in rack 23 [puppet] - https://gerrit.wikimedia.org/r/1283548 (https://phabricator.wikimedia.org/T424686) (owner: Slyngshede)
[14:45:25] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:46:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:46:59] <wikibugs>	 (CR) Jelto: [C:+2] misweb: also mount secrets in wmf-navigator data-sync sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/1283804 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[14:47:18] <logmsgbot>	 jmm@cumin2002 decommission (PID 2811622) is awaiting input
[14:48:06] <wikibugs>	 (PS1) Ladsgroup: Close Chinese Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283805 (https://phabricator.wikimedia.org/T421796)
[14:49:39] <wikibugs>	 (Merged) jenkins-bot: misweb: also mount secrets in wmf-navigator data-sync sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/1283804 (https://phabricator.wikimedia.org/T414405) (owner: Jelto)
[14:50:42] <Amir1>	 jouncebot: nowandnext
[14:50:42] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1400)
[14:50:42] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1430)
[14:50:43] <jouncebot>	 In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1700)
[14:51:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:51:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:53:21] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:53:45] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS trixie
[14:54:30] <logmsgbot>	 jmm@cumin2002 decommission (PID 2811622) is awaiting input
[14:54:39] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:55:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:55:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:55:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum5002.eqsin.wmnet
[14:55:20] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11894788 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum5002.eqsin.wmnet` - durum5002.eqsin.wmnet (**PASS**)...
[14:55:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1121233768 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:57:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283805 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[14:57:40] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516#11894823 (Jhancock.wm) Open→Resolved a:Jhancock.wm HWC8010 this is usually caused by a firmware update that didn't take for whatever reason. F2 > iDRAC Settings > Hardware Configurati...
[14:58:07] <wikibugs>	 (CR) Bking: "I took a look at what that would involve and I think I will avoid it for now since we need the alerts for a major upcoming migration ( T4" [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: Bking)
[14:58:21] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:58:43] <wikibugs>	 (PS2) Bking: data-platform: Add alerts for cirrus memory or I/O stalls [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T425301)
[14:59:02] <wikibugs>	 (Merged) jenkins-bot: Close Chinese Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283805 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[14:59:25] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283805|Close Chinese Wikinews (T421796)]]
[14:59:28] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[14:59:53] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7433008 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:01:26] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283805|Close Chinese Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:01:52] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[15:01:56] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2157 - https://phabricator.wikimedia.org/T425242#11894855 (Jhancock.wm) Open→Resolved
[15:02:26] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[15:02:52] <logmsgbot>	 !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[15:04:00] <wikibugs>	 SRE-SLO: Sloth dashboard performance improvement - https://phabricator.wikimedia.org/T425564 (herron) NEW
[15:04:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11894874 (phaultfinder)
[15:06:06] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283805|Close Chinese Wikinews (T421796)]] (duration: 06m 41s)
[15:06:11] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[15:08:32] <logmsgbot>	 !log jasmine@cumin2002 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-main-eqiad cluster: Change Confluent distribution.
[15:08:39] <sukhe>	 !log sudo cumin -b1 -s5 "C:bird and not dns4004*" "run-puppet-agent --enable 'merging CR 1282958'"
[15:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:10] <wikibugs>	 (PS1) Cwhite: logstash: filter_on_templates - handle unknown data types [puppet] - https://gerrit.wikimedia.org/r/1283810
[15:09:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:28] <wikibugs>	 (CR) Jasmine: [C:+2] kafka-main: add eqiad-main cluster brokers to Confluent distro 77 (3.7) [puppet] - https://gerrit.wikimedia.org/r/1282999 (https://phabricator.wikimedia.org/T419216) (owner: Jasmine)
[15:11:46] <logmsgbot>	 jasmine@cumin2002 change-confluent-distro-version (PID 2842433) is awaiting input
[15:13:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565 (catherine.kelsey.wmde) NEW
[15:13:56] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566 (catherine.kelsey.wmde) NEW
[15:14:13] <wikibugs>	 (CR) Gehel: [C:-1] data-platform: Add alerts for cirrus memory or I/O stalls (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T425301) (owner: Bking)
[15:14:45] <wikibugs>	 (PS1) Cwhite: add unknown_data_type to normalized object [software/ecs] - https://gerrit.wikimedia.org/r/1283814
[15:14:55] <icinga-wm>	 PROBLEM - Check unit status of ipip-multiqueue-optimizer on lvs4008 is CRITICAL: CRITICAL: Status of the systemd unit ipip-multiqueue-optimizer https://wikitech.wikimedia.org/wiki/LVS%23IPIP_encapsulation_experiments
[15:15:29] <wikibugs>	 (PS2) Cwhite: logstash: filter_on_templates - handle unknown data types [puppet] - https://gerrit.wikimedia.org/r/1283810
[15:18:07] <wikibugs>	 (PS4) Cwhite: add query object [software/ecs] - https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986)
[15:18:35] <wikibugs>	 (PS5) Cwhite: add query object [software/ecs] - https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986)
[15:19:22] <wikibugs>	 (CR) Gehel: [C:-1] "Minor comment on naming. I don't know enough about the metrics themselves to validate that part." [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T425301) (owner: Bking)
[15:25:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[15:30:45] <logmsgbot>	 !log jasmine@cumin2002 END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Change Confluent distribution for Kafka A:kafka-main-eqiad cluster: Change Confluent distribution.
[15:30:47] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 225.03 ms
[15:34:07] <wikibugs>	 (PS3) Bking: data-platform: Add alerts for cirrus memory or I/O stalls [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852)
[15:35:11] <wikibugs>	 (CR) Bking: data-platform: Add alerts for cirrus memory or I/O stalls (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: Bking)
[15:35:26] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS bookworm
[15:36:46] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[15:37:09] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:38:22] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bookworm
[15:43:46] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:46:37] <icinga-wm>	 PROBLEM - Check unit status of ipip-multiqueue-optimizer on lvs4010 is CRITICAL: CRITICAL: Status of the systemd unit ipip-multiqueue-optimizer https://wikitech.wikimedia.org/wiki/LVS%23IPIP_encapsulation_experiments
[15:47:13] <wikibugs>	 (PS1) Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - https://gerrit.wikimedia.org/r/1283171
[15:50:16] <wikibugs>	 (PS2) Elukey: wikifunctions: use mesh for the evaluator endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193)
[15:50:24] <wikibugs>	 (CR) Elukey: wikifunctions: use mesh for the evaluator endpoints (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[15:54:21] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:55:22] <wikibugs>	 (PS1) Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512)
[15:56:50] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[15:57:25] <wikibugs>	 (CR) Jforrester: wikifunctions: use mesh for the evaluator endpoints (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[15:57:27] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[15:57:39] <wikibugs>	 (CR) Jforrester: [C:+1] "LGTM. Is this good to deploy?" [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[15:58:33] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[16:01:33] <wikibugs>	 (CR) CDanis: profile::cache::haproxy: add webrequest-based ip reputation data (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[16:01:44] <wikibugs>	 (CR) Elukey: "It is yes, but we should be careful and test it in staging first, then on one DC etc.. I can help/assist when we do it! I tested the vario" [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[16:03:00] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 221.97 ms
[16:04:51] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[16:06:12] <wikibugs>	 (CR) Cwhite: [C:+2] update pyyaml in dev [software/ecs] - https://gerrit.wikimedia.org/r/1280733 (owner: Cwhite)
[16:06:40] <wikibugs>	 (Merged) jenkins-bot: update pyyaml in dev [software/ecs] - https://gerrit.wikimedia.org/r/1280733 (owner: Cwhite)
[16:07:36] <icinga-wm>	 PROBLEM - Recursive DNS on 198.35.26.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[16:07:45] <sukhe>	 yeah no worries
[16:09:03] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage
[16:09:04] <wikibugs>	 SRE, collaboration-services, Observability-Alerting, SRE Observability (FY2025/2026-Q1): create a new place for prometheus/alertmanager checks not tied to physical machines - https://phabricator.wikimedia.org/T397264#11895315 (Dzahn) please also see T361090#11804366
[16:09:21] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:12:36] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:863:2:198:35:26:34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[16:21:09] <wikibugs>	 (PS4) Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377)
[16:22:21] <wikibugs>	 (CR) CI reject: [V:-1] translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[16:23:46] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:25:39] <wikibugs>	 (PS2) Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512)
[16:27:00] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:27:18] <wikibugs>	 (CR) Elukey: profile::cache::haproxy: add webrequest-based ip reputation data (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[16:27:48] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[16:28:16] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:28:44] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:28:47] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS bookworm
[16:28:56] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:29:07] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:29:21] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:29:42] <wikibugs>	 (CR) AKhatun: [C:+1] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1282300 (https://phabricator.wikimedia.org/T425336) (owner: JavierMonton)
[16:29:48] <wikibugs>	 (PS3) Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512)
[16:30:00] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[16:30:25] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1282300 (https://phabricator.wikimedia.org/T425336) (owner: JavierMonton)
[16:30:34] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:863:2:198:35:26:34 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[16:30:34] <icinga-wm>	 RECOVERY - Recursive DNS on 198.35.26.34 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[16:31:13] <wikibugs>	 (CR) Dzahn: "it would break puppet on the main nodes (fine on executor nodes) because the base class is applied on both. don't worry though, I will ame" [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[16:32:29] <wikibugs>	 (Merged) jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1282300 (https://phabricator.wikimedia.org/T425336) (owner: JavierMonton)
[16:33:46] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:34:21] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:37:05] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4004.wikimedia.org with OS bookworm
[16:37:16] <wikibugs>	 (CR) RLazarus: [C:+1] shellbox: Pick up newly rebuilt images (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283171 (owner: Scott French)
[16:39:28] <wikibugs>	 (CR) RLazarus: [C:+1] shellbox: Pick up newly rebuilt images (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283171 (owner: Scott French)
[16:39:36] <wikibugs>	 SRE, SRE-Access-Requests: logstash access - https://phabricator.wikimedia.org/T425575 (KineticPelagic) NEW
[16:39:55] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:39:59] <wikibugs>	 (PS5) Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377)
[16:40:08] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS trixie
[16:40:47] <wikibugs>	 (CR) CI reject: [V:-1] translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[16:45:04] <wikibugs>	 (PS1) Andrew Bogott: magnum.conf: set nova-api compatibility to v2.15 [puppet] - https://gerrit.wikimedia.org/r/1283834 (https://phabricator.wikimedia.org/T393782)
[16:45:15] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:46:48] <wikibugs>	 (PS1) Elukey: profile::kafka::mirror::alerts: fix max lag's group label [puppet] - https://gerrit.wikimedia.org/r/1283837
[16:46:53] <wikibugs>	 (CR) Cathal Mooney: CoreRouterInterfaceDropPercent: fix ping disable (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1282099 (owner: Ayounsi)
[16:46:56] <wikibugs>	 (CR) Cathal Mooney: [C:+2] CoreRouterInterfaceDropPercent: fix ping disable [alerts] - https://gerrit.wikimedia.org/r/1282099 (owner: Ayounsi)
[16:48:07] <wikibugs>	 (PS6) Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377)
[16:48:30] <wikibugs>	 (CR) Cathal Mooney: [C:+2] "indeed yeah it's quite cumbersome just to make the linter happy." [alerts] - https://gerrit.wikimedia.org/r/1282099 (owner: Ayounsi)
[16:48:35] <wikibugs>	 (Merged) jenkins-bot: CoreRouterInterfaceDropPercent: fix ping disable [alerts] - https://gerrit.wikimedia.org/r/1282099 (owner: Ayounsi)
[16:49:42] <wikibugs>	 (CR) CI reject: [V:-1] translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[16:52:37] <topranks>	 !log rebooting asw1-22-ulsfo to upgrade SR-Linux OS on switch T408892
[16:52:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:40] <stashbot>	 T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892
[16:53:03] <wikibugs>	 SRE, SRE-Access-Requests: logstash access - https://phabricator.wikimedia.org/T425575#11895517 (Aklapper) Hi @KineticPelagic, https://wikitech.wikimedia.org/wiki/Logstash#Authentication implies that you could request access to Logstash through the IDM tool. If that works for you, then please feel free to...
[16:53:22] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-22-ulsfo,asw1-22-ulsfo IPv6 with reason: upgrading sr-linux on asw1-23-ulsfo
[16:58:20] <wikibugs>	 (CR) JMeybohm: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1283837 (owner: Elukey)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1700)
[17:00:06] <icinga-wm>	 PROBLEM - Host durum4004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:10] <icinga-wm>	 PROBLEM - Host doh4003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:10] <icinga-wm>	 PROBLEM - Host doh4004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:13] <swfrench-wmf>	 o/
[17:00:14] <icinga-wm>	 PROBLEM - Host bast4006 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:16] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy4004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:16] <icinga-wm>	 PROBLEM - Host hcaptcha-proxy4003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:35] <sukhe>	 "expected"
[17:00:40] <icinga-wm>	 PROBLEM - Host ncredir4004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:40] <icinga-wm>	 PROBLEM - Host install4004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:46] <icinga-wm>	 PROBLEM - Host durum4003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:46] <icinga-wm>	 PROBLEM - Host netflow4003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:46] <icinga-wm>	 PROBLEM - Host prometheus4003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:48] <icinga-wm>	 PROBLEM - Host tcp-proxy4004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:48] <icinga-wm>	 PROBLEM - Host tcp-proxy4003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:50] <icinga-wm>	 PROBLEM - Host ncredir4003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:00:53] <sukhe>	 we should probably extend the downtime
[17:01:14] <icinga-wm>	 RECOVERY - Host doh4003 is UP: PING OK - Packet loss = 0%, RTA = 71.57 ms
[17:01:14] <icinga-wm>	 RECOVERY - Host durum4003 is UP: PING OK - Packet loss = 0%, RTA = 71.53 ms
[17:01:14] <icinga-wm>	 RECOVERY - Host install4004 is UP: PING OK - Packet loss = 0%, RTA = 71.54 ms
[17:01:14] <icinga-wm>	 RECOVERY - Host prometheus4003 is UP: PING OK - Packet loss = 0%, RTA = 74.04 ms
[17:01:16] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy4004 is UP: PING OK - Packet loss = 0%, RTA = 71.55 ms
[17:01:16] <icinga-wm>	 RECOVERY - Host tcp-proxy4003 is UP: PING OK - Packet loss = 0%, RTA = 71.49 ms
[17:01:16] <icinga-wm>	 RECOVERY - Host tcp-proxy4004 is UP: PING OK - Packet loss = 0%, RTA = 71.35 ms
[17:01:24] <icinga-wm>	 RECOVERY - Host doh4004 is UP: PING WARNING - Packet loss = 90%, RTA = 73.88 ms
[17:01:25] <wikibugs>	 (CR) Scott French: [C:+2] shellbox: Pick up newly rebuilt images (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1283171 (owner: Scott French)
[17:01:26] <icinga-wm>	 RECOVERY - Host ncredir4004 is UP: PING OK - Packet loss = 0%, RTA = 75.95 ms
[17:01:26] <icinga-wm>	 RECOVERY - Host ncredir4003 is UP: PING OK - Packet loss = 0%, RTA = 71.26 ms
[17:01:34] <icinga-wm>	 RECOVERY - Host durum4004 is UP: PING OK - Packet loss = 0%, RTA = 71.72 ms
[17:01:42] <icinga-wm>	 RECOVERY - Host bast4006 is UP: PING OK - Packet loss = 0%, RTA = 71.52 ms
[17:01:48] <icinga-wm>	 RECOVERY - Host hcaptcha-proxy4003 is UP: PING OK - Packet loss = 0%, RTA = 71.56 ms
[17:02:12] <jinxer-wm>	 RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[17:02:14] <icinga-wm>	 RECOVERY - Host netflow4003 is UP: PING OK - Packet loss = 0%, RTA = 71.47 ms
[17:02:15] <logmsgbot>	 !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 39 hosts with reason: ulsfo depooled for switch work
[17:02:28] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11895570 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a4b7dc3f-da06-4cb4-8580-9dac41f4da23) set by sukhe@cumin1003 for 3 days, 0:00:00...
[17:03:49] <wikibugs>	 (Merged) jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - https://gerrit.wikimedia.org/r/1283171 (owner: Scott French)
[17:04:25] <swfrench-wmf>	 FYI, I'll be deploying shellbox services shortly
[17:04:36] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[17:05:44] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:06:25] <wikibugs>	 (PS7) Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377)
[17:06:46] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply
[17:06:52] <jinxer-wm>	 RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[17:07:04] <wikibugs>	 (CR) Atsuko: "added credentials to private/" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[17:07:12] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[17:07:13] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[17:07:23] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[17:07:25] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[17:07:37] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[17:07:38] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:07:54] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:07:55] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[17:08:14] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[17:08:15] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[17:08:37] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[17:08:51] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[17:10:28] <wikibugs>	 ops-esams, SRE, Commons, DC-Ops, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11895597 (ssingh) >>! In T425216#11887650, @TheDJ wrote: > I believe there is a 24 hourly script that checks cross dc consistency or something  The...
[17:10:44] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:14:14] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply
[17:14:51] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[17:15:22] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[17:15:54] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[17:16:26] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[17:16:40] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[17:17:11] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:17:30] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:18:01] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[17:18:25] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[17:18:57] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[17:20:01] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[17:27:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[17:27:45] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-23-ulsfo,asw1-23-ulsfo IPv6 with reason: upgrading sr-linux on asw1-23-ulsfo
[17:27:55] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11895651 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cc7686ab-d152-4291-9303-296008017c88) set by cmooney@cumin1003 for 1:00:00 on 2...
[17:28:11] <topranks>	 !log rebooting asw1-23-ulsfo to upgrade SR-Linux OS on switch T408892
[17:28:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:14] <stashbot>	 T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892
[17:31:35] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[17:31:46] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:32:23] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[17:32:54] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[17:33:35] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[17:33:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/56 {#G24090478750000318}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:34:07] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[17:34:22] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[17:34:53] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:35:11] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:35:42] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[17:35:44] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:36:06] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[17:36:37] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[17:36:46] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:37:40] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[17:38:13] <wikibugs>	 (PS2) Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[17:39:21] <jinxer-wm>	 FIRING: [16x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:41:13] <wikibugs>	 (PS3) Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[17:42:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[17:45:36] <wikibugs>	 (PS4) Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[17:48:11] <wikibugs>	 (CR) Andrew Bogott: [C:+2] magnum.conf: set nova-api compatibility to v2.15 [puppet] - https://gerrit.wikimedia.org/r/1283834 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[17:49:52] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:49:54] <icinga-wm>	 RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:49:54] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:49:54] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:49:54] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:49:54] <icinga-wm>	 RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 750 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:54:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[17:55:29] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-23-ulsfo
[17:55:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[17:55:51] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo
[17:59:10] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: restart to test fixes from T425301 - bking@cumin2002
[17:59:13] <stashbot>	 T425301: The cloudelastic chi cluster is red - https://phabricator.wikimedia.org/T425301
[17:59:24] <Amir1>	 jouncebot: nowandnext
[17:59:24] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1700)
[17:59:25] <jouncebot>	 In 0 hour(s) and 0 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1800)
[17:59:44] <Amir1>	 I wait then :D
[18:00:05] <jouncebot>	 brennen and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1800). nyaa~
[18:00:08] <brennen>	 o/
[18:01:31] <brennen>	 currently blocking on T425582 / T425475
[18:01:32] <stashbot>	 T425582: DB  schema change in production - for ce_event_contributions - https://phabricator.wikimedia.org/T425582
[18:01:32] <stashbot>	 T425475: EventContribution: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cec.cec_references_delta' in 'SELECT' - https://phabricator.wikimedia.org/T425475
[18:01:52] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-23-ulsfo
[18:02:15] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-23-ulsfo
[18:05:27] <wikibugs>	 (CR) VolkerE: [C:+1] microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991) (owner: Dzahn)
[18:07:10] <wikibugs>	 (PS1) CDanis: Revert "haproxy: webrequest: capture ratelimiting headers" [puppet] - https://gerrit.wikimedia.org/r/1283858
[18:08:19] <wikibugs>	 (PS5) Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[18:12:30] <wikibugs>	 (PS1) CDanis: benthos/webrequest: fix rl_class/trusted_req names [puppet] - https://gerrit.wikimedia.org/r/1283862 (https://phabricator.wikimedia.org/T419736)
[18:13:13] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1283862 (https://phabricator.wikimedia.org/T419736) (owner: CDanis)
[18:13:46] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:13:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/56 {#G24090478750000318}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:14:21] <jinxer-wm>	 RESOLVED: [16x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:15:44] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:19:10] <wikibugs>	 (PS6) Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[18:24:05] <wikibugs>	 (CR) Dzahn: [V:+1] "works now: https://puppet-compiler.wmflabs.org/output/1283768/8524/" [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[18:24:23] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2026-05-05-223522 to 2026-05-06-154732 [deployment-charts] - https://gerrit.wikimedia.org/r/1283868
[18:24:37] <wikibugs>	 (CR) Dzahn: [C:+2] zuul: rename web_port > finger_port and set it in conf [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[18:26:03] <wikibugs>	 (PS1) Dduvall: zuul: Disallow job definitions in untrusted projects [puppet] - https://gerrit.wikimedia.org/r/1283870
[18:27:10] <wikibugs>	 (CR) Aleksandar Mastilovic: [V:+1 C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1283858 (owner: CDanis)
[18:27:44] <wikibugs>	 (PS2) CDanis: Revert "haproxy: webrequest: capture ratelimiting headers" [puppet] - https://gerrit.wikimedia.org/r/1283858
[18:28:22] <wikibugs>	 (PS3) CDanis: Revert "haproxy: webrequest: capture ratelimiting headers" [puppet] - https://gerrit.wikimedia.org/r/1283858
[18:28:57] <wikibugs>	 (CR) Dzahn: [C:+2] "Acknowledged" [puppet] - https://gerrit.wikimedia.org/r/1283768 (owner: Hashar)
[18:29:22] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2026-05-05-223522 to 2026-05-06-154732 [deployment-charts] - https://gerrit.wikimedia.org/r/1283868 (owner: Jforrester)
[18:29:44] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bookworm
[18:30:44] <wikibugs>	 (PS1) Ladsgroup: Close English Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283872 (https://phabricator.wikimedia.org/T421796)
[18:31:42] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-05-05-223522 to 2026-05-06-154732 [deployment-charts] - https://gerrit.wikimedia.org/r/1283868 (owner: Jforrester)
[18:31:54] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS trixie
[18:32:58] <brennen>	 !log 1.47.0-wmf.1 train status (T423910): blockers resolved, rolling to group1
[18:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:01] <stashbot>	 T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910
[18:33:30] <wikibugs>	 (CR) Dzahn: [C:+2] zuul: Disallow job definitions in untrusted projects [puppet] - https://gerrit.wikimedia.org/r/1283870 (owner: Dduvall)
[18:34:22] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.47.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1283873 (https://phabricator.wikimedia.org/T423910)
[18:34:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by brennen@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283873 (https://phabricator.wikimedia.org/T423910) (owner: TrainBranchBot)
[18:35:19] <wikibugs>	 (CR) Dzahn: [C:+2] add discovery names for gitlab [dns] - https://gerrit.wikimedia.org/r/1282437 (https://phabricator.wikimedia.org/T425441) (owner: Dzahn)
[18:35:45] <logmsgbot>	 !log dzahn@dns1005 START - running authdns-update
[18:36:38] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.47.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1283873 (https://phabricator.wikimedia.org/T423910) (owner: TrainBranchBot)
[18:37:16] <logmsgbot>	 !log dzahn@dns1005 END - running authdns-update
[18:38:46] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:39:33] <wikibugs>	 (CR) Snwachukwu: "Sure, I'm available to deploy once this is merged." [deployment-charts] - https://gerrit.wikimedia.org/r/1281589 (https://phabricator.wikimedia.org/T412830) (owner: Eevans)
[18:39:55] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:40:16] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:40:34] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[18:40:55] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:41:15] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[18:42:14] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[18:42:48] <logmsgbot>	 !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.47.0-wmf.1  refs T423910
[18:42:52] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[18:42:53] <stashbot>	 T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910
[18:44:11] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: use mesh for the evaluator endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[18:44:56] <brennen>	 Amir1: maybe give it 5 minutes for the train to settle at group1 and if i don't yell by then feel free to use the rest of the window.
[18:45:06] <Amir1>	 ooooh, nice
[18:45:07] <Amir1>	 Thanks!
[18:45:27] <brennen>	 sure thing
[18:46:20] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: use mesh for the evaluator endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[18:46:32] <wikibugs>	 (PS1) DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1283875 (https://phabricator.wikimedia.org/T329991)
[18:47:33] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:47:50] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:48:50] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[18:49:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie
[18:49:12] <wikibugs>	 (PS1) Jforrester: Revert "wikifunctions: use mesh for the evaluator endpoints" [deployment-charts] - https://gerrit.wikimedia.org/r/1283877
[18:49:17] <wikibugs>	 (CR) Jforrester: [C:+2] Revert "wikifunctions: use mesh for the evaluator endpoints" [deployment-charts] - https://gerrit.wikimedia.org/r/1283877 (owner: Jforrester)
[18:50:40] <wikibugs>	 (CR) Jforrester: [C:+2] Wikifunctions: Turn on import of references inside Wikidata statements [deployment-charts] - https://gerrit.wikimedia.org/r/1275996 (https://phabricator.wikimedia.org/T404652) (owner: David Martin)
[18:50:48] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1283875 (https://phabricator.wikimedia.org/T329991) (owner: DDesouza)
[18:51:25] <wikibugs>	 (Merged) jenkins-bot: Revert "wikifunctions: use mesh for the evaluator endpoints" [deployment-charts] - https://gerrit.wikimedia.org/r/1283877 (owner: Jforrester)
[18:52:56] <wikibugs>	 (Merged) jenkins-bot: Wikifunctions: Turn on import of references inside Wikidata statements [deployment-charts] - https://gerrit.wikimedia.org/r/1275996 (https://phabricator.wikimedia.org/T404652) (owner: David Martin)
[18:53:13] <wikibugs>	 (Merged) jenkins-bot: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1283875 (https://phabricator.wikimedia.org/T329991) (owner: DDesouza)
[18:53:47] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[18:53:58] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[18:54:14] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[18:54:47] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage
[18:54:52] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[18:55:08] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[18:55:22] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[18:55:54] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[18:58:55] <wikibugs>	 (CR) Eevans: [V:+2 C:+2] "@snwachukwu@wikimedia.org awesome, whenever you are ready." [deployment-charts] - https://gerrit.wikimedia.org/r/1281589 (https://phabricator.wikimedia.org/T412830) (owner: Eevans)
[18:59:05] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage
[18:59:09] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[18:59:22] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[18:59:23] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[18:59:35] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[18:59:36] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[18:59:48] <wikibugs>	 (CR) Eevans: [C:+2] revise-tone-task-generator: updated list of aqs cassandra nodes [deployment-charts] - https://gerrit.wikimedia.org/r/1281588 (https://phabricator.wikimedia.org/T412830) (owner: Eevans)
[18:59:51] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[19:01:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[19:01:39] <wikibugs>	 (CR) Eevans: [C:+2] "This has now been merged.  There is no rush to deploy, but the longer that you wait, the more likely it is that someone will be surprised " [deployment-charts] - https://gerrit.wikimedia.org/r/1281588 (https://phabricator.wikimedia.org/T412830) (owner: Eevans)
[19:05:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[19:09:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:10:34] <wikibugs>	 (CR) Ebernhardson: [C:+1] "This seems reasonable to me.  I checked the client library we use and for the auth credentials it expects the keys `auth_type`, `username`" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[19:14:52] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bookworm
[19:23:24] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4050.ulsfo.wmnet with OS trixie
[19:24:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS trixie
[19:28:45] <wikibugs>	 (PS1) Bking: cloudelastic1012: remove host-specific overrides [puppet] - https://gerrit.wikimedia.org/r/1283903 (https://phabricator.wikimedia.org/T425300)
[19:29:42] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1283903 (https://phabricator.wikimedia.org/T425300) (owner: Bking)
[19:30:56] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS trixie
[19:44:03] <wikibugs>	 SRE, DNS, Domains, Traffic-Icebox, HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071#11896068 (Novem_Linguae) Should this be closed in favor of {T205378}? Seems like pretty soon browsers and servers will support technology that co...
[19:45:13] <wikibugs>	 (CR) Bking: [C:+2] cloudelastic1012: remove host-specific overrides [puppet] - https://gerrit.wikimedia.org/r/1283903 (https://phabricator.wikimedia.org/T425300) (owner: Bking)
[19:46:09] <wikibugs>	 SRE, Traffic, Traffic-Icebox, HTTPS, Upstream: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11896070 (Novem_Linguae)
[19:56:07] <wikibugs>	 (CR) Aleksandar Mastilovic: [V:+1 C:+1] "LGTM 😊" [puppet] - https://gerrit.wikimedia.org/r/1283862 (https://phabricator.wikimedia.org/T419736) (owner: CDanis)
[19:57:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie
[19:58:02] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2000). Please do the needful.
[20:00:05] <jouncebot>	 SomeRandomDev: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1012.eqiad.wmnet with OS trixie
[20:00:10] <SomeRandomDev>	 Hi
[20:03:25] <wikibugs>	 (CR) CDanis: [C:+2] benthos/webrequest: fix rl_class/trusted_req names [puppet] - https://gerrit.wikimedia.org/r/1283862 (https://phabricator.wikimedia.org/T419736) (owner: CDanis)
[20:10:51] <SomeRandomDev>	 anybody here for the deployment window?
[20:12:03] <A_smart_kitten>	 RoanKattouw urbanecm TheresNoTime kindrobot cjming (friendly reping ^^)
[20:12:39] <RoanKattouw>	 Sorry about that, I can deploy 
[20:12:44] <SomeRandomDev>	 thanks!
[20:13:09] * TheresNoTime is also about if needed
[20:13:37] <SomeRandomDev>	 it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1281526; not 100% sure how I would test it, but it shouldn't change anything in theory
[20:14:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie
[20:14:23] <wikibugs>	 10ops-eqiad, 06DC-Ops: Unresponsive management for an-presto1006.mgmt:22 - https://phabricator.wikimedia.org/T425590 (10phaultfinder) 03NEW
[20:16:20] <RoanKattouw>	 I see, I can test that by looking at the Logstash feeds this code logs to
[20:16:39] <SomeRandomDev>	 ah, great
[20:18:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281526 (https://phabricator.wikimedia.org/T336703) (owner: 10SomeRandomDeveloper)
[20:19:06] <wikibugs>	 (03Merged) 10jenkins-bot: Replace use of $wgRequest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281526 (https://phabricator.wikimedia.org/T336703) (owner: 10SomeRandomDeveloper)
[20:19:31] <logmsgbot>	 !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1281526|Replace use of $wgRequest (T336703)]]
[20:19:34] <stashbot>	 T336703: Replace use of $wgRequest in wmf-config (CommonSettings.php / throttle-analyze.php) - https://phabricator.wikimedia.org/T336703
[20:21:31] <logmsgbot>	 !log catrope@deploy1003 catrope, somerandomdeveloper: Backport for [[gerrit:1281526|Replace use of $wgRequest (T336703)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:21:50] <wikibugs>	 (03CR) 10ArielGlenn: [C:03+1] "Basically good, one question inline, one tiny nit also" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) (owner: 10Daniel Kinzler)
[20:24:29] <logmsgbot>	 !log catrope@deploy1003 catrope, somerandomdeveloper: Continuing with deployment
[20:25:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[20:28:43] <logmsgbot>	 !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281526|Replace use of $wgRequest (T336703)]] (duration: 09m 12s)
[20:28:46] <stashbot>	 T336703: Replace use of $wgRequest in wmf-config (CommonSettings.php / throttle-analyze.php) - https://phabricator.wikimedia.org/T336703
[20:29:25] <SomeRandomDev>	 Thank you!
[20:29:26] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:29:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage
[20:32:00] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896201 (10VRiley-WMF)
[20:33:37] <wikibugs>	 (03CR) 10ArielGlenn: rest gateway: defined anon-mediawiki class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler)
[20:33:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/1/5 (Transport: cr2-codfw:et-0/1/4 (Lumen, 449169461) {#changeme_lumen_patch}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:38:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:44:03] <wikibugs>	 06SRE, 10dev-images, 06Infrastructure-Foundations, 06Release-Engineering-Team (Priority Backlog 📥): Rebuild dev-images using a base image without mirrors.wikimedia.org in the apt sources - https://phabricator.wikimedia.org/T423972#11896214 (10brennen) a:03brennen
[20:58:02] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2100)
[21:04:01] <wikibugs>	 (03PS1) 10Zabe: Disable GNSM on dewikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283953 (https://phabricator.wikimedia.org/T421798)
[21:04:36] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[21:04:51] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:05:54] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc1021
[21:06:55] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1021
[21:07:42] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[21:09:51] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[21:11:47] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [pc1021] - vriley@cumin1003"
[21:11:53] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  [pc1021] - vriley@cumin1003"
[21:11:53] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:12:57] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:14:16] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:15:11] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:17:17] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:22:17] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:24:10] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:26:59] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:27:58] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1012.eqiad.wmnet with OS trixie
[21:28:26] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:29:32] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:31:36] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:43:31] <zabe>	 jouncebot: nowandnext
[21:43:31] <jouncebot>	 For the next 0 hour(s) and 16 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2100)
[21:43:31] <jouncebot>	 In 0 hour(s) and 16 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2200)
[21:43:47] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Disable GNSM on dewikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283953 (https://phabricator.wikimedia.org/T421798) (owner: 10Zabe)
[21:44:07] <wikibugs>	 (03PS4) 10Ryan Kemper: data-platform: Add alerts for cirrus memory or I/O stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking)
[21:44:43] <wikibugs>	 (03Merged) 10jenkins-bot: Disable GNSM on dewikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283953 (https://phabricator.wikimedia.org/T421798) (owner: 10Zabe)
[21:44:48] <wikibugs>	 06SRE, 10SRE-Access-Requests: logstash access - https://phabricator.wikimedia.org/T425575#11896348 (10KineticPelagic) 05Open→03Invalid Made request through IDM.  Thank you, @Aklapper .
[21:45:38] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1283953|Disable GNSM on dewikinews (T421798)]]
[21:45:41] <stashbot>	 T421798: Undeploy GoogleNewsSitemap after 2026-05-04 - https://phabricator.wikimedia.org/T421798
[21:47:34] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1283953|Disable GNSM on dewikinews (T421798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:48:23] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with deployment
[21:51:20] <wikibugs>	 (03PS5) 10Ryan Kemper: data-platform: Add alerts for cirrus memory/IO stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking)
[21:51:49] <wikibugs>	 (03CR) 10Bking: [C:03+2] data-platform: Add alerts for cirrus memory/IO stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking)
[21:52:00] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] "We should circle back to add runbooks but I'm fine getting this shipped first" [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking)
[21:52:34] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283953|Disable GNSM on dewikinews (T421798)]] (duration: 06m 56s)
[21:52:38] <stashbot>	 T421798: Undeploy GoogleNewsSitemap after 2026-05-04 - https://phabricator.wikimedia.org/T421798
[21:54:03] <wikibugs>	 (03PS1) 10Cathal Mooney: Rancid: fix srlinux.pm parsing of Nokia SR-Linux configs [puppet] - 10https://gerrit.wikimedia.org/r/1283969
[21:54:12] <wikibugs>	 (03Merged) 10jenkins-bot: data-platform: Add alerts for cirrus memory/IO stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking)
[21:54:22] <wikibugs>	 (03PS4) 10Zabe: Undeploy GoogleNewsSitemap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277783 (https://phabricator.wikimedia.org/T421798)
[21:57:23] <wikibugs>	 (03PS2) 10Cathal Mooney: Rancid: fix srlinux.pm parsing of Nokia SR-Linux configs [puppet] - 10https://gerrit.wikimedia.org/r/1283969
[21:58:10] <wikibugs>	 (03PS1) 10Clare Ming: UBN fix: guard entry.serverTiming before forEach [extensions/TestKitchen] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283972 (https://phabricator.wikimedia.org/T425591)
[21:58:55] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Rancid: fix srlinux.pm parsing of Nokia SR-Linux configs [puppet] - 10https://gerrit.wikimedia.org/r/1283969 (owner: 10Cathal Mooney)
[22:00:00] <cjming>	 jouncebot: nowandnext
[22:00:00] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2200)
[22:00:01] <jouncebot>	 In 7 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600)
[22:00:01] <jouncebot>	 In 7 hour(s) and 59 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600)
[22:00:05] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2200)
[22:00:42] <cjming>	 ok if i backport a UBN fix now-ish? 
[22:00:44] <cjming>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1283972
[22:04:02] <cjming>	 it's the Readers deployment window now -- if there's no activity in the next 2-3 minutes, I will proceed with above patch unless someone else tells me otherwise
[22:04:34] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc1021
[22:04:38] <cjming>	 *someone else tells me otherwise
[22:05:35] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1021
[22:06:04] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[22:06:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283972 (https://phabricator.wikimedia.org/T425591) (owner: 10Clare Ming)
[22:07:46] <wikibugs>	 (03Merged) 10jenkins-bot: UBN fix: guard entry.serverTiming before forEach [extensions/TestKitchen] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283972 (https://phabricator.wikimedia.org/T425591) (owner: 10Clare Ming)
[22:08:06] <wikibugs>	 (03CR) 10ArielGlenn: "couple small things noted, really glad to see this patch land,the makefiles in nonstandard locations was making my teeth itch a bit :-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler)
[22:08:11] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1283972|UBN fix: guard entry.serverTiming before forEach (T425591)]]
[22:08:14] <stashbot>	 T425591: TypeError: undefined is not an object (evaluating 'entry.serverTiming.forEach') - https://phabricator.wikimedia.org/T425591
[22:08:51] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:09:24] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:10:03] <logmsgbot>	 !log cjming@deploy1003 cjming: Backport for [[gerrit:1283972|UBN fix: guard entry.serverTiming before forEach (T425591)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:10:27] <logmsgbot>	 !log cjming@deploy1003 cjming: Continuing with deployment
[22:11:25] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:11:49] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:14:36] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283972|UBN fix: guard entry.serverTiming before forEach (T425591)]] (duration: 06m 25s)
[22:14:39] <stashbot>	 T425591: TypeError: undefined is not an object (evaluating 'entry.serverTiming.forEach') - https://phabricator.wikimedia.org/T425591
[22:14:40] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:14:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[22:14:58] <wikibugs>	 10ops-esams, 06SRE, 06Commons, 06DC-Ops, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11896466 (10AlexisJazz) >>! In T425216#11895597, @ssingh wrote: >>>! In T425216#11887650, @TheDJ wrote: >> I believe there is a 24 hourly script that...
[22:15:01] <wikibugs>	 (03PS2) 10Hashar: zuul: use upstream "build node" semantic [puppet] - 10https://gerrit.wikimedia.org/r/1283968
[22:15:01] <wikibugs>	 (03CR) 10Hashar: "Hosts that have no differences (main and executor):" [puppet] - 10https://gerrit.wikimedia.org/r/1283968 (owner: 10Hashar)
[22:15:59] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:18:29] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host pc1021.eqiad.wmnet with OS trixie
[22:18:43] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie
[22:22:45] <Amir1>	 jouncebot: nowandnext
[22:22:45] <jouncebot>	 For the next 0 hour(s) and 37 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2200)
[22:22:46] <jouncebot>	 In 7 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600)
[22:22:46] <jouncebot>	 In 7 hour(s) and 37 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600)
[22:23:50] <wikibugs>	 (03PS1) 10Jasmine: kafka-main: set codfw brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419212)
[22:24:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[22:24:59] <wikibugs>	 (03CR) 10Jasmine: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419212) (owner: 10Jasmine)
[22:25:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283872 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup)
[22:26:03] <wikibugs>	 (03Merged) 10jenkins-bot: Close English Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283872 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup)
[22:26:29] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283872|Close English Wikinews (T421796)]]
[22:26:32] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:28:22] <wikibugs>	 (03PS1) 10Cathal Mooney: team-netops: CoreRouterInterfaceDropPercent - ingore missing series [alerts] - 10https://gerrit.wikimedia.org/r/1283993
[22:28:24] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283872|Close English Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:28:59] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[22:33:09] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283872|Close English Wikinews (T421796)]] (duration: 06m 40s)
[22:33:12] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:37:58] <wikibugs>	 (03PS1) 10Ladsgroup: Close Spanish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284004 (https://phabricator.wikimedia.org/T421796)
[22:40:55] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:41:25] <wikibugs>	 (03PS2) 10Jasmine: kafka-main: set codfw brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419216)
[22:41:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284004 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup)
[22:42:50] <wikibugs>	 (03Merged) 10jenkins-bot: Close Spanish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284004 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup)
[22:43:16] <wikibugs>	 (03PS1) 10SBassett: Enable CSPUseReportURIDirective in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058)
[22:43:16] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1284004|Close Spanish Wikinews (T421796)]]
[22:43:19] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:43:45] <wikibugs>	 (03CR) 10SBassett: [C:04-1] "Hold for config deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett)
[22:45:12] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1284004|Close Spanish Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:46:11] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[22:48:29] <wikibugs>	 06SRE, 06Traffic: "Nahui Ollin" is enwiki's #1 article. Never heard of it? That's the problem - https://phabricator.wikimedia.org/T425600 (10AlexisJazz) 03NEW
[22:50:24] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284004|Close Spanish Wikinews (T421796)]] (duration: 07m 08s)
[22:50:28] <stashbot>	 T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:56:31] <jinxer-wm>	 FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[23:00:49] <wikibugs>	 (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284014 (https://phabricator.wikimedia.org/T421797)
[23:01:31] <jinxer-wm>	 RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[23:03:40] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284014 (https://phabricator.wikimedia.org/T421797) (owner: 10Ladsgroup)
[23:05:20] <wikibugs>	 (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284014 (https://phabricator.wikimedia.org/T421797) (owner: 10Ladsgroup)
[23:12:25] <logmsgbot>	 !log ladsgroup@deploy1003 Synchronized portals/wikipedia.org/assets: Sync portals for removal of Wikinews (duration: 06m 12s)
[23:14:48] <logmsgbot>	 !log ladsgroup@deploy1003 Synchronized portals: Sync portals for removal of Wikinews (duration: 02m 22s)
[23:19:50] <wikibugs>	 (03PS1) 10Cwhite: ci test plz [puppet] - 10https://gerrit.wikimedia.org/r/1284024
[23:22:45] <wikibugs>	 06SRE, 10Pageviews-Anomaly, 06Traffic: "Nahui Ollin" is enwiki's #1 article. Never heard of it? That's the problem - https://phabricator.wikimedia.org/T425600#11896663 (10AntiCompositeNumber)
[23:38:45] <logmsgbot>	 !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1021.eqiad.wmnet with OS trixie
[23:38:52] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896690 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie executed with errors: - pc1021 (**FAIL**...
[23:40:34] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1284035
[23:40:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1284035 (owner: 10TrainBranchBot)
[23:41:22] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host pc1021.eqiad.wmnet with OS trixie
[23:41:36] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896691 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie
[23:52:47] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1284035 (owner: 10TrainBranchBot)