[00:11:58] (03PS1) 10Ladsgroup: Close Italian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283117 (https://phabricator.wikimedia.org/T421796) [00:18:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283117 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [00:19:11] (03Merged) 10jenkins-bot: Close Italian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283117 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [00:19:25] FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:19:58] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283117|Close Italian Wikinews (T421796)]] [00:20:01] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [00:21:55] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283117|Close Italian Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:23:07] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [00:24:53] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1001.eqiad.wmnet with OS trixie [00:25:14] !log herron@cumin1003 START - Cookbook sre.hosts.move-vlan for host kafka-logging1001 [00:25:14] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-logging1001 [00:27:25] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283117|Close Italian Wikinews (T421796)]] (duration: 07m 26s) [00:27:28] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [00:27:42] (03PS1) 10Herron: kafka-logging1001: prep for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1283139 (https://phabricator.wikimedia.org/T417001) [00:29:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [00:34:55] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:35:24] that is me ^ [00:37:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [00:38:43] PROBLEM - Host mr1-ulsfo.oob is DOWN: 
PING CRITICAL - Packet loss = 100% [00:41:24] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: host reimage [00:43:45] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 64.56 ms [00:44:25] (03PS2) 10Ladsgroup: Close Dutch Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) [00:44:37] FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:45:30] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: host reimage [00:45:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [00:46:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11892483 (10Papaul) [00:46:16] (03CR) 10Neriah: [C:03+1] Close Dutch Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [00:47:16] (03Merged) 10jenkins-bot: Close Dutch Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [00:47:40] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283061|Close Dutch Wikinews (T421796)]] [00:47:44] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [00:49:36] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283061|Close 
Dutch Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:49:57] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [00:51:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [00:54:06] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283061|Close Dutch Wikinews (T421796)]] (duration: 06m 26s) [00:54:09] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [01:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [01:05:20] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1001.eqiad.wmnet with OS trixie [01:10:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1283160 [01:10:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1283160 (owner: 10TrainBranchBot) [01:21:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1283160 (owner: 10TrainBranchBot) [01:35:44] FIRING: [2x] 
CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:00:41] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:19] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 37s) [02:07:56] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11892594 (10EvenTwist41) I understand the need to stop generating arbitrary thumbnail sizes, but was it really necessary to break exi... [02:09:21] FIRING: [8x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:17:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [02:26:18] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11892598 (10Papaul) All the servers in rack 23 are online and ready for re-image. I tested the re-image on cp4038 and completed with no issues after @ayounsi... [02:34:21] FIRING: [8x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:11] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 3989.41 ms [02:51:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [02:53:42] (03PS1) 10Dzahn: microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991) [02:54:01] (03CR) 10CI reject: [V:04-1] microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991) (owner: 10Dzahn) [02:54:11] (03PS2) 10Dzahn: microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991) [02:54:20] (03CR) 10ArielGlenn: "Right, here's my official +1, ok by me to merge." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [02:54:25] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 309.00 ms [02:54:42] (03CR) 10Dzahn: [C:03+2] microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991) (owner: 10Dzahn) [03:11:07] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:16:52] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:17:03] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:20:52] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:22:05] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 237.59 ms [03:22:15] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283048 
(https://phabricator.wikimedia.org/T425440) (owner: 10Danielyepezgarces) [03:26:54] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [03:27:02] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [03:27:30] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [03:28:02] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [03:30:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [03:35:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [03:36:46] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [03:40:36] (03PS1) 10Andrew Bogott: magnum setup_capi.sh: export some vars [puppet] - 10https://gerrit.wikimedia.org/r/1283239 [03:41:40] (03CR) 10Andrew 
Bogott: [C:03+2] magnum setup_capi.sh: export some vars [puppet] - 10https://gerrit.wikimedia.org/r/1283239 (owner: 10Andrew Bogott) [04:06:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [04:06:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [04:19:40] FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:21:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[04:21:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [04:33:47] PROBLEM - Host lvs4008 is DOWN: PING CRITICAL - Packet loss = 100% [04:33:51] PROBLEM - Host lvs4010 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:07] PROBLEM - Host cp4040 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:07] PROBLEM - Host cp4042 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:07] PROBLEM - Host cp4046 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:07] PROBLEM - Host cp4044 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:07] PROBLEM - Host cp4050 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:07] PROBLEM - Host cp4052 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:07] PROBLEM - Host ganeti4006 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:08] PROBLEM - Host cp4048 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:08] PROBLEM - Host dns4004 is DOWN: PING CRITICAL - Packet loss = 100% [04:34:09] PROBLEM - Host ganeti4008 is DOWN: PING CRITICAL - Packet loss = 100% [04:35:11] FIRING: [4x] GanetiBGPDown: BGP session down between ganeti4006 and asw1-23-ulsfo - group ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [04:35:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [04:40:10] (03CR) 
10Clare Ming: [C:03+2] Test Kitchen UI: Deploy v1.3.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282420 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci) [04:42:34] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282420 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci) [05:03:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1223 with weight 0 T425318', diff saved to https://phabricator.wikimedia.org/P92342 and previous config saved to /var/cache/conftool/dbconfig/20260506-050342-marostegui.json [05:03:46] T425318: Switchover s3 master (db1189 -> db1223) - https://phabricator.wikimedia.org/T425318 [05:03:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T425318 [05:03:56] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1282279 (https://phabricator.wikimedia.org/T425318) (owner: 10Gerrit maintenance bot) [05:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [05:06:54] !log Starting s3 eqiad failover from db1189 to db1223 - T425318 [05:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T425318', diff saved to https://phabricator.wikimedia.org/P92343 and previous config saved to 
/var/cache/conftool/dbconfig/20260506-050755-marostegui.json [05:08:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1223 to s3 primary and set section read-write T425318', diff saved to https://phabricator.wikimedia.org/P92344 and previous config saved to /var/cache/conftool/dbconfig/20260506-050816-marostegui.json [05:09:03] (03CR) 10Marostegui: [C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1282280 (https://phabricator.wikimedia.org/T425318) (owner: 10Gerrit maintenance bot) [05:09:09] !log marostegui@dns1004 START - running authdns-update [05:09:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1189 T425318', diff saved to https://phabricator.wikimedia.org/P92345 and previous config saved to /var/cache/conftool/dbconfig/20260506-050948-marostegui.json [05:09:52] T425318: Switchover s3 master (db1189 -> db1223) - https://phabricator.wikimedia.org/T425318 [05:11:01] !log marostegui@dns1004 END - running authdns-update [05:12:29] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [05:12:29] (03PS1) 10Marostegui: db1189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283306 (https://phabricator.wikimedia.org/T424792) [05:13:52] (03CR) 10Marostegui: [C:03+2] db1189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283306 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui) [05:14:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1189.eqiad.wmnet with reason: Reimage to Trixie [05:14:26] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1189: Reimage to Trixie [05:14:29] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [05:14:33] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1189: Reimage to Trixie [05:15:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [05:16:54] (03CR) 10ArielGlenn: [C:03+1] "Let's see what the impact is." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) (owner: 10Daniel Kinzler) [05:17:30] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [05:18:37] marostegui@cumin1003 reimage (PID 455315) is awaiting input [05:19:37] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1189.eqiad.wmnet with OS trixie [05:21:12] (03PS1) 10Marostegui: db1191,db2208: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283483 (https://phabricator.wikimedia.org/T425388) [05:21:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1191.eqiad.wmnet with reason: Reimage to Trixie [05:21:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1191: Reimage to Trixie [05:22:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1191: Reimage to Trixie [05:23:37] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS trixie [05:24:22] (03CR) 10Marostegui: [C:03+2] db1191,db2208: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283483 (https://phabricator.wikimedia.org/T425388) (owner: 
10Marostegui) [05:24:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2208.codfw.wmnet with reason: Reimage to Trixie [05:24:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2208: Reimage to Trixie [05:25:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2208: Reimage to Trixie [05:26:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2208.codfw.wmnet with reason: Reimage to Trixie [05:26:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2208: Reimage to Trixie [05:26:52] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2208: Reimage to Trixie [05:33:52] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage [05:35:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:35:59] 10ops-codfw, 06DBA, 06DC-Ops: db2208 PXE boot change not accessible - https://phabricator.wikimedia.org/T425506#11892785 (10Marostegui) p:05Triage→03Medium I've reset the idrac but I still cannot reimage the host as I get that error - maybe this needs something else? 
[05:37:41] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage [05:39:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: host reimage [05:43:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage [05:47:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2208.codfw.wmnet with reason: Idrac issues T425506 [05:47:09] T425506: db2208 PXE boot change not accessible - https://phabricator.wikimedia.org/T425506 [05:52:23] (03PS1) 10Ayounsi: mr1-ulsfo: remove device specific security_zones definition [homer/public] - 10https://gerrit.wikimedia.org/r/1283503 (https://phabricator.wikimedia.org/T421674) [05:55:13] (03PS1) 10Marostegui: Revert "db1189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1283507 [06:01:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1189.eqiad.wmnet with OS trixie [06:06:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1189: after reimage to trixie [06:06:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS trixie [06:09:22] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1191: after reimage to trixie [06:17:12] (03CR) 10Marostegui: [C:03+2] db1191: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283519 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [06:20:28] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4006.ulsfo.wmnet [06:24:32] (03CR) 10Tiziano Fogli: [C:03+2] o11y/global: disable seasonality checks for small prom instances [alerts] - 10https://gerrit.wikimedia.org/r/1282935 (https://phabricator.wikimedia.org/T415317) 
(owner: 10Tiziano Fogli) [06:24:42] (03CR) 10Tiziano Fogli: [C:03+2] o11y/global: adjust formatting [alerts] - 10https://gerrit.wikimedia.org/r/1282934 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [06:26:06] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [06:26:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [06:26:38] (03Merged) 10jenkins-bot: o11y/global: adjust formatting [alerts] - 10https://gerrit.wikimedia.org/r/1282934 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [06:26:52] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:26:57] (03Merged) 10jenkins-bot: o11y/global: disable seasonality checks for small prom instances [alerts] - 10https://gerrit.wikimedia.org/r/1282935 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [06:27:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: 10DCausse) [06:30:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4006.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [06:34:04] jmm@cumin2002 
decommission (PID 2485932) is awaiting input [06:34:37] FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:40:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:41:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:45:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:45:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:47:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:48:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4006.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [06:48:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:48:06] !log 
jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti4006.ulsfo.wmnet [06:48:40] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4008.ulsfo.wmnet [06:48:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [06:48:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [06:50:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:51:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1189: after reimage to trixie [06:52:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:52:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:54:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1191: after reimage to trixie [06:55:34] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [06:58:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[06:58:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [06:59:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T0700). [07:00:05] awight, WMDE-Fisch, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:12] o/ [07:00:15] I can deploy the first patches [07:00:28] sure [07:01:28] jmm@cumin2002 decommission (PID 2504587) is awaiting input [07:01:35] (03PS1) 10Slyngshede: Hieradata: Update IPs for ULSFO CP hosts in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283536 (https://phabricator.wikimedia.org/T424686) [07:04:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:04:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:06:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:06:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:07:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283033 (https://phabricator.wikimedia.org/T425433) (owner: 10Svantje Lilienthal) [07:11:09] \o [07:13:02] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2208 PXE boot change not accessible - https://phabricator.wikimedia.org/T425506#11892872 (10elukey) ` Traceback (most recent call last): File "/usr/lib/python3/dist-packages/spicerack/redfish.py", line 382, in request return self._api_client.request(method, uri,... 
[07:13:47] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4008.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:14:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:14:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4008.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:14:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:14:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti4008.ulsfo.wmnet [07:14:43] (03PS1) 10Slyngshede: Hieradata: Update IPs for ULSFO lvs hosts in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283545 (https://phabricator.wikimedia.org/T424686) [07:15:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:16:20] dcausse: This is slow. Would you like me to include your patches in my next SpiderPig batch? There's ~0% chance I would need to roll back the feature patch so the focus would be on your config changes, if you agree. [07:16:22] (03CR) 10Tiziano Fogli: CoreRouterInterfaceDropPercent: fix ping disable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [07:17:28] awight: sure! 
[07:17:37] :-) [07:18:00] and I apologize in advance if something goes wrong with mine :) [07:18:16] (03PS1) 10Slyngshede: Hieradata: Update IPs for ULSFO dns host in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283548 (https://phabricator.wikimedia.org/T424686) [07:18:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:18:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:18:34] (03Merged) 10jenkins-bot: VE: Avoid counting all refs when listIndex is undefined [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283033 (https://phabricator.wikimedia.org/T425433) (owner: 10Svantje Lilienthal) [07:18:44] (03CR) 10Awight: [C:03+1] search: fix alt. 
completion indices to test keyword tokenizer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [07:19:17] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1283033|VE: Avoid counting all refs when listIndex is undefined (T425433)]] [07:19:20] T425433: Wrong reuse message when creating a new reference - https://phabricator.wikimedia.org/T425433 [07:20:56] (03CR) 10Awight: [C:03+1] search: enable Latin-to-Devanagari transliteration second-chance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: 10DCausse) [07:20:57] (03PS1) 10Elukey: Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) [07:21:17] !log awight@deploy1003 awight, lilients: Backport for [[gerrit:1283033|VE: Avoid counting all refs when listIndex is undefined (T425433)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:21:26] (03PS2) 10Elukey: Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) [07:21:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:21:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:22:18] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [07:22:39] !log awight@deploy1003 awight, lilients: Continuing with deployment [07:23:34] (03CR) 10CI reject: [V:04-1] Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [07:25:21] (03CR) 10Awight: [C:03+1] VE: Avoid counting all refs when listIndex is undefined [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283101 (https://phabricator.wikimedia.org/T425433) (owner: 10WMDE-Fisch) [07:26:05] dcausse: Just to confirm, is there anything to test as your patches are deployed? I imagine that the search index takes some time to rebuild... [07:26:45] awight: yes, only the hindi transliteration will need a bit of testing [07:26:55] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283033|VE: Avoid counting all refs when listIndex is undefined (T425433)]] (duration: 07m 37s) [07:26:58] T425433: Wrong reuse message when creating a new reference - https://phabricator.wikimedia.org/T425433 [07:27:12] dcausse: okay I'll wait then, once we get to the test phase [07:27:21] thanks! 
[07:28:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283101 (https://phabricator.wikimedia.org/T425433) (owner: 10WMDE-Fisch) [07:28:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [07:28:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: 10DCausse) [07:29:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:30:14] (03Merged) 10jenkins-bot: search: fix alt. completion indices to test keyword tokenizer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [07:30:22] (03Merged) 10jenkins-bot: search: enable Latin-to-Devanagari transliteration second-chance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: 10DCausse) [07:30:54] (03Merged) 10jenkins-bot: VE: Avoid counting all refs when listIndex is undefined [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283101 (https://phabricator.wikimedia.org/T425433) (owner: 10WMDE-Fisch) [07:31:24] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1283101|VE: Avoid counting all refs when listIndex is undefined (T425433)]], [[gerrit:1283037|search: fix alt. 
completion indices to test keyword tokenizer (T420427)]], [[gerrit:1283041|search: enable Latin-to-Devanagari transliteration second-chance (T425018)]] [07:31:31] T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427 [07:31:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:31:32] T425018: Enable Latin-to-Devanagari Transliteration second-try search on Hindi Wikis - https://phabricator.wikimedia.org/T425018 [07:32:35] (03CR) 10Fabfur: [C:03+1] "LGTM, double checked with netbox" [puppet] - 10https://gerrit.wikimedia.org/r/1283536 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede) [07:33:19] !log awight@deploy1003 wmde-fisch, awight, dcausse: Backport for [[gerrit:1283101|VE: Avoid counting all refs when listIndex is undefined (T425433)]], [[gerrit:1283037|search: fix alt. completion indices to test keyword tokenizer (T420427)]], [[gerrit:1283041|search: enable Latin-to-Devanagari transliteration second-chance (T425018)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can [07:33:19] now be verified there. [07:33:25] T425433: Wrong reuse message when creating a new reference - https://phabricator.wikimedia.org/T425433 [07:33:32] testing [07:33:33] (03CR) 10Tiziano Fogli: "I noticed the same alert is defined in team-data-platform/stat_host.yaml. 
To improve maintainability, you could leverage YAML anchors and " [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [07:33:45] ty [07:34:49] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1283545 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede) [07:34:53] (03PS3) 10Elukey: Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) [07:35:32] (03CR) 10Fabfur: [C:03+1] "Ok for me!" [puppet] - 10https://gerrit.wikimedia.org/r/1283548 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede) [07:35:59] Cite change lgtm [07:36:07] awight: lgtm [07:36:10] ack! [07:36:14] !log awight@deploy1003 wmde-fisch, awight, dcausse: Continuing with deployment [07:36:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:36:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:36:46] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [07:40:23] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283101|VE: Avoid counting all refs when listIndex is undefined (T425433)]], [[gerrit:1283037|search: fix alt. 
completion indices to test keyword tokenizer (T420427)]], [[gerrit:1283041|search: enable Latin-to-Devanagari transliteration second-chance (T425018)]] (duration: 08m 58s) [07:40:28] T425433: Wrong reuse message when creating a new reference - https://phabricator.wikimedia.org/T425433 [07:40:29] T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427 [07:40:29] T425018: Enable Latin-to-Devanagari Transliteration second-try search on Hindi Wikis - https://phabricator.wikimedia.org/T425018 [07:41:00] 07sre-alert-triage, 06Data-Platform-SRE, 06ServiceOps new: Alert in need of triage: Kafka MirrorMaker main-codfw_to_main-eqiad dropped message count in last 30m (instance alert1002) - https://phabricator.wikimedia.org/T425339#11892950 (10JMeybohm) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282919... [07:41:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:42:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:45:39] (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Add dual stack cert support" [puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:46:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:46:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:46:39] (03CR) 10JMeybohm: [V:03+1 C:03+2] Revert "envoyproxy: global_tlsparams" [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) 
[07:46:43] (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Add STEK configuration support" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:46:48] (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Provide support for UDS upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1282345 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:46:53] (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Support alpn_protocols configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1282344 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:46:57] (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Support TLS min/max version config" [puppet] - 10https://gerrit.wikimedia.org/r/1282343 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:47:03] (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Allow configuring TLS handshake timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282342 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:47:07] (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Allow setting http2 protocol options" [puppet] - 10https://gerrit.wikimedia.org/r/1282341 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:47:13] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2208.codfw.wmnet with OS trixie [07:47:13] (03CR) 10JMeybohm: [C:03+2] Revert "envoyproxy: Allow disabling x-request-id generation" [puppet] - 10https://gerrit.wikimedia.org/r/1282340 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:47:18] (03CR) 10JMeybohm: [C:03+2] Revert "envoy: Allow disabling circuit breakers" [puppet] - 10https://gerrit.wikimedia.org/r/1282339 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [07:47:24] (03CR) 10JMeybohm: [C:03+2] Revert "envoy: Allow configuring delayed_closed_timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282338 (https://phabricator.wikimedia.org/T271421) 
(owner: 10JMeybohm) [07:48:19] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2208 PXE boot change not accessible - https://phabricator.wikimedia.org/T425506#11892986 (10Marostegui) 05Open→03Resolved After a cold restart it worked - seems that it needed more time after the cold reboot. [07:48:48] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [07:49:00] (03CR) 10Muehlenhoff: Set pki1001 to insetup to ease decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [07:50:55] (03CR) 10Muehlenhoff: Set pki1001 to insetup to ease decom (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [07:51:46] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [07:52:53] (03PS4) 10Elukey: Set pki1001 to insetup to ease decom [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) [07:53:18] (03CR) 10Elukey: Set pki1001 to insetup to ease decom (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [07:53:38] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [07:55:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:55:32] PROBLEM - PyBal 
backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:57:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:58:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2144: Replacing HW T418979 [07:58:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:58:55] T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979 [07:59:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:59:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2144: Replacing HW T418979 [08:00:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Replacing hw [08:00:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:01:35] (03PS1) 10Marostegui: mariadb: Productionize db2253 [puppet] - 10https://gerrit.wikimedia.org/r/1283619 (https://phabricator.wikimedia.org/T418979) [08:02:19] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2253 [puppet] - 10https://gerrit.wikimedia.org/r/1283619 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [08:02:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [08:02:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:02:50] (03PS1) 10Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620 [08:03:41] (03PS2) 10Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620 [08:06:27] !log EU morning deployment is done [08:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:27] (03PS1) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621 [08:08:30] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2208.codfw.wmnet with OS trixie [08:09:09] (03CR) 10Majavah: "nrpe::plugin places files into a directory that has `recurse => true, purge => true`, so having a plugin defined as `ensure => absent` and" [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff) [08:09:13] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey) [08:09:25] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2208.codfw.wmnet with OS trixie [08:10:30] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [08:14:29] 10ops-codfw, 06DBA, 06DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516 (10Marostegui) 03NEW [08:15:45] 10ops-codfw, 06DBA, 06DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516#11893166 (10Marostegui) p:05Triage→03Medium [08:16:12] (03Abandoned) 10STran: Add exposure for experiment instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 
10https://gerrit.wikimedia.org/r/1280387 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [08:16:20] (03Abandoned) 10STran: Fix incorrect source in back instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280386 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [08:19:40] FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:20:51] (03PS1) 10Jelto: miscweb: add emptyDir to wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283627 (https://phabricator.wikimedia.org/T414405) [08:24:05] (03PS1) 10Marostegui: installserver: Add hosts for uefi /srv reusage [puppet] - 10https://gerrit.wikimedia.org/r/1283629 [08:25:16] (03PS2) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621 [08:25:57] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey) [08:26:59] (03CR) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey) [08:28:02] (03CR) 10Arnaudb: [C:03+1] add discovery names for gitlab [dns] - 10https://gerrit.wikimedia.org/r/1282437 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [08:29:02] (03CR) 10Marostegui: [C:03+2] installserver: Add hosts for uefi /srv reusage [puppet] - 10https://gerrit.wikimedia.org/r/1283629 (owner: 10Marostegui) [08:29:28] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2208.codfw.wmnet with OS trixie [08:31:25] (03CR) 10Arnaudb: [C:03+1] "looks good to me, thanks! 
I'm wondering how we'll advertise the port migration that will have to be done by gitlab users." [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [08:31:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:31:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:31:37] jouncebot: nowandnext [08:31:37] No deployments scheduled for the next 1 hour(s) and 28 minute(s) [08:31:37] In 1 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1000) [08:32:09] (03CR) 10Zabe: [C:03+2] Correctly support new file tables in RevisionDeleteUser [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe) [08:32:13] (03CR) 10Arnaudb: [C:03+1] add load balancer IPs for gitlab to geo DNS [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [08:32:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:35:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:36:09] (03PS2) 10Cathal Mooney: CoreRouterInterfaceDropPercent: fix ping disable [alerts] - 
10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [08:36:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:38:04] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1283552 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [08:38:30] (03PS1) 10Marostegui: instances.yaml: Add db2253 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1283638 (https://phabricator.wikimedia.org/T418979) [08:38:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1039.eqiad.wmnet with reason: Maintenance [08:38:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1039 (T419961)', diff saved to https://phabricator.wikimedia.org/P92357 and previous config saved to /var/cache/conftool/dbconfig/20260506-083841-fceratto.json [08:39:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:39:50] (03PS3) 10Cathal Mooney: CoreRouterInterfaceDropPercent: fix ping disable [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [08:39:52] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2253 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1283638 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [08:40:16] (03PS1) 10Muehlenhoff: Switch pki2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283640 (https://phabricator.wikimedia.org/T416664) [08:40:28] 
PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:41:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283640 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [08:42:21] (03CR) 10Cathal Mooney: CoreRouterInterfaceDropPercent: fix ping disable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [08:42:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:43:29] (03CR) 10JMeybohm: [C:03+1] kafka-main: add eqiad-main cluster brokers to Confluent distro 77 (3.7) [puppet] - 10https://gerrit.wikimedia.org/r/1282999 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [08:43:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2253 to ms2 T418973', diff saved to https://phabricator.wikimedia.org/P92358 and previous config saved to /var/cache/conftool/dbconfig/20260506-084337-marostegui.json [08:43:41] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973 [08:43:46] (03CR) 10CI reject: [V:04-1] Correctly support new file tables in RevisionDeleteUser [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe) [08:44:21] (03CR) 10Slyngshede: [C:03+2] Hieradata: Update IPs for ULSFO CP hosts in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283536 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede) [08:45:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, 
wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:46:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:46:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:48:07] (03CR) 10Zabe: [C:03+2] "..." [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe) [08:49:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:50:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:51:06] (03PS11) 10JMeybohm: tlsproxy::envoy: Support ratelimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [08:51:23] (03PS2) 10Tiziano Fogli: add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) (owner: 10Cwhite) [08:51:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:52:12] (03PS1) 10Marostegui: db2253: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283644 (https://phabricator.wikimedia.org/T418979) [08:52:28] RECOVERY - PyBal backends 
health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:52:46] (03PS1) 10Majavah: P:ssl: Renew toolsbeta Prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/1283645 [08:53:08] (03CR) 10Marostegui: [C:03+2] db2253: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283644 (https://phabricator.wikimedia.org/T418979) (owner: 10Marostegui) [08:53:16] (03Merged) 10jenkins-bot: Correctly support new file tables in RevisionDeleteUser [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe) [08:53:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1039 (T419961)', diff saved to https://phabricator.wikimedia.org/P92359 and previous config saved to /var/cache/conftool/dbconfig/20260506-085321-fceratto.json [08:54:41] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1281894|Correctly support new file tables in RevisionDeleteUser (T424553)]] [08:55:01] (03CR) 10Majavah: [C:03+2] P:ssl: Renew toolsbeta Prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/1283645 (owner: 10Majavah) [08:55:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:55:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:56:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [08:56:39] !log zabe@deploy1003 zabe: Backport for [[gerrit:1281894|Correctly support new file tables in RevisionDeleteUser (T424553)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:57:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:59:07] !log zabe@deploy1003 zabe: Continuing with deployment [08:59:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:00:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:01:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:01:43] 06SRE, 10Observability-Metrics, 10Prod-Kubernetes, 06ServiceOps new, and 2 others: write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11893353 (10MLechvien-WMF) @hnowlan can I confirm if we are targeting to complete this this quarter? 
[09:03:25] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281894|Correctly support new file tables in RevisionDeleteUser (T424553)]] (duration: 08m 44s) [09:03:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1039', diff saved to https://phabricator.wikimedia.org/P92360 and previous config saved to /var/cache/conftool/dbconfig/20260506-090329-fceratto.json [09:04:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [09:06:34] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:07:07] (03CR) 10JMeybohm: "Shouldn't this change remove/move `modules/confluent/files/kafka/kafka3.sh` (to the template)?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey) [09:07:24] (03PS3) 10Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620 [09:07:54] (03CR) 10CI reject: [V:04-1] ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff) [09:10:07] (03PS4) 10Muehlenhoff: ferm: Absent the NRPE check when migrating from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283620 [09:13:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1039', diff saved to https://phabricator.wikimedia.org/P92361 and previous config saved to /var/cache/conftool/dbconfig/20260506-091337-fceratto.json [09:14:04] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2253: Replacing HW T418979 [09:14:04] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [09:14:04] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [09:14:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2253: Replacing HW T418979 [09:14:08] T418979: Productionize db225[0-3] - https://phabricator.wikimedia.org/T418979 [09:14:34] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:15:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms2 T418979ç', diff saved to https://phabricator.wikimedia.org/P92362 and previous config saved to /var/cache/conftool/dbconfig/20260506-091513-marostegui.json [09:15:34] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [09:15:37] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2005.codfw.wmnet with reason: update [09:16:34] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:16:53] !log ayounsi@cumin1003 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [09:17:22] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS trixie [09:17:25] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:23:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1039 (T419961)', diff saved to https://phabricator.wikimedia.org/P92363 and previous config saved to /var/cache/conftool/dbconfig/20260506-092345-fceratto.json [09:23:53] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4006/8 mgmt - ayounsi@cumin1003" [09:24:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1040.eqiad.wmnet with reason: Maintenance [09:24:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1040 (T419961)', diff saved to https://phabricator.wikimedia.org/P92364 and previous config saved to /var/cache/conftool/dbconfig/20260506-092414-fceratto.json [09:26:58] ayounsi@cumin1003 netbox (PID 501175) is awaiting input [09:27:08] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:27:44] (03PS1) 10Marostegui: db2253: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1283664 [09:27:58] (03PS3) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621 [09:28:33] PROBLEM - PyBal backends health check on lvs1019 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:28:46] (03CR) 10Marostegui: [C:03+2] db2253: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1283664 (owner: 10Marostegui) [09:29:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti4006/8 mgmt - ayounsi@cumin1003" [09:29:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:31:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff) [09:31:07] (03CR) 10Muehlenhoff: "This wasn't about about /usr/local/lib/nagios/plugins, but the /etc/nagios/nrpe.d cfg still present after moving from nftables->ferm. For " [puppet] - 10https://gerrit.wikimedia.org/r/1283620 (owner: 10Muehlenhoff) [09:31:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:32:20] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:32:46] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey) [09:35:33] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:35:41] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:35:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:37:21] (03CR) 10Elukey: "done!" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey) [09:38:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4006.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:38:41] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:40:04] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [09:40:33] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:45:03] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [09:46:33] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:47:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040 (T419961)', diff saved to https://phabricator.wikimedia.org/P92365 and previous config saved to /var/cache/conftool/dbconfig/20260506-094744-fceratto.json [09:48:49] (03PS2) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 
(https://phabricator.wikimedia.org/T425390) [09:49:00] (03CR) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [09:49:08] (03PS2) 10Daniel Kinzler: rest-gateway: add anon-app ratelimit class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) [09:49:17] jouncebot: nowandnext [09:49:17] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [09:49:17] In 0 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1000) [09:52:53] (03CR) 10JMeybohm: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey) [09:55:18] (03CR) 10Elukey: lvs: expose grpc port on ml-serve staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [09:55:51] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4008.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:57:13] (03CR) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (owner: 10Elukey) [09:57:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040', diff saved to https://phabricator.wikimedia.org/P92366 and previous config saved to /var/cache/conftool/dbconfig/20260506-095752-fceratto.json [09:59:51] jmm@cumin2002 provision (PID 2625414) is awaiting input [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1000) [10:01:05] (03PS6) 10Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 
(https://phabricator.wikimedia.org/T423986) [10:01:05] (03PS9) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [10:01:05] (03PS1) 10Tiziano Fogli: logstash/ecs: import ecs 1.11.0-8 template file [puppet] - 10https://gerrit.wikimedia.org/r/1283683 (https://phabricator.wikimedia.org/T423986) [10:02:45] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:03:38] (03CR) 10CI reject: [V:04-1] logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [10:03:50] (03PS2) 10Federico Ceratto: common, site, ferm: Remove dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) [10:05:01] (03CR) 10Federico Ceratto: "Updated removing the yaml file." 
[puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto) [10:05:58] (03CR) 10Tiziano Fogli: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [10:08:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040', diff saved to https://phabricator.wikimedia.org/P92367 and previous config saved to /var/cache/conftool/dbconfig/20260506-100800-fceratto.json [10:08:45] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:09:54] 06SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528 (10elukey) 03NEW [10:10:03] (03PS4) 10Elukey: confluent::kafka: introduce the super-user-client.properties for Kafka 3 [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (https://phabricator.wikimedia.org/T425528) [10:10:55] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4040.ulsfo.wmnet with OS trixie [10:11:53] (03CR) 10Elukey: [C:03+1] Switch pki2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283640 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [10:14:55] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:15:42] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11893589 (10ABran-WMF) Small update after another round of Pontoon testing. A few things changed while testing the patch: * Rs... 
[10:16:44] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS trixie [10:16:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:17:39] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS trixie [10:18:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1040 (T419961)', diff saved to https://phabricator.wikimedia.org/P92368 and previous config saved to /var/cache/conftool/dbconfig/20260506-101808-fceratto.json [10:18:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1048.eqiad.wmnet with reason: Maintenance [10:18:29] (03PS10) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [10:18:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1048 (T419961)', diff saved to https://phabricator.wikimedia.org/P92369 and previous config saved to /var/cache/conftool/dbconfig/20260506-101836-fceratto.json [10:18:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[10:18:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:20:55] (03CR) 10CI reject: [V:04-1] logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [10:22:19] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4006.ulsfo.wmnet'] [10:22:36] (03CR) 10Marostegui: [C:03+1] common, site, ferm: Remove dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto) [10:22:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4006.ulsfo.wmnet'] [10:23:12] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4006.ulsfo.wmnet'] [10:23:20] (03CR) 10Federico Ceratto: [C:03+2] common, site, ferm: Remove dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto) [10:26:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:27:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:27:19] (03PS1) 10Elukey: wikifunctions: use mesh for the evaluator endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) [10:29:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ganeti4006.ulsfo.wmnet'] [10:30:51] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [10:32:34] (03CR) 10Elukey: wikifunctions: use mesh for the evaluator endpoints (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [10:33:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4006.ulsfo.wmnet with OS bookworm [10:33:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[10:33:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:34:37] FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:35:53] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.28 ms [10:39:41] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [10:40:45] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage [10:44:16] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [10:45:46] (03PS1) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) [10:46:38] (03CR) 10CI reject: [V:04-1] translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [10:48:29] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage [10:53:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [10:54:08] (03CR) 10Nikerabbit: translate: add opensearch-ttmserver-test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 
(https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [10:56:01] (03PS2) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) [10:58:26] (03PS3) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) [10:58:53] (03CR) 10Atsuko: "set writable to false" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [10:59:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [11:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1100). [11:09:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1217.eqiad.wmnet with reason: Reboot [11:09:56] (03PS2) 10Effie Mouzeli: site.pp: add role for rdb2011 and rdb2012 [puppet] - 10https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) [11:10:36] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4042.ulsfo.wmnet with OS trixie [11:11:23] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:29] PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:29] PROBLEM - haproxy failover on dbproxy1029 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:29] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:29] 
PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:29] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:29] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:11:29] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:12:29] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [11:12:53] federico3: ^ [11:13:24] looking [11:14:36] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS trixie [11:18:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048 (T419961)', diff saved to https://phabricator.wikimedia.org/P92370 and previous config saved to /var/cache/conftool/dbconfig/20260506-111854-fceratto.json [11:19:42] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS trixie [11:20:16] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS trixie [11:20:44] (03CR) 10Blake: [C:03+1] site.pp: add role for rdb2011 and rdb2012 [puppet] - 10https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [11:20:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:21:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [11:22:03] (03CR) 10JMeybohm: [C:03+1] confluent::kafka: introduce the super-user-client.properties for Kafka 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283621 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [11:22:04] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: add role for rdb2011 and rdb2012 [puppet] - 10https://gerrit.wikimedia.org/r/1277429 (https://phabricator.wikimedia.org/T418261) (owner: 10Effie Mouzeli) [11:24:16] jmm@cumin2002 reimage (PID 2656424) is awaiting input [11:25:23] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:29] RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:29] RECOVERY - haproxy failover on dbproxy1029 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:29] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:29] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:29] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:29] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:29] RECOVERY - haproxy failover on dbproxy1024 is 
OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:29:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048', diff saved to https://phabricator.wikimedia.org/P92371 and previous config saved to /var/cache/conftool/dbconfig/20260506-112903-fceratto.json [11:30:27] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host rdb2011.codfw.wmnet with OS trixie [11:35:29] (03CR) 10Jforrester: wikifunctions: use mesh for the evaluator endpoints (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [11:36:46] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [11:39:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048', diff saved to https://phabricator.wikimedia.org/P92372 and previous config saved to /var/cache/conftool/dbconfig/20260506-113910-fceratto.json [11:41:36] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [11:42:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" 
[11:42:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4006.ulsfo.wmnet with OS bookworm [11:42:43] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage [11:43:17] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11893753 (10Jclark-ctr) ps1-b2-eqiad.mgmt.eqiad.wmnet #1: Sensor: Line, AA:L3, Current Value: 12.61 A (current) Thresholds: High: 12.5 #2: Sensor: Phase, AA:L2-L3,... [11:44:12] (03PS1) 10Marostegui: db1194: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283730 (https://phabricator.wikimedia.org/T425388) [11:44:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1194.eqiad.wmnet with reason: Reimage to Trixie [11:44:48] (03CR) 10Marostegui: [C:03+2] db1194: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1283730 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [11:44:48] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2011.codfw.wmnet with reason: host reimage [11:44:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1194: Reimage to Trixie [11:45:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2160.codfw.wmnet with reason: Reboot [11:45:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1194: Reimage to Trixie [11:47:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [11:49:16] marostegui@cumin1003 reimage (PID 593480) is awaiting input [11:49:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048 (T419961)', diff saved to https://phabricator.wikimedia.org/P92374 and previous config saved to 
/var/cache/conftool/dbconfig/20260506-114919-fceratto.json [11:50:03] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [11:50:37] !log installing openjdk-17 security updates [11:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:01] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:52:30] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11893777 (10Jclark-ctr) Rebalanced again focusing more between AA and AB Cords. Will monitor alerts [11:53:04] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage [11:56:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [11:57:28] (03CR) 10Elukey: wikifunctions: use mesh for the evaluator endpoints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [11:57:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2011.codfw.wmnet with reason: host reimage [11:58:47] (03PS1) 10Effie Mouzeli: idp_test: switch to rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976) [11:59:17] (03PS2) 10Effie Mouzeli: (DNM) idp_test: switch to rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976) [12:01:25] (03PS1) 10Ladsgroup: Close Polish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283735 
(https://phabricator.wikimedia.org/T421796) [12:01:59] (03CR) 10Jforrester: [C:03+1] Remove unused 'writeapi' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283106 (owner: 10Bartosz Dziewoński) [12:02:51] jouncebot: nowandnext [12:02:51] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [12:02:51] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1300) [12:02:53] (03PS2) 10Blake: (DNM) idp_test: switch to rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [12:03:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11893806 (10Jclark-ctr) @elukey I am unable to connect to kafka-logging1007 looks like it might not be setting bmc to static [12:03:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283735 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [12:05:00] (03Merged) 10jenkins-bot: Close Polish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283735 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [12:05:15] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:05:22] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283735|Close Polish Wikinews (T421796)]] [12:05:25] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [12:05:27] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 543090976 and 46 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:05:30] 
!log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host rdb2012.codfw.wmnet with OS trixie [12:06:27] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:07:15] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283735|Close Polish Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:07:40] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [12:07:46] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:09:17] PROBLEM - MariaDB Replica Lag: m3 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3609.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:09:41] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:11:50] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283735|Close Polish Wikinews (T421796)]] (duration: 06m 28s) [12:11:53] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [12:11:59] (03CR) 10Jelto: [C:03+2] miscweb: add emptyDir to wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283627 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [12:14:31] (03Merged) 10jenkins-bot: miscweb: add emptyDir to wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283627 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [12:14:44] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.34 ms [12:14:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb2011.codfw.wmnet with OS trixie [12:15:50] !log 
jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [12:16:20] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [12:16:39] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4050.ulsfo.wmnet with OS trixie [12:19:21] FIRING: [6x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:19:40] FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:20] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4048.ulsfo.wmnet with OS trixie [12:20:52] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS trixie [12:21:12] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2012.codfw.wmnet with reason: host reimage [12:22:40] (03CR) 10Slyngshede: [C:03+2] Hieradata: Update IPs for ULSFO lvs hosts in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283545 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede) [12:24:46] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2012.codfw.wmnet with reason: host reimage [12:26:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 790887976 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:27:23] (03CR) 10Daniel Kinzler: [C:04-1] rest gateway: defined anon-mediawiki class (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [12:28:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3532624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:15] (03PS3) 10Effie Mouzeli: idp_test: switch to rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976) [12:33:05] (03PS1) 10Jelto: miscweb: allow egress to text-lb for wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283743 (https://phabricator.wikimedia.org/T414405) [12:34:08] (03CR) 10Muehlenhoff: [C:03+2] d-i: Remove dhcpcd-base after installation completed [puppet] - 10https://gerrit.wikimedia.org/r/1280082 (https://phabricator.wikimedia.org/T414341) (owner: 10Muehlenhoff) [12:35:05] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage [12:36:38] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: update [12:37:33] (03CR) 10Jelto: [C:03+2] miscweb: allow egress to text-lb for wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283743 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [12:38:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage [12:39:06] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:40:02] (03Merged) 10jenkins-bot: miscweb: allow egress to text-lb for wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283743 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [12:42:06] (03PS1) 10Dpogorzelski: 1/3 ml-serve(grpc): etcd data for DNS Discovery [puppet] - 
10https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049) [12:42:08] (03PS1) 10Dpogorzelski: 2/3 ml-serve(grpc): add entry to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) [12:42:10] (03PS1) 10Dpogorzelski: 3/3 ml-serve(grpc): add service to k8s pools [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) [12:42:31] (03CR) 10Tiziano Fogli: [C:03+1] CoreRouterInterfaceDropPercent: fix ping disable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [12:42:37] (03CR) 10CI reject: [V:04-1] 1/3 ml-serve(grpc): etcd data for DNS Discovery [puppet] - 10https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:42:42] (03Abandoned) 10Dpogorzelski: lvs: expose grpc port on ml-serve staging [puppet] - 10https://gerrit.wikimedia.org/r/1282328 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:42:58] (03CR) 10CI reject: [V:04-1] 2/3 ml-serve(grpc): add entry to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:43:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb2012.codfw.wmnet with OS trixie [12:43:22] (03CR) 10CI reject: [V:04-1] 3/3 ml-serve(grpc): add service to k8s pools [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) (owner: 10Dpogorzelski) [12:43:42] jclark@cumin1003 provision (PID 639319) is awaiting input [12:45:12] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [12:45:34] PROBLEM - MariaDB Replica Lag: m3 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3463.97 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:45:49] 10ops-eqiad, 06SRE, 06DC-Ops, 
10fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11893942 (10Jgreen) [12:47:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11893955 (10Jgreen) [12:49:59] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS trixie [12:50:09] (03PS2) 10Dpogorzelski: ml-serve(grpc): step 1, etcd data for DNS Discovery [puppet] - 10https://gerrit.wikimedia.org/r/1283745 (https://phabricator.wikimedia.org/T424049) [12:50:09] (03PS2) 10Dpogorzelski: ml-serve(grpc): step 2, add entry to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1283746 (https://phabricator.wikimedia.org/T424049) [12:50:09] (03PS2) 10Dpogorzelski: ml-serve(grpc): step 3, add service to k8s pools [puppet] - 10https://gerrit.wikimedia.org/r/1283747 (https://phabricator.wikimedia.org/T424049) [12:51:16] (03PS1) 10Marostegui: Revert "db1194: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1283748 [12:52:12] (03CR) 10Marostegui: [C:03+2] Revert "db1194: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1283748 (owner: 10Marostegui) [12:53:34] RECOVERY - MariaDB Replica Lag: m3 on db2160 is OK: OK slave_sql_lag Replication lag: 53.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:57:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:57:44] (03PS1) 10Ladsgroup: Close French Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283751 (https://phabricator.wikimedia.org/T421796) [13:00:05] Urbanecm and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. 
You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1300). [13:00:05] alexsanford: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:47] Hey! I'll go ahead with the backport for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1283028 [13:00:49] (03PS1) 10Jelto: miscweb: bump wmf-navigator image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283753 (https://phabricator.wikimedia.org/T414405) [13:01:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS trixie [13:01:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283028 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford) [13:03:22] RECOVERY - MariaDB Replica Lag: m3 on db1217 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [13:05:29] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [13:05:51] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:05:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1194: after reimage to trixie [13:08:28] (03CR) 10Jelto: [C:03+2] miscweb: bump wmf-navigator image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283753 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [13:08:32] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [13:08:35] (03PS1) 10Gkyziridis: changeprop: Configure all wikis for revertrisk-multilingual events. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283758 (https://phabricator.wikimedia.org/T415892) [13:10:53] (03Merged) 10jenkins-bot: miscweb: bump wmf-navigator image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283753 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [13:11:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4008.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:12:40] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS trixie [13:12:54] (03Merged) 10jenkins-bot: Add messages related to mandatory 2FA for more groups [extensions/WikimediaMessages] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283028 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford) [13:13:20] !log alexsanford@deploy1003 Started scap sync-world: Backport for [[gerrit:1283028|Add messages related to mandatory 2FA for more groups (T423119)]] [13:13:23] T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119 [13:14:14] !log jelto@deploy1003 helmfile 
[aux-k8s-codfw] START helmfile.d/services/miscweb: apply [13:14:15] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [13:15:14] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:18:53] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for ganeti4008 mgmt - cmooney@cumin1003" [13:18:59] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entry for ganeti4008 mgmt - cmooney@cumin1003" [13:18:59] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:19:16] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ganeti4008.mgmt.ulsfo.wmnet on all recursors [13:19:24] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4046.ulsfo.wmnet with OS trixie [13:19:36] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) ganeti4008.mgmt.ulsfo.wmnet on all recursors [13:20:05] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4008.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:21:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:24:27] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:24:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4008.mgmt.ulsfo.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:25:45] (03CR) 10Bking: "Thanks for the tip! I'll take a look and see if that could help us out." 
[alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [13:26:37] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [13:26:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:27:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:27:28] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet'] [13:27:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet'] [13:27:57] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet'] [13:28:09] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet'] [13:28:31] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet'] [13:28:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4008.ulsfo.wmnet'] [13:30:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:31:07] !log alexsanford@deploy1003 alexsanford: Backport for [[gerrit:1283028|Add messages related to mandatory 2FA for more groups 
(T423119)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:31:10] T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119 [13:31:22] (03CR) 10Bking: "Note that the new OpenSearch cluster is password-protected (unlike the current cluster hosting the ttmserver indices)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:31:32] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [13:32:00] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:32:49] !log alexsanford@deploy1003 alexsanford: Continuing with deployment [13:34:21] FIRING: [4x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:41] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ganeti4008.ulsfo.wmnet on all recursors [13:35:01] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) ganeti4008.ulsfo.wmnet on all recursors [13:35:56] (03PS1) 10Jgreen: Add frdata-new-eqiad.wikimedia.org and PTR for 208.80.155.12 [dns] - 10https://gerrit.wikimedia.org/r/1283769 (https://phabricator.wikimedia.org/T425539) [13:35:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:36:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4008.ulsfo.wmnet with OS bookworm [13:37:29] (03PS3) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) [13:37:38] (03CR) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [13:38:02] (03Abandoned) 10Jgreen: nsca_frack_cfg.erb remove frqueue2002 and add frqueue2004 [puppet] - 10https://gerrit.wikimedia.org/r/1243894 (https://phabricator.wikimedia.org/T418393) (owner: 10Jgreen) [13:38:31] (03Abandoned) 10Jgreen: nsca_frack.cfg.erb create hostgroup fundraising-minio adding check-minio [puppet] - 10https://gerrit.wikimedia.org/r/1186566 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [13:39:25] (03CR) 10Jgreen: [C:03+2] Add frdata-new-eqiad.wikimedia.org and PTR for 208.80.155.12 [dns] - 10https://gerrit.wikimedia.org/r/1283769 (https://phabricator.wikimedia.org/T425539) (owner: 10Jgreen) [13:39:31] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [13:41:21] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage [13:42:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:44:02] !log jgreen@dns1004 START - running 
authdns-update [13:44:14] !log alexsanford@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283028|Add messages related to mandatory 2FA for more groups (T423119)]] (duration: 30m 53s) [13:44:16] T423119: FY25-26 Q4: Phase 2 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423119 [13:45:15] Done (backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1283028) [13:45:25] !log jgreen@dns1004 END - running authdns-update [13:45:29] (03PS1) 10Hashar: zuul: rename web_port > finger_port and set it in conf [puppet] - 10https://gerrit.wikimedia.org/r/1283768 [13:45:29] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/output/1283768/6674/zuul2002.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [13:45:32] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4046.ulsfo.wmnet with reason: host reimage [13:47:45] (03CR) 10Danielyepezgarces: [C:03+1] Enabling RSS extension for cowikimedia chapter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) (owner: 10Danielyepezgarces) [13:49:18] (03CR) 10Tchanders: [C:03+1] Add user_groups to editAttemptStep schema [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) (owner: 10Kosta Harlan) [13:50:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283751 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [13:50:55] (03Merged) 10jenkins-bot: Close French Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283751 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [13:51:22] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283751|Close French Wikinews (T421796)]] [13:51:23] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1194: after reimage to trixie [13:51:25] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [13:53:06] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: restart to test fixes from T425301 - bking@cumin2002 [13:53:09] T425301: The cloudelastic chi cluster is red - https://phabricator.wikimedia.org/T425301 [13:53:46] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:55:11] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS trixie [13:55:43] jouncebot: nowandnext [13:55:43] For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1300) [13:55:43] In 0 hour(s) and 4 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1400) [13:55:44] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283751|Close French Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:55:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) (owner: 10Kosta Harlan) [13:56:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage [13:56:13] kostajh: I'm closing two wikis [13:56:23] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [13:56:27] Amir1: cool, can you let me know when you’re done? [13:56:32] sure [13:58:16] (03PS1) 10Ladsgroup: Close German Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283783 (https://phabricator.wikimedia.org/T421796) [13:59:21] FIRING: [4x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:00:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1400) [14:01:12] (03PS1) 10Jelto: miscweb: make sure wmf-navigator entrypoint executes sync script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283784 (https://phabricator.wikimedia.org/T414405) [14:02:23] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum5001.eqsin.wmnet [14:02:50] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283751|Close French Wikinews (T421796)]] (duration: 11m 28s) [14:02:53] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [14:03:34] 
10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516#11894362 (10Papaul) @Jhancock.wm when you have time can you please look and see if there are any bad disks on this server? Thanks [14:04:43] (03CR) 10Jelto: [C:03+2] miscweb: make sure wmf-navigator entrypoint executes sync script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283784 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:04:47] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-04-29-001940 to 2026-05-05-223522 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283785 (https://phabricator.wikimedia.org/T414062) [14:06:13] (03CR) 10David Martin: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-04-29-001940 to 2026-05-05-223522 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283785 (https://phabricator.wikimedia.org/T414062) (owner: 10Jforrester) [14:06:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283783 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [14:06:58] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:58] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:58] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:58] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 
10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:58] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:06:58] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:07:11] (03Merged) 10jenkins-bot: miscweb: make sure wmf-navigator entrypoint executes sync script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283784 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:07:58] (03Merged) 10jenkins-bot: Close German Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283783 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [14:08:21] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:08:23] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283783|Close German Wikinews (T421796)]] [14:08:26] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [14:08:32] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-04-29-001940 to 2026-05-05-223522 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283785 (https://phabricator.wikimedia.org/T414062) (owner: 10Jforrester) [14:08:33] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:09:06] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-04-21-184122 to 2026-05-05-223640 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283786 (https://phabricator.wikimedia.org/T414062) [14:09:16] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:09:59] !log dmartin@deploy1003 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [14:10:17] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283783|Close German Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:10:38] !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:10:40] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:10:53] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [14:11:24] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4046.ulsfo.wmnet with OS trixie [14:11:49] !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:12:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:12:45] !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:13:07] !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:13:53] !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:14:49] (03CR) 10Slyngshede: [C:03+1] idp_test: switch to rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1283733 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [14:15:04] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283783|Close German Wikinews (T421796)]] (duration: 06m 40s) [14:15:07] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [14:15:11] kostajh: over to you [14:15:18] (03CR) 10David Martin: [C:03+2] wikifunctions: Upgrade evaluators from 
2026-04-21-184122 to 2026-05-05-223640 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283786 (https://phabricator.wikimedia.org/T414062) (owner: 10Jforrester) [14:15:18] Amir1: thanks [14:15:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:15:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:15:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum5001.eqsin.wmnet [14:15:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11894493 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum5001.eqsin.wmnet` - durum5001.eqsin.wmnet (**PASS**)... [14:15:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) (owner: 10Kosta Harlan) [14:16:22] (03CR) 10Bartosz Dziewoński: [C:03+1] rest-gateway: generalize class overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) (owner: 10Daniel Kinzler) [14:16:51] (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 (owner: 10Daniel Kinzler) [14:16:54] (03PS1) 10Joal: [WIP] Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 [14:17:25] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-04-21-184122 to 2026-05-05-223640 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283786 (https://phabricator.wikimedia.org/T414062) 
(owner: 10Jforrester) [14:17:30] (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [14:18:32] !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:18:48] (03CR) 10CI reject: [V:04-1] [WIP] Add auth_proxy.httpd_cas module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283791 (owner: 10Joal) [14:19:08] (03Merged) 10jenkins-bot: Add user_groups to editAttemptStep schema [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) (owner: 10Kosta Harlan) [14:19:29] !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:19:35] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1283050|Add user_groups to editAttemptStep schema (T424010)]] [14:19:38] T424010: Collect performer implicit groups in editattemptstep for hCaptcha rollout - https://phabricator.wikimedia.org/T424010 [14:19:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [14:20:13] (03CR) 10Ayounsi: [C:03+1] "Those rules are getting out of hands but let's give it a try !" 
[alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [14:20:19] !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:20:40] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [14:20:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [14:20:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4008.ulsfo.wmnet with OS bookworm [14:21:04] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum5002.eqsin.wmnet [14:21:27] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1283050|Add user_groups to editAttemptStep schema (T424010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:21:30] !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:21:54] !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:22:43] !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:23:08] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [14:23:11] (03CR) 10Ssingh: [C:03+1] Hieradata: Update IPs for ULSFO dns host in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283548 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede) [14:24:29] jmm@cumin2002 decommission (PID 2811622) is awaiting input [14:24:53] (03PS1) 10Jelto: miscweb: make sure custom sidecar also has config.private environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283794 (https://phabricator.wikimedia.org/T414405) [14:25:28] !log sudo cumin "C:bird" "disable-puppet 'merging CR 1282958'" [14:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:26] PROBLEM - Host cp4052 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:44] !log kharlan@deploy1003 kharlan: Continuing with deployment [14:27:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:28:26] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [14:29:26] (03PS3) 10Cwhite: add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1430) [14:30:49] (03CR) 10Jelto: [C:03+2] miscweb: make sure custom sidecar also has config.private environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283794 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:30:51] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283050|Add user_groups to editAttemptStep schema (T424010)]] (duration: 11m 16s) [14:30:54] T424010: Collect performer implicit groups in editattemptstep for hCaptcha rollout - https://phabricator.wikimedia.org/T424010 [14:31:01] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: testing bird change] [14:31:09] (03CR) 10Ssingh: [C:03+2] Bird: use the GUA v6 gateway instead of link-local [puppet] - 10https://gerrit.wikimedia.org/r/1282958 (owner: 10Ayounsi) [14:31:28] RECOVERY - Host cp4052 is UP: PING OK - Packet loss = 0%, RTA = 71.15 ms [14:31:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11894622 (10jijiki) >>! In T418916#11859808, @jijiki wrote: >>>! In T418916#11859537, @MLechvien-WMF wrote: >>> @Clement_Goubert i am having issues with these failing t... [14:31:34] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, and 2 others: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#11894623 (10BTullis) a:03BTullis Assigning this to myself, s... [14:32:09] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11894633 (10jijiki) >>! 
In T418922#11869335, @Jclark-ctr wrote: > @jhancock.wm eqiad servers failed install also. @jijiki when you make change can... [14:33:19] (03Merged) 10jenkins-bot: miscweb: make sure custom sidecar also has config.private environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283794 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:33:22] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:34:21] FIRING: [4x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:34:47] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: testing bird change] [14:35:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:36:13] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:36:49] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:37:22] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:37:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11894685 (10MoritzMuehlenhoff) [14:40:22] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet, 
wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:40:40] FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:41:22] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:41:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:42:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11894701 (10MoritzMuehlenhoff) [14:43:14] (03PS1) 10Jelto: misweb: also mount secrets in wmf-navigator data-sync sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283804 (https://phabricator.wikimedia.org/T414405) [14:45:12] (03CR) 10Slyngshede: [C:03+2] Hieradata: Update IPs for ULSFO dns host in rack 23 [puppet] - 10https://gerrit.wikimedia.org/r/1283548 (https://phabricator.wikimedia.org/T424686) (owner: 10Slyngshede) [14:45:25] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1077.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:46:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:46:59] (03CR) 10Jelto: [C:03+2] misweb: also mount secrets in wmf-navigator data-sync sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283804 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:47:18] jmm@cumin2002 decommission (PID 2811622) is awaiting input [14:48:06] (03PS1) 10Ladsgroup: Close Chinese Wikinews [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1283805 (https://phabricator.wikimedia.org/T421796) [14:49:39] (03Merged) 10jenkins-bot: misweb: also mount secrets in wmf-navigator data-sync sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283804 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:50:42] jouncebot: nowandnext [14:50:42] For the next 0 hour(s) and 9 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1400) [14:50:42] For the next 0 hour(s) and 9 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1430) [14:50:43] In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1700) [14:51:21] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:51:24] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:53:21] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:53:45] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS trixie [14:54:30] jmm@cumin2002 decommission (PID 2811622) is awaiting input [14:54:39] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:55:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:55:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[14:55:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum5002.eqsin.wmnet
[14:55:20] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11894788 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum5002.eqsin.wmnet` - durum5002.eqsin.wmnet (**PASS**)...
[14:55:53] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1121233768 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:57:21] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1283805 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[14:57:40] ops-codfw, SRE, DBA, DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516#11894823 (Jhancock.wm) Open→Resolved a:Jhancock.wm HWC8010 this is usually caused by a firmware update that didn't take for whatever reason. F2 > iDRAC Settings > Hardware Configurati...
[14:58:07] (CR) Bking: "I took at look at what that would involve and I think I will avoid it for now since we need the alerts for a major upcoming migration ( T4" [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: Bking)
[14:58:21] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:58:43] (PS2) Bking: data-platform: Add alerts for cirrus memory or I/O stalls [alerts] - https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T425301)
[14:59:02] (Merged) jenkins-bot: Close Chinese Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1283805 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[14:59:25] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283805|Close Chinese Wikinews (T421796)]]
[14:59:28] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[14:59:53] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7433008 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:01:26] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283805|Close Chinese Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:01:52] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[15:01:56] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2157 - https://phabricator.wikimedia.org/T425242#11894855 (Jhancock.wm) Open→Resolved
[15:02:26] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[15:02:52] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[15:04:00] SRE-SLO: Sloth dashboard performance improvement - https://phabricator.wikimedia.org/T425564 (herron) NEW
[15:04:51] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11894874 (phaultfinder)
[15:06:06] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283805|Close Chinese Wikinews (T421796)]] (duration: 06m 41s)
[15:06:11] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[15:08:32] !log jasmine@cumin2002 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-main-eqiad cluster: Change Confluent distribution.
[15:08:39] !log sudo cumin -b1 -s5 "C:bird and not dns4004*" "run-puppet-agent --enable 'merging CR 1282958'" [15:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:10] (03PS1) 10Cwhite: logstash: filter_on_templates - handle unknown data types [puppet] - 10https://gerrit.wikimedia.org/r/1283810 [15:09:25] FIRING: [4x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:28] (03CR) 10Jasmine: [C:03+2] kafka-main: add eqiad-main cluster brokers to Confluent distro 77 (3.7) [puppet] - 10https://gerrit.wikimedia.org/r/1282999 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [15:11:46] jasmine@cumin2002 change-confluent-distro-version (PID 2842433) is awaiting input [15:13:39] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565 (10catherine.kelsey.wmde) 03NEW [15:13:56] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. 
for catherinekelsey - https://phabricator.wikimedia.org/T425566 (10catherine.kelsey.wmde) 03NEW [15:14:13] (03CR) 10Gehel: [C:04-1] data-platform: Add alerts for cirrus memory or I/O stalls (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T425301) (owner: 10Bking) [15:14:45] (03PS1) 10Cwhite: add unknown_data_type to normalized object [software/ecs] - 10https://gerrit.wikimedia.org/r/1283814 [15:14:55] PROBLEM - Check unit status of ipip-multiqueue-optimizer on lvs4008 is CRITICAL: CRITICAL: Status of the systemd unit ipip-multiqueue-optimizer https://wikitech.wikimedia.org/wiki/LVS%23IPIP_encapsulation_experiments [15:15:29] (03PS2) 10Cwhite: logstash: filter_on_templates - handle unknown data types [puppet] - 10https://gerrit.wikimedia.org/r/1283810 [15:18:07] (03PS4) 10Cwhite: add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) [15:18:35] (03PS5) 10Cwhite: add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) [15:19:22] (03CR) 10Gehel: [C:04-1] "Minor comment on naming. I don't know enough about the metrics themselves to validate that part." [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T425301) (owner: 10Bking) [15:25:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [15:30:45] !log jasmine@cumin2002 END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Change Confluent distribution for Kafka A:kafka-main-eqiad cluster: Change Confluent distribution. 
[15:30:47] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 225.03 ms [15:34:07] (03PS3) 10Bking: data-platform: Add alerts for cirrus memory or I/O stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) [15:35:11] (03CR) 10Bking: data-platform: Add alerts for cirrus memory or I/O stalls (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [15:35:26] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS bookworm [15:36:46] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [15:37:09] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:22] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bookworm [15:43:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:46:37] PROBLEM - Check unit status of ipip-multiqueue-optimizer on lvs4010 is CRITICAL: CRITICAL: Status of the systemd unit ipip-multiqueue-optimizer https://wikitech.wikimedia.org/wiki/LVS%23IPIP_encapsulation_experiments [15:47:13] (03PS1) 10Scott French: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283171 [15:50:16] (03PS2) 10Elukey: wikifunctions: use mesh for the evaluator endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) [15:50:24] (03CR) 10Elukey: wikifunctions: use mesh for the 
evaluator endpoints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [15:54:21] RESOLVED: [3x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:55:22] (03PS1) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) [15:56:50] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:57:25] (03CR) 10Jforrester: wikifunctions: use mesh for the evaluator endpoints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [15:57:27] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [15:57:39] (03CR) 10Jforrester: [C:03+1] "LGTM. Is this good to deploy?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [15:58:33] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [16:01:33] (03CR) 10CDanis: profile::cache::haproxy: add webrequest-based ip reputation data (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:01:44] (03CR) 10Elukey: "It is yes, but we should be careful and test it in staging first, then on one DC etc.. I can help/assist when we do it! 
I tested the vario" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [16:03:00] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 221.97 ms [16:04:51] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [16:06:12] (03CR) 10Cwhite: [C:03+2] update pyyaml in dev [software/ecs] - 10https://gerrit.wikimedia.org/r/1280733 (owner: 10Cwhite) [16:06:40] (03Merged) 10jenkins-bot: update pyyaml in dev [software/ecs] - 10https://gerrit.wikimedia.org/r/1280733 (owner: 10Cwhite) [16:07:36] PROBLEM - Recursive DNS on 198.35.26.34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [16:07:45] yeah no worries [16:09:03] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [16:09:04] 06SRE, 06collaboration-services, 10Observability-Alerting, 06SRE Observability (FY2025/2026-Q1): create a new place for prometheus/alertmanager checks not tied to physical machines - https://phabricator.wikimedia.org/T397264#11895315 (10Dzahn) please also see T361090#11804366 [16:09:21] FIRING: [5x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:12:36] PROBLEM - Recursive DNS on 2620:0:863:2:198:35:26:34 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [16:21:09] (03PS4) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) [16:22:21] (03CR) 10CI reject: [V:04-1] translate: add opensearch-ttmserver-test [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [16:23:46] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:25:39] (03PS2) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) [16:27:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:27:18] (03CR) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:27:48] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:28:16] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:28:44] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:28:47] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS bookworm [16:28:56] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:29:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:29:21] FIRING: [3x] JobUnavailable: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:29:42] (03CR) 10AKhatun: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282300 (https://phabricator.wikimedia.org/T425336) (owner: 10JavierMonton) [16:29:48] (03PS3) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) [16:30:00] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:30:25] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282300 (https://phabricator.wikimedia.org/T425336) (owner: 10JavierMonton) [16:30:34] RECOVERY - Recursive DNS on 2620:0:863:2:198:35:26:34 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [16:30:34] RECOVERY - Recursive DNS on 198.35.26.34 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [16:31:13] (03CR) 10Dzahn: "it would break puppet on the main nodes (fine on executor nodes) because the base class 
is applied on both. don't worry though, I will ame" [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [16:32:29] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282300 (https://phabricator.wikimedia.org/T425336) (owner: 10JavierMonton) [16:33:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:34:21] RESOLVED: [3x] JobUnavailable: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:37:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4004.wikimedia.org with OS bookworm [16:37:16] (03CR) 10RLazarus: [C:03+1] shellbox: Pick up newly rebuilt images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283171 (owner: 10Scott French) [16:39:28] (03CR) 10RLazarus: [C:03+1] shellbox: Pick up newly rebuilt images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283171 (owner: 10Scott French) [16:39:36] 06SRE, 10SRE-Access-Requests: logstash access - https://phabricator.wikimedia.org/T425575 (10KineticPelagic) 03NEW [16:39:55] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:39:59] (03PS5) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) [16:40:08] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS trixie [16:40:47] (03CR) 10CI reject: [V:04-1] translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [16:45:04] (03PS1) 10Andrew Bogott: magnum.conf: set nova-api compatibility to v2.15 [puppet] - 10https://gerrit.wikimedia.org/r/1283834 (https://phabricator.wikimedia.org/T393782) [16:45:15] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:46:48] (03PS1) 10Elukey: profile::kafka::mirror::alerts: fix max lag's group label [puppet] - 10https://gerrit.wikimedia.org/r/1283837 [16:46:53] (03CR) 10Cathal Mooney: CoreRouterInterfaceDropPercent: fix ping disable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [16:46:56] (03CR) 10Cathal Mooney: [C:03+2] CoreRouterInterfaceDropPercent: fix ping disable [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [16:48:07] (03PS6) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) [16:48:30] (03CR) 10Cathal Mooney: [C:03+2] "indeed yeah it's quite cumbersome just to make the linter happy." 
[alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [16:48:35] (03Merged) 10jenkins-bot: CoreRouterInterfaceDropPercent: fix ping disable [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [16:49:42] (03CR) 10CI reject: [V:04-1] translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [16:52:37] !log rebooting asw1-22-ulsfo to upgrade SR-Linux OS on switch T408892 [16:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:40] T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892 [16:53:03] 06SRE, 10SRE-Access-Requests: logstash access - https://phabricator.wikimedia.org/T425575#11895517 (10Aklapper) Hi @KineticPelagic, https://wikitech.wikimedia.org/wiki/Logstash#Authentication implies that you could request access to Logstash through the IDM tool. If that works for you, then please feel free to... [16:53:22] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-22-ulsfo,asw1-22-ulsfo IPv6 with reason: upgrading sr-linux on asw1-23-ulsfo [16:58:20] (03CR) 10JMeybohm: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1283837 (owner: 10Elukey) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1700) [17:00:06] PROBLEM - Host durum4004 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:10] PROBLEM - Host doh4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:10] PROBLEM - Host doh4004 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:13] o/ [17:00:14] PROBLEM - Host bast4006 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:16] PROBLEM - Host hcaptcha-proxy4004 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:16] PROBLEM - Host hcaptcha-proxy4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:35] "expected" [17:00:40] PROBLEM - Host ncredir4004 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:40] PROBLEM - Host install4004 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:46] PROBLEM - Host durum4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:46] PROBLEM - Host netflow4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:46] PROBLEM - Host prometheus4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:48] PROBLEM - Host tcp-proxy4004 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:48] PROBLEM - Host tcp-proxy4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:50] PROBLEM - Host ncredir4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:53] we should probably extend the downtime [17:01:14] RECOVERY - Host doh4003 is UP: PING OK - Packet loss = 0%, RTA = 71.57 ms [17:01:14] RECOVERY - Host durum4003 is UP: PING OK - Packet loss = 0%, RTA = 71.53 ms [17:01:14] RECOVERY - Host install4004 is UP: PING OK - Packet loss = 0%, RTA = 71.54 ms [17:01:14] RECOVERY - Host prometheus4003 is UP: PING OK - Packet loss = 0%, RTA = 74.04 ms [17:01:16] RECOVERY - Host hcaptcha-proxy4004 is UP: PING OK - Packet loss = 0%, RTA = 71.55 ms [17:01:16] RECOVERY - Host tcp-proxy4003 is UP: PING OK - Packet loss = 0%, RTA = 71.49 ms 
[17:01:16] RECOVERY - Host tcp-proxy4004 is UP: PING OK - Packet loss = 0%, RTA = 71.35 ms [17:01:24] RECOVERY - Host doh4004 is UP: PING WARNING - Packet loss = 90%, RTA = 73.88 ms [17:01:25] (03CR) 10Scott French: [C:03+2] shellbox: Pick up newly rebuilt images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283171 (owner: 10Scott French) [17:01:26] RECOVERY - Host ncredir4004 is UP: PING OK - Packet loss = 0%, RTA = 75.95 ms [17:01:26] RECOVERY - Host ncredir4003 is UP: PING OK - Packet loss = 0%, RTA = 71.26 ms [17:01:34] RECOVERY - Host durum4004 is UP: PING OK - Packet loss = 0%, RTA = 71.72 ms [17:01:42] RECOVERY - Host bast4006 is UP: PING OK - Packet loss = 0%, RTA = 71.52 ms [17:01:48] RECOVERY - Host hcaptcha-proxy4003 is UP: PING OK - Packet loss = 0%, RTA = 71.56 ms [17:02:12] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [17:02:14] RECOVERY - Host netflow4003 is UP: PING OK - Packet loss = 0%, RTA = 71.47 ms [17:02:15] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 39 hosts with reason: ulsfo depooled for switch work [17:02:28] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11895570 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a4b7dc3f-da06-4cb4-8580-9dac41f4da23) set by sukhe@cumin1003 for 3 days, 0:00:00... 
[17:03:49] (03Merged) 10jenkins-bot: shellbox: Pick up newly rebuilt images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283171 (owner: 10Scott French) [17:04:25] FYI, I'll be deploying shellbox services shortly [17:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [17:05:44] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:06:25] (03PS7) 10Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) [17:06:46] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [17:06:52] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:07:04] (03CR) 10Atsuko: "added credentials to private/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 
10Atsuko) [17:07:12] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:07:13] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:07:23] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:07:25] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:07:37] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:07:38] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:07:54] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:07:55] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [17:08:14] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [17:08:15] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:08:37] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:08:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:10:28] 10ops-esams, 06SRE, 06Commons, 06DC-Ops, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11895597 (10ssingh) >>! In T425216#11887650, @TheDJ wrote: > I believe there is a 24 hourly script that checks cross dc consistency or something The... 
[17:10:44] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:14:14] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:14:51] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:15:22] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:15:54] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:16:26] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:16:40] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:17:11] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:17:30] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:18:01] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:18:25] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:18:57] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:20:01] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:27:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:27:45] !log cmooney@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-23-ulsfo,asw1-23-ulsfo IPv6 with reason: upgrading sr-linux on asw1-23-ulsfo [17:27:55] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11895651 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cc7686ab-d152-4291-9303-296008017c88) set by cmooney@cumin1003 for 1:00:00 on 2... [17:28:11] !log rebooting asw1-23-ulsfo to upgrade SR-Linux OS on switch T408892 [17:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:14] T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892 [17:31:35] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:31:46] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:32:23] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:32:54] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:33:35] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:33:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/56 {#G24090478750000318}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:34:07] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:34:22] !log swfrench@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/shellbox-media: apply [17:34:53] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:35:11] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:35:42] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [17:35:44] FIRING: [4x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:36:06] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:36:37] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:36:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:37:40] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:38:13] (03PS2) 10Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [17:39:21] FIRING: [16x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:41:13] (03PS3) 10Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [17:42:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:45:36] (03PS4) 10Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [17:48:11] (03CR) 10Andrew Bogott: [C:03+2] magnum.conf: set nova-api compatibility to v2.15 [puppet] - 10https://gerrit.wikimedia.org/r/1283834 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:49:52] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [17:49:54] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [17:49:54] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [17:49:54] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [17:49:54] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:49:54] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 750 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [17:54:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [17:55:29] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [17:55:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [17:55:51] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device asw1-23-ulsfo [17:59:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: restart to test fixes from T425301 - bking@cumin2002 [17:59:13] T425301: The cloudelastic chi cluster is red - https://phabricator.wikimedia.org/T425301 [17:59:24] jouncebot: nowandnext [17:59:24] For the next 0 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1700) [17:59:25] In 0 hour(s) and 0 minute(s): MediaWiki train - Utc-7 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1800) [17:59:44] I wait then :D [18:00:05] brennen and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T1800). nyaa~ [18:00:08] o/ [18:01:31] currently blocking on T425582 / T425475 [18:01:32] T425582: DB schema change in production - for ce_event_contributions - https://phabricator.wikimedia.org/T425582 [18:01:32] T425475: EventContribution: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cec.cec_references_delta' in 'SELECT' - https://phabricator.wikimedia.org/T425475 [18:01:52] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-23-ulsfo [18:02:15] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-23-ulsfo [18:05:27] (03CR) 10VolkerE: [C:03+1] microsites: adjust monitoring string for design.wikimedia.org, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/1283213 (https://phabricator.wikimedia.org/T329991) (owner: 10Dzahn) [18:07:10] (03PS1) 10CDanis: Revert "haproxy: webrequest: capture ratelimiting headers" [puppet] - 10https://gerrit.wikimedia.org/r/1283858 [18:08:19] (03PS5) 10Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [18:12:30] (03PS1) 10CDanis: benthos/webrequest: fix rl_class/trusted_req names [puppet] - 10https://gerrit.wikimedia.org/r/1283862 (https://phabricator.wikimedia.org/T419736) [18:13:13] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283862 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis) [18:13:46] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure 
[18:13:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/56 {#G24090478750000318}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:14:21] RESOLVED: [16x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:15:44] FIRING: [4x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:19:10] (03PS6) 10Dzahn: zuul: rename web_port > finger_port and set it in conf [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [18:24:05] (03CR) 10Dzahn: [V:03+1] "works now: https://puppet-compiler.wmflabs.org/output/1283768/8524/" [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [18:24:23] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-05-05-223522 to 2026-05-06-154732 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283868 [18:24:37] (03CR) 10Dzahn: [C:03+2] zuul: rename web_port > finger_port and set it in conf [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [18:26:03] (03PS1) 10Dduvall: zuul: Disallow job definitions in untrusted projects [puppet] - 10https://gerrit.wikimedia.org/r/1283870 [18:27:10] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1283858 (owner: 10CDanis) [18:27:44] (03PS2) 10CDanis: Revert "haproxy: webrequest: capture ratelimiting headers" [puppet] - 10https://gerrit.wikimedia.org/r/1283858 [18:28:22] (03PS3) 10CDanis: Revert "haproxy: webrequest: capture ratelimiting headers" [puppet] - 10https://gerrit.wikimedia.org/r/1283858 [18:28:57] (03CR) 10Dzahn: [C:03+2] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1283768 (owner: 10Hashar) [18:29:22] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-05-05-223522 to 2026-05-06-154732 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283868 (owner: 10Jforrester) [18:29:44] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bookworm [18:30:44] (03PS1) 10Ladsgroup: Close English Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283872 (https://phabricator.wikimedia.org/T421796) [18:31:42] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-05-05-223522 to 2026-05-06-154732 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283868 (owner: 10Jforrester) [18:31:54] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS trixie [18:32:58] !log 1.47.0-wmf.1 train status (T423910): blockers resolved, rolling to group1 [18:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:01] T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910 [18:33:30] (03CR) 10Dzahn: [C:03+2] zuul: Disallow job definitions in untrusted projects [puppet] - 10https://gerrit.wikimedia.org/r/1283870 (owner: 10Dduvall) [18:34:22] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283873 (https://phabricator.wikimedia.org/T423910) [18:34:24] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy1003" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1283873 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [18:35:19] (03CR) 10Dzahn: [C:03+2] add discovery names for gitlab [dns] - 10https://gerrit.wikimedia.org/r/1282437 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [18:35:45] !log dzahn@dns1005 START - running authdns-update [18:36:38] (03Merged) 10jenkins-bot: group1 to 1.47.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283873 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [18:37:16] !log dzahn@dns1005 END - running authdns-update [18:38:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:39:33] (03CR) 10Snwachukwu: "Sure, I'm available to deploy once this is merged." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281589 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:39:55] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:40:16] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:40:34] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:40:55] FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:41:15] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:42:14] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:42:48] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.47.0-wmf.1 refs T423910 [18:42:52] !log jforrester@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/wikifunctions: apply [18:42:53] T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910 [18:44:11] (03CR) 10Jforrester: [C:03+2] wikifunctions: use mesh for the evaluator endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [18:44:56] Amir1: maybe give it 5 minutes for the train to settle at group1 and if i don't yell by then feel free to use the rest of the window. [18:45:06] ooooh, nice [18:45:07] Thanks! [18:45:27] sure thing [18:46:20] (03Merged) 10jenkins-bot: wikifunctions: use mesh for the evaluator endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283701 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [18:46:32] (03PS1) 10DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283875 (https://phabricator.wikimedia.org/T329991) [18:47:33] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:47:50] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:48:50] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [18:49:00] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie [18:49:12] (03PS1) 10Jforrester: Revert "wikifunctions: use mesh for the evaluator endpoints" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283877 [18:49:17] (03CR) 10Jforrester: [C:03+2] Revert "wikifunctions: use mesh for the evaluator endpoints" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283877 (owner: 10Jforrester) [18:50:40] (03CR) 10Jforrester: [C:03+2] Wikifunctions: Turn on import of references inside Wikidata statements [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275996 (https://phabricator.wikimedia.org/T404652) (owner: 10David 
Martin) [18:50:48] (03CR) 10DDesouza: [C:03+2] miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283875 (https://phabricator.wikimedia.org/T329991) (owner: 10DDesouza) [18:51:25] (03Merged) 10jenkins-bot: Revert "wikifunctions: use mesh for the evaluator endpoints" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283877 (owner: 10Jforrester) [18:52:56] (03Merged) 10jenkins-bot: Wikifunctions: Turn on import of references inside Wikidata statements [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275996 (https://phabricator.wikimedia.org/T404652) (owner: 10David Martin) [18:53:13] (03Merged) 10jenkins-bot: miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283875 (https://phabricator.wikimedia.org/T329991) (owner: 10DDesouza) [18:53:47] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:53:58] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:54:14] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:54:47] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [18:54:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [18:55:08] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:55:22] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:55:54] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:58:55] (03CR) 10Eevans: [V:03+2 C:03+2] "@snwachukwu@wikimedia.org awesome, whenever you are ready." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1281589 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:59:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [18:59:09] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [18:59:22] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:59:23] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:59:35] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:59:36] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:59:48] (03CR) 10Eevans: [C:03+2] revise-tone-task-generator: updated list of aqs cassandra nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281588 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:59:51] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:01:09] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [19:01:39] (03CR) 10Eevans: [C:03+2] "This has now been merged. 
There is no rush to deploy, but the longer that you wait, the more likely it is that someone will be surprised " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281588 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [19:05:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [19:09:40] FIRING: [2x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:10:34] (03CR) 10Ebernhardson: [C:03+1] "This seems reasonable to me. I checked the client library we use and for the auth credentials it expects the keys `auth_type`, `username`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [19:14:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bookworm [19:23:24] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4050.ulsfo.wmnet with OS trixie [19:24:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1012.eqiad.wmnet with OS trixie [19:28:45] (03PS1) 10Bking: cloudelastic1012: remove host-specific overrides [puppet] - 10https://gerrit.wikimedia.org/r/1283903 (https://phabricator.wikimedia.org/T425300) [19:29:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283903 (https://phabricator.wikimedia.org/T425300) (owner: 10Bking) [19:30:56] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS trixie [19:44:03] 06SRE, 10DNS, 10Domains, 06Traffic-Icebox, 07HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - 
https://phabricator.wikimedia.org/T215071#11896068 (10Novem_Linguae) Should this be closed in favor of {T205378}? Seems like pretty soon browsers and servers will support technology that co... [19:45:13] (03CR) 10Bking: [C:03+2] cloudelastic1012: remove host-specific overrides [puppet] - 10https://gerrit.wikimedia.org/r/1283903 (https://phabricator.wikimedia.org/T425300) (owner: 10Bking) [19:46:09] 06SRE, 06Traffic, 06Traffic-Icebox, 07HTTPS, 07Upstream: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11896070 (10Novem_Linguae) [19:56:07] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1283862 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis) [19:57:09] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie [19:58:02] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2000). Please do the needful. [20:00:05] SomeRandomDev: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1012.eqiad.wmnet with OS trixie [20:00:10] Hi [20:03:25] (03CR) 10CDanis: [C:03+2] benthos/webrequest: fix rl_class/trusted_req names [puppet] - 10https://gerrit.wikimedia.org/r/1283862 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis) [20:10:51] anybody here for the deployment window? 
[20:12:03] RoanKattouw urbanecm TheresNoTime kindrobot cjming (friendly reping ^^) [20:12:39] Sorry about that, I can deploy [20:12:44] thanks! [20:13:09] * TheresNoTime is also about if needed [20:13:37] it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1281526; not 100% sure how I would test it, but it shouldn't change anything in theory [20:14:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS trixie [20:14:23] 10ops-eqiad, 06DC-Ops: Unresponsive management for an-presto1006.mgmt:22 - https://phabricator.wikimedia.org/T425590 (10phaultfinder) 03NEW [20:16:20] I see, I can test that by looking at the Logstash feeds this code logs to [20:16:39] ah, great [20:18:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281526 (https://phabricator.wikimedia.org/T336703) (owner: 10SomeRandomDeveloper) [20:19:06] (03Merged) 10jenkins-bot: Replace use of $wgRequest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281526 (https://phabricator.wikimedia.org/T336703) (owner: 10SomeRandomDeveloper) [20:19:31] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1281526|Replace use of $wgRequest (T336703)]] [20:19:34] T336703: Replace use of $wgRequest in wmf-config (CommonSettings.php / throttle-analyze.php) - https://phabricator.wikimedia.org/T336703 [20:21:31] !log catrope@deploy1003 catrope, somerandomdeveloper: Backport for [[gerrit:1281526|Replace use of $wgRequest (T336703)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:21:50] (03CR) 10ArielGlenn: [C:03+1] "Basically good, one question inline, one tiny nit also" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) (owner: 10Daniel Kinzler) [20:24:29] !log catrope@deploy1003 catrope, somerandomdeveloper: Continuing with deployment [20:25:54] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [20:28:43] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281526|Replace use of $wgRequest (T336703)]] (duration: 09m 12s) [20:28:46] T336703: Replace use of $wgRequest in wmf-config (CommonSettings.php / throttle-analyze.php) - https://phabricator.wikimedia.org/T336703 [20:29:25] Thank you! [20:29:26] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:29:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1012.eqiad.wmnet with reason: host reimage [20:32:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896201 (10VRiley-WMF) [20:33:37] (03CR) 10ArielGlenn: rest gateway: defined anon-mediawiki class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [20:33:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/1/5 (Transport: cr2-codfw:et-0/1/4 (Lumen, 449169461) {#changeme_lumen_patch}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:38:51] FIRING: [2x] CoreRouterInterfaceDown: Core 
router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:44:03] 06SRE, 10dev-images, 06Infrastructure-Foundations, 06Release-Engineering-Team (Priority Backlog 📥): Rebuild dev-images using a base image without mirrors.wikimedia.org in the apt sources - https://phabricator.wikimedia.org/T423972#11896214 (10brennen) a:03brennen [20:58:02] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2100) [21:04:01] (03PS1) 10Zabe: Disable GNSM on dewikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283953 (https://phabricator.wikimedia.org/T421798) [21:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [21:04:51] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:05:54] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc1021 [21:06:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1021 [21:07:42] !log vriley@cumin1003 START - Cookbook 
sre.dns.netbox [21:09:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:11:47] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [pc1021] - vriley@cumin1003" [21:11:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [pc1021] - vriley@cumin1003" [21:11:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:12:57] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:14:16] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:15:11] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:17:17] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:22:17] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:24:10] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [21:26:59] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:27:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1012.eqiad.wmnet with OS trixie [21:28:26] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:29:32] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:31:36] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:43:31] jouncebot: nowandnext [21:43:31] For the next 0 hour(s) and 16 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2100) [21:43:31] In 0 hour(s) and 16 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2200) [21:43:47] (03CR) 10Zabe: [C:03+2] Disable GNSM on dewikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283953 (https://phabricator.wikimedia.org/T421798) (owner: 10Zabe) [21:44:07] (03PS4) 10Ryan Kemper: data-platform: Add alerts for cirrus memory or I/O stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [21:44:43] (03Merged) 10jenkins-bot: Disable GNSM on dewikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283953 (https://phabricator.wikimedia.org/T421798) (owner: 10Zabe) [21:44:48] 06SRE, 10SRE-Access-Requests: logstash access - https://phabricator.wikimedia.org/T425575#11896348 (10KineticPelagic) 
05Open→03Invalid Made request through IDM. Thank you, @Aklapper . [21:45:38] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1283953|Disable GNSM on dewikinews (T421798)]] [21:45:41] T421798: Undeploy GoogleNewsSitemap after 2026-05-04 - https://phabricator.wikimedia.org/T421798 [21:47:34] !log zabe@deploy1003 zabe: Backport for [[gerrit:1283953|Disable GNSM on dewikinews (T421798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:48:23] !log zabe@deploy1003 zabe: Continuing with deployment [21:51:20] (03PS5) 10Ryan Kemper: data-platform: Add alerts for cirrus memory/IO stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [21:51:49] (03CR) 10Bking: [C:03+2] data-platform: Add alerts for cirrus memory/IO stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [21:52:00] (03CR) 10Ryan Kemper: [C:03+1] "We should circle back to add runbooks but I'm fine getting this shipped first" [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [21:52:34] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283953|Disable GNSM on dewikinews (T421798)]] (duration: 06m 56s) [21:52:38] T421798: Undeploy GoogleNewsSitemap after 2026-05-04 - https://phabricator.wikimedia.org/T421798 [21:54:03] (03PS1) 10Cathal Mooney: Rancid: fix srlinux.pm parsing of Nokia SR-Linux configs [puppet] - 10https://gerrit.wikimedia.org/r/1283969 [21:54:12] (03Merged) 10jenkins-bot: data-platform: Add alerts for cirrus memory/IO stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) (owner: 10Bking) [21:54:22] (03PS4) 10Zabe: Undeploy GoogleNewsSitemap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277783 (https://phabricator.wikimedia.org/T421798) [21:57:23] 
(03PS2) 10Cathal Mooney: Rancid: fix srlinux.pm parsing of Nokia SR-Linux configs [puppet] - 10https://gerrit.wikimedia.org/r/1283969 [21:58:10] (03PS1) 10Clare Ming: UBN fix: guard entry.serverTiming before forEach [extensions/TestKitchen] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283972 (https://phabricator.wikimedia.org/T425591) [21:58:55] (03CR) 10Cathal Mooney: [C:03+2] Rancid: fix srlinux.pm parsing of Nokia SR-Linux configs [puppet] - 10https://gerrit.wikimedia.org/r/1283969 (owner: 10Cathal Mooney) [22:00:00] jouncebot: nowandnext [22:00:00] For the next 0 hour(s) and 59 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2200) [22:00:01] In 7 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600) [22:00:01] In 7 hour(s) and 59 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600) [22:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2200) [22:00:42] ok if i backport a UBN fix now-ish? 
[22:00:44] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1283972 [22:04:02] it's the Readers deployment window now -- if there's no activity in the next 2-3 minutes, I will proceed with above patch unless someone else me otherwise [22:04:34] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc1021 [22:04:38] *someone else tells me otherwise [22:05:35] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1021 [22:06:04] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [22:06:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283972 (https://phabricator.wikimedia.org/T425591) (owner: 10Clare Ming) [22:07:46] (03Merged) 10jenkins-bot: UBN fix: guard entry.serverTiming before forEach [extensions/TestKitchen] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283972 (https://phabricator.wikimedia.org/T425591) (owner: 10Clare Ming) [22:08:06] (03CR) 10ArielGlenn: "couple small things noted, really glad to see this patch land,the makefiles in nonstandard locations was making my teeth itch a bit :-)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [22:08:11] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1283972|UBN fix: guard entry.serverTiming before forEach (T425591)]] [22:08:14] T425591: TypeError: undefined is not an object (evaluating 'entry.serverTiming.forEach') - https://phabricator.wikimedia.org/T425591 [22:08:51] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:09:24] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:10:03] !log cjming@deploy1003 cjming: Backport for 
[[gerrit:1283972|UBN fix: guard entry.serverTiming before forEach (T425591)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:10:27] !log cjming@deploy1003 cjming: Continuing with deployment [22:11:25] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:11:49] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:14:36] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283972|UBN fix: guard entry.serverTiming before forEach (T425591)]] (duration: 06m 25s) [22:14:39] T425591: TypeError: undefined is not an object (evaluating 'entry.serverTiming.forEach') - https://phabricator.wikimedia.org/T425591 [22:14:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:14:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [22:14:58] 10ops-esams, 06SRE, 06Commons, 06DC-Ops, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11896466 (10AlexisJazz) >>! In T425216#11895597, @ssingh wrote: >>>! In T425216#11887650, @TheDJ wrote: >> I believe there is a 24 hourly script that... 
[22:15:01] (03PS2) 10Hashar: zuul: use upstream "build node" semantic [puppet] - 10https://gerrit.wikimedia.org/r/1283968 [22:15:01] (03CR) 10Hashar: "Hosts that have no differences (main and executor):" [puppet] - 10https://gerrit.wikimedia.org/r/1283968 (owner: 10Hashar) [22:15:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:18:29] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host pc1021.eqiad.wmnet with OS trixie [22:18:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie [22:22:45] jouncebot: nowandnext [22:22:45] For the next 0 hour(s) and 37 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260506T2200) [22:22:46] In 7 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600) [22:22:46] In 7 hour(s) and 37 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600) [22:23:50] (03PS1) 10Jasmine: kafka-main: set codfw brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419212) [22:24:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [22:24:59] (03CR) 10Jasmine: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419212) (owner: 10Jasmine) [22:25:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283872 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:26:03] (03Merged) 10jenkins-bot: Close English Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283872 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:26:29] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283872|Close English Wikinews (T421796)]] [22:26:32] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [22:28:22] (03PS1) 10Cathal Mooney: team-netops: CoreRouterInterfaceDropPercent - ingore missing series [alerts] - 10https://gerrit.wikimedia.org/r/1283993 [22:28:24] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283872|Close English Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:28:59] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[22:33:09] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283872|Close English Wikinews (T421796)]] (duration: 06m 40s)
[22:33:12] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:37:58] (PS1) Ladsgroup: Close Spanish Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1284004 (https://phabricator.wikimedia.org/T421796)
[22:40:55] FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:41:25] (PS2) Jasmine: kafka-main: set codfw brokers inter-broker protocol to 3.7 [puppet] - https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419216)
[22:41:48] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1284004 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[22:42:50] (Merged) jenkins-bot: Close Spanish Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1284004 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[22:43:16] (PS1) SBassett: Enable CSPUseReportURIDirective in Wikimedia production [mediawiki-config] - https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058)
[22:43:16] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1284004|Close Spanish Wikinews (T421796)]]
[22:43:19] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:43:45] (CR) SBassett: [C:-1] "Hold for config deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1284008 (https://phabricator.wikimedia.org/T424058) (owner: SBassett)
[22:45:12] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1284004|Close Spanish Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:46:11] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[22:48:29] SRE, Traffic: "Nahui Ollin" is enwiki's #1 article. Never heard of it? That's the problem - https://phabricator.wikimedia.org/T425600 (AlexisJazz) NEW
[22:50:24] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284004|Close Spanish Wikinews (T421796)]] (duration: 07m 08s)
[22:50:28] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:56:31] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[23:00:49] (PS1) Ladsgroup: Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/1284014 (https://phabricator.wikimedia.org/T421797)
[23:01:31] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link
[23:03:40] (CR) Ladsgroup: [C:+2] Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/1284014 (https://phabricator.wikimedia.org/T421797) (owner: Ladsgroup)
[23:05:20] (Merged) jenkins-bot: Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/1284014 (https://phabricator.wikimedia.org/T421797) (owner: Ladsgroup)
[23:12:25] !log ladsgroup@deploy1003 Synchronized portals/wikipedia.org/assets: Sync portals for removal of Wikinews (duration: 06m 12s)
[23:14:48] !log ladsgroup@deploy1003 Synchronized portals: Sync portals for removal of Wikinews (duration: 02m 22s)
[23:19:50] (PS1) Cwhite: ci test plz [puppet] - https://gerrit.wikimedia.org/r/1284024
[23:22:45] SRE, Pageviews-Anomaly, Traffic: "Nahui Ollin" is enwiki's #1 article. Never heard of it? That's the problem - https://phabricator.wikimedia.org/T425600#11896663 (AntiCompositeNumber)
[23:38:45] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1021.eqiad.wmnet with OS trixie
[23:38:52] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896690 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie executed with errors: - pc1021 (**FAIL**...
[23:40:34] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1284035
[23:40:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1284035 (owner: TrainBranchBot)
[23:41:22] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host pc1021.eqiad.wmnet with OS trixie
[23:41:36] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896691 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie
[23:52:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1284035 (owner: TrainBranchBot)