[00:05:42] (PS1) Dzahn: add load balancer IPs for gitlab to geo DNS [dns] - https://gerrit.wikimedia.org/r/1282436 [00:06:12] RECOVERY - OpenSearch unassigned shard check - 9200 on cloudelastic1008 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [00:07:21] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1282434 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup) [00:08:18] (Merged) jenkins-bot: Close Catalan Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282434 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup) [00:09:12] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282434|Close Catalan Wikinews (T421796)]] [00:09:15] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [00:10:56] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282434|Close Catalan Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:11:54] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [00:14:55] (03PS1) 10Dzahn: add discovery names for gitlab [dns] - 10https://gerrit.wikimedia.org/r/1282437 [00:16:02] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282434|Close Catalan Wikinews (T421796)]] (duration: 06m 50s) [00:16:05] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [00:16:14] RECOVERY - OpenSearch unassigned shard check - 9200 on cloudelastic1010 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [00:20:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:23:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [00:25:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:26:12] RECOVERY - OpenSearch unassigned shard check - 9200 on cloudelastic1009 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [00:31:12] RECOVERY - OpenSearch unassigned shard check - 9200 on cloudelastic1007 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [00:31:14] RECOVERY - OpenSearch unassigned 
shard check - 9200 on cloudelastic1012 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [00:31:14] RECOVERY - OpenSearch unassigned shard check - 9200 on cloudelastic1011 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [00:42:38] PROBLEM - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1215) taken on 2026-05-05 00:38:41 is 3 MiB, but the previous one was 3 MiB, a change of -19.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:44:33] (03CR) 10Scott French: [C:03+1] delete mwmaint.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1282430 (owner: 10Dzahn) [00:44:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [00:49:17] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 384708728 and 44 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:25] (03CR) 10Scott French: [C:03+1] Revert "envoy: Allow configuring delayed_closed_timeout" [puppet] - 10https://gerrit.wikimedia.org/r/1282338 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [00:49:33] (03CR) 10Scott French: [C:03+1] Revert "envoy: Allow disabling circuit breakers" [puppet] - 10https://gerrit.wikimedia.org/r/1282339 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [00:49:56] (03CR) 10Scott French: [C:03+1] Revert "envoyproxy: Allow disabling x-request-id generation" [puppet] - 10https://gerrit.wikimedia.org/r/1282340 
(https://phabricator.wikimedia.org/T271421) (owner: JMeybohm) [00:50:24] (CR) Scott French: [C:+1] Revert "envoyproxy: Allow setting http2 protocol options" [puppet] - https://gerrit.wikimedia.org/r/1282341 (https://phabricator.wikimedia.org/T271421) (owner: JMeybohm) [00:50:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:51:01] (CR) Scott French: [C:+1] Revert "envoyproxy: Allow configuring TLS handshake timeout" [puppet] - https://gerrit.wikimedia.org/r/1282342 (https://phabricator.wikimedia.org/T271421) (owner: JMeybohm) [00:52:20] (CR) Scott French: [C:+1] Revert "envoyproxy: Support TLS min/max version config" [puppet] - https://gerrit.wikimedia.org/r/1282343 (https://phabricator.wikimedia.org/T271421) (owner: JMeybohm) [00:53:52] (CR) Scott French: [C:+1] Revert "envoyproxy: Support alpn_protocols configuration" [puppet] - https://gerrit.wikimedia.org/r/1282344 (https://phabricator.wikimedia.org/T271421) (owner: JMeybohm) [00:55:45] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:58:16] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:04:41] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [01:07:40] PROBLEM - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2026-05-05 00:37:50 is 3 MiB, but the previous one was 3 MiB, a change of -19.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:09:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.47.0-wmf.1 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282439 (https://phabricator.wikimedia.org/T423910) [01:09:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.47.0-wmf.1 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282439 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [01:09:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [01:09:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [01:09:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1282440 [01:09:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1282440 (owner: 10TrainBranchBot) [01:10:16] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 504610352 and 41 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:10:22] (03CR) 10Scott French: "Thanks, Janis!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [01:14:54] 10ops-esams, 06SRE, 06Commons, 06DC-Ops, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11888186 (10AlexisJazz) >>! In T425216#11887650, @TheDJ wrote: > I believe there is a 24 hourly script that checks cross dc consistency or something... [01:15:16] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4099984 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:16:34] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [01:18:46] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [01:21:05] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/1.47.0-wmf.1 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282439 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [01:21:12] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns infor for new switches - pt1979@cumin2002" [01:21:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns infor for new switches - pt1979@cumin2002" [01:21:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:22:48] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1282440 (owner: 10TrainBranchBot) [01:25:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[01:25:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [01:28:27] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:29:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [01:29:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [01:30:01] 👀 [01:34:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [01:53:13] (03CR) 10Thcipriani: [C:03+2] Branch commit for wmf/1.47.0-wmf.1 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282439 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [01:57:42] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/1.47.0-wmf.1 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282439 (https://phabricator.wikimedia.org/T423910) (owner: 
10TrainBranchBot) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T0200) [02:00:50] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:06:17] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 148481208 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:07:17] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:07:29] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 39s) [02:09:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [02:09:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:20:52] (03CR) 10Thcipriani: "This should merge during the switchover. It changes all integration machines default Java to 21. 
The machines are currently hooked to the " [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [02:23:36] (03CR) 10Thcipriani: "I think we're ready for this, but I'll defer to @dduvall@wikimedia.org – Dan, is this one ready?" [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [02:29:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [02:29:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:32:30] FIRING: Traffic bill over quota: Alert for device cr2-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:34:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:30] FIRING: [3x] Traffic bill over quota: Alert for device cr2-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:44:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[02:44:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:52:24] PROBLEM - Check correctness of the icinga configuration on alert1002 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [02:52:30] FIRING: [3x] Traffic bill over quota: Alert for device cr2-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [02:55:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [02:55:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [02:57:30] RESOLVED: [2x] Traffic bill over quota: Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T0300) [03:04:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[03:04:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [03:17:15] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11888290 (10Papaul) All the servers in rack 22 are connected to the new switch and all the link are up I just tested cp4037 but all others should be online.... [03:20:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [03:25:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [03:34:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[03:34:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [03:35:40] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:41:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [03:41:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [03:56:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[03:56:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T0400) [04:03:14] !log mwpresync@deploy1003 Pruned MediaWiki: 1.46.0-wmf.23 (duration: 03m 12s) [04:03:31] RESOLVED: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [04:13:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [04:13:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [04:15:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[04:15:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [04:17:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [04:42:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [04:43:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [04:43:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [05:00:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[05:00:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [05:01:20] o/ [05:02:08] damn, wrong time, coming back 2hr later [05:04:35] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [05:22:06] (03CR) 10Arnaudb: [C:03+1] "thanks for that change, looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1282395 (https://phabricator.wikimedia.org/T240266) (owner: 10Dzahn) [05:27:52] (03PS1) 10Marostegui: db1156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282707 (https://phabricator.wikimedia.org/T424615) [05:28:27] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 13 hosts with reason: Sanitarium s2 master: reimage to Debian Trixie [05:29:33] (03PS1) 10Giuseppe Lavagetto: Deploy patterns as inline patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1282708 [05:29:40] (03CR) 10Marostegui: [C:03+2] db1156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282707 (https://phabricator.wikimedia.org/T424615) (owner: 
10Marostegui) [05:30:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1156.eqiad.wmnet with reason: Reimage to Trixie [05:30:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1156: Reimage to Trixie [05:31:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1156: Reimage to Trixie [05:33:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1156.eqiad.wmnet with OS trixie [05:36:53] (03PS1) 10Ayounsi: ulsfo Bird (dns, ganeti, VMs) peer with ToR switch [puppet] - 10https://gerrit.wikimedia.org/r/1282711 (https://phabricator.wikimedia.org/T408892) [05:37:04] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282711 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [05:38:40] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Deploy patterns as inline patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1282708 (owner: 10Giuseppe Lavagetto) [05:39:33] (03PS2) 10Giuseppe Lavagetto: Adding post-deploy step [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1265349 [05:42:50] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "patterns_as_inline_patterns - oblivian@cumin1003" [05:42:52] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: patterns_as_inline_patterns - oblivian@cumin1003 [05:43:46] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: patterns_as_inline_patterns - oblivian@cumin1003 [05:43:47] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "patterns_as_inline_patterns - oblivian@cumin1003" [05:46:09] !log marostegui@cumin1003 
START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: host reimage [05:46:25] (03PS2) 10Ayounsi: ulsfo Bird (dns, ganeti, VMs) peer with ToR switch [puppet] - 10https://gerrit.wikimedia.org/r/1282711 (https://phabricator.wikimedia.org/T408892) [05:47:55] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282711 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [05:49:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: host reimage [05:58:24] (03PS1) 10Marostegui: Revert "db1156: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1282722 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T0600). 
[06:00:31] (CR) Marostegui: [C:+2] Revert "db1156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1282722 (owner: Marostegui) [06:00:33] (CR) Ayounsi: [C:+2] ulsfo Bird (dns, ganeti, VMs) peer with ToR switch [puppet] - https://gerrit.wikimedia.org/r/1282711 (https://phabricator.wikimedia.org/T408892) (owner: Ayounsi) [06:01:07] (PS1) Jgiannelos: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1282723 (https://phabricator.wikimedia.org/T384599) [06:01:15] marostegui: yes [06:01:24] XioNoX: XDDDDDDD [06:01:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Cite] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1282723 (https://phabricator.wikimedia.org/T384599) (owner: Jgiannelos) [06:01:33] done [06:01:34] :) [06:05:22] RECOVERY - Host hcaptcha-proxy4003 is UP: PING OK - Packet loss = 0%, RTA = 71.85 ms [06:05:22] RECOVERY - Host doh4004 is UP: PING OK - Packet loss = 0%, RTA = 71.50 ms [06:05:22] RECOVERY - Host bast4006 is UP: PING OK - Packet loss = 0%, RTA = 71.56 ms [06:05:22] RECOVERY - Host netflow4003 is UP: PING OK - Packet loss = 0%, RTA = 71.51 ms [06:05:22] RECOVERY - Host ncredir4003 is UP: PING OK - Packet loss = 0%, RTA = 71.49 ms [06:05:26] RECOVERY - Host ncredir4004 is UP: PING OK - Packet loss = 0%, RTA = 71.52 ms [06:05:45] hello VMs ! 
[06:07:22] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:07:22] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=ulsfo - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:07:24] FIRING: FNMNotReported: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [06:07:24] RECOVERY - Host hcaptcha-proxy4004 is UP: PING OK - Packet loss = 0%, RTA = 71.65 ms [06:07:24] RECOVERY - Host install4004 is UP: PING OK - Packet loss = 0%, RTA = 71.56 ms [06:07:24] RECOVERY - Host doh4003 is UP: PING OK - Packet loss = 0%, RTA = 71.56 ms [06:07:24] RECOVERY - Host tcp-proxy4003 is UP: PING OK - Packet loss = 0%, RTA = 71.48 ms [06:07:24] RECOVERY - Host prometheus4003 is UP: PING OK - Packet loss = 0%, RTA = 71.56 ms [06:07:24] RECOVERY - Host tcp-proxy4004 is UP: PING OK - Packet loss = 0%, RTA = 71.62 ms [06:07:26] RECOVERY - Host durum4004 is UP: PING OK - Packet loss = 0%, RTA = 71.52 ms [06:07:26] RECOVERY - Host durum4003 is UP: PING OK - Packet loss = 0%, RTA = 71.50 ms [06:07:28] FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:08:31] <_joe_> XioNoX: was that you? 
[06:08:32] that might be because prometheus4003 is back online ^ [06:08:55] I've only brought ulsfo back online [06:08:59] er, ulsfo VMs [06:09:16] I don't see any issue on the linked dashboard [06:09:21] FIRING: [30x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:10:43] FIRING: [6x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:10:52] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:11:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1156.eqiad.wmnet with OS trixie [06:11:52] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:12:24] RESOLVED: FNMNotReported: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [06:14:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1156: after 
reimage to trixie [06:15:14] FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:15:22] (03PS1) 10Majavah: P:wmcs: metricsinfra: Set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1282730 (https://phabricator.wikimedia.org/T424312) [06:15:54] (03PS1) 10Ayounsi: ulsfo LVS: peer with the ToR switch [puppet] - 10https://gerrit.wikimedia.org/r/1282731 (https://phabricator.wikimedia.org/T408892) [06:16:09] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282731 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [06:16:31] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1282730 (https://phabricator.wikimedia.org/T424312) (owner: 10Majavah) [06:20:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:23:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[06:23:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [06:25:14] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:28:46] (03PS1) 10Muehlenhoff: thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1282752 [06:33:37] (03PS2) 10Ayounsi: ulsfo LVS: peer with the ToR switch [puppet] - 10https://gerrit.wikimedia.org/r/1282731 (https://phabricator.wikimedia.org/T408892) [06:33:37] (03PS1) 10Ayounsi: LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 [06:34:08] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [06:35:25] (03CR) 10Ayounsi: [C:03+1] Assign the hcaptcha::proxy role to hcaptcha-proxy5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1280353 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [06:35:36] (03CR) 10Ayounsi: [C:03+1] Assign the durum role for durum5003/5004 [puppet] - 10https://gerrit.wikimedia.org/r/1282351 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [06:36:08] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282731 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [06:39:42] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1282731 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [06:40:00] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 
75%, RTA = 6777.07 ms [06:40:47] (03PS1) 10Marostegui: db1170,db2222: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282776 (https://phabricator.wikimedia.org/T425388) [06:42:01] (03CR) 10Marostegui: [C:03+2] db1170,db2222: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282776 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [06:42:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2222.codfw.wmnet with reason: Reimage to Trixie [06:42:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2222: Reimage to Trixie [06:42:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Reimage to Trixie [06:42:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1170: Reimage to Trixie [06:43:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2222: Reimage to Trixie [06:43:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1170: Reimage to Trixie [06:44:28] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2222.codfw.wmnet with OS trixie [06:44:40] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1170.eqiad.wmnet with OS trixie [06:45:02] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 234.33 ms [06:45:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [06:45:46] (03PS2) 10Ayounsi: LVS BGP: peer with the gateway if no 
exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 [06:47:37] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282779 [06:48:36] (03PS1) 10Ayounsi: ulsfo liberica BGP: peer with the ToR switch [puppet] - 10https://gerrit.wikimedia.org/r/1282780 (https://phabricator.wikimedia.org/T408892) [06:49:48] (03CR) 10Mmartorana: [C:03+1] "recheck" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [06:50:01] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282780 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [06:50:03] (03PS2) 10Ayounsi: ulsfo liberica BGP: peer with the ToR switch [puppet] - 10https://gerrit.wikimedia.org/r/1282780 (https://phabricator.wikimedia.org/T408892) [06:50:35] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282780 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [06:54:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1282780 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [06:58:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1170.eqiad.wmnet with reason: host reimage [06:58:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install5004.wikimedia.org [06:58:50] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T0700). nyaa~ [07:00:04] nya_1F616EMO and nemo-yiannis: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1156: after reimage to trixie [07:00:11] I can deploy the backports this morning. [07:00:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2222.codfw.wmnet with reason: host reimage [07:00:45] nya_1F616EMO: Hi, I'm about to deploy the logo change. [07:03:34] nya_1F616EMO: Thanks for the two-part deployment; it looks like the usages were already removed yesterday. [07:03:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [07:03:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install5004.wikimedia.org [07:03:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281967 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [07:03:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1170.eqiad.wmnet with reason: host reimage [07:04:35] (03Merged) 10jenkins-bot: zhwikinews: (2/2) revert 20th anniversary logo change (assets) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281967 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [07:05:01] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1281967|zhwikinews: (2/2) revert 20th anniversary logo change (assets) (T420165)]] [07:05:04] T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165 [07:06:46] !log awight@deploy1003 awight, 1f616emo: Backport for [[gerrit:1281967|zhwikinews: (2/2) revert 20th anniversary logo change (assets) (T420165)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:07:30] !log awight@deploy1003 awight, 1f616emo: Continuing with deployment [07:07:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2222.codfw.wmnet with reason: host reimage [07:11:44] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281967|zhwikinews: (2/2) revert 20th anniversary logo change (assets) (T420165)]] (duration: 06m 43s) [07:11:47] T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165 [07:15:45] (03CR) 10Ayounsi: [C:03+2] ulsfo liberica BGP: peer with the ToR switch [puppet] - 10https://gerrit.wikimedia.org/r/1282780 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [07:15:52] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:20:36] (03PS1) 10Awight: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) [07:20:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [07:21:14] (03PS1) 10Ayounsi: ulsfo: update and add missing includes [dns] - 10https://gerrit.wikimedia.org/r/1282805 (https://phabricator.wikimedia.org/T408892) [07:21:40] (03PS2) 10Jgiannelos: Errors added below ref list dirty when not responsive [extensions/Cite] 
(wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [07:21:59] (03CR) 10CI reject: [V:04-1] ulsfo: update and add missing includes [dns] - 10https://gerrit.wikimedia.org/r/1282805 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [07:24:10] (03PS3) 10Daniel Kinzler: rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 [07:24:56] (03PS1) 10Marostegui: Revert "db1170,db2222: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1282806 [07:25:12] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [07:25:31] FIRING: LibericaStaleConfig: Liberica instance lvs4009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4009 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [07:25:32] (03CR) 10CI reject: [V:04-1] Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [07:25:32] (03CR) 10Marostegui: [C:03+2] Revert "db1170,db2222: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1282806 (owner: 10Marostegui) [07:26:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1170.eqiad.wmnet with OS trixie [07:27:42] (03CR) 10Jgiannelos: "recheck" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [07:28:44] (03Abandoned) 10Awight: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282723 (https://phabricator.wikimedia.org/T384599) (owner: 
10Jgiannelos) [07:30:14] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 0%, RTA = 681.46 ms [07:30:19] (03PS2) 10Daniel Kinzler: rest gateway: add more known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 [07:30:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1170: after reimage to trixie [07:31:01] (03CR) 10Daniel Kinzler: [C:04-1] "Chart bump!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) (owner: 10Daniel Kinzler) [07:31:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2222.codfw.wmnet with OS trixie [07:33:23] (03CR) 10Jgiannelos: "recheck" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [07:33:25] (03CR) 10Zabe: [C:03+2] "It was not the same failure. And it also looks random. So actually worth a second retry." [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282439 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [07:33:53] 06SRE, 13Patch-Needs-Improvement: Some SAL log entries (e.g. switchdc, scap backport) are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709#11888396 (10Volans) Given the new title removing SRE Infrastructure Foundations as tcpircbot/logmsgbot is not owned/m... 
[07:35:04] (03PS2) 10Ayounsi: ulsfo: update and add missing includes [dns] - 10https://gerrit.wikimedia.org/r/1282805 (https://phabricator.wikimedia.org/T408892) [07:35:15] (03PS4) 10Daniel Kinzler: rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 [07:35:49] (03CR) 10CI reject: [V:04-1] ulsfo: update and add missing includes [dns] - 10https://gerrit.wikimedia.org/r/1282805 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [07:36:17] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2222: after reimage to trixie [07:36:30] (03PS10) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) [07:38:17] (03Merged) 10jenkins-bot: Branch commit for wmf/1.47.0-wmf.1 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282439 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [07:38:28] (03PS5) 10Daniel Kinzler: rest-gateway: generalize class overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) [07:38:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[07:38:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [07:39:24] (03PS1) 10Awight: Temporarily disable some parser tests [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 [07:39:25] (03PS5) 10Daniel Kinzler: rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 [07:39:45] (03PS3) 10Daniel Kinzler: rest gateway: add more known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 [07:39:56] (03PS11) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) [07:40:30] (03PS3) 10Awight: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) [07:40:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [07:40:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [07:41:55] (03CR) 10CI reject: [V:04-1] Temporarily disable some parser tests [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [07:43:09] (03CR) 10CI reject: [V:04-1] Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.46.0-wmf.26) - 
10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [07:44:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good by my limited understanding from reading up on the Phab tasks" [dns] - 10https://gerrit.wikimedia.org/r/1282805 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [07:45:45] !log zabe@deploy1003:~$ mwscript namespaceDupes.php scnwiki --fix # T425378 [07:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:48] T425378: Invalid pages with names of special pages exist on scnwiki - https://phabricator.wikimedia.org/T425378 [07:49:37] (03Restored) 10Awight: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282723 (https://phabricator.wikimedia.org/T384599) (owner: 10Jgiannelos) [07:49:50] (03PS6) 10Daniel Kinzler: rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 [07:50:07] (03PS4) 10Daniel Kinzler: rest gateway: add more known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 [07:50:14] (03PS12) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) [07:50:14] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:50:39] !log zabe@deploy1003:~$ foreachwiki refreshImageMetadata --force --mediatype AUDIO --mime audio/midi # T414645 [07:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:42] T414645: Midi files with duration 0 due to not having tempo events - https://phabricator.wikimedia.org/T414645 [07:50:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with 
reason: Primary switchover s3 T424864 [07:50:46] T424864: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T424864 [07:51:19] (03CR) 10Atsuko: [C:03+2] wmnet: add additional opensearch clusters [dns] - 10https://gerrit.wikimedia.org/r/1281462 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [07:51:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2205 with weight 0 T424864', diff saved to https://phabricator.wikimedia.org/P92239 and previous config saved to /var/cache/conftool/dbconfig/20260505-075156-marostegui.json [07:52:08] (03CR) 10Awight: [C:03+2] "Thank you!" [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282723 (https://phabricator.wikimedia.org/T384599) (owner: 10Jgiannelos) [07:52:15] (03PS1) 10Mszwarc: Switch 'autoconfirmed' to use APCOND_AGE_FROM_EDIT on certain wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282850 (https://phabricator.wikimedia.org/T418484) [07:52:21] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1279390 (https://phabricator.wikimedia.org/T424864) (owner: 10Gerrit maintenance bot) [07:52:50] !log Starting s3 codfw failover from db2209 to db2205 - T424864 [07:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [07:54:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T419961)', diff saved to https://phabricator.wikimedia.org/P92241 and previous config saved to /var/cache/conftool/dbconfig/20260505-075416-fceratto.json [07:55:17] !log EU morning deployment was fun [07:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:43] FIRING: [6x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:55:53] (03PS1) 10Jelto: miscweb: update wmf-navigator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282871 (https://phabricator.wikimedia.org/T414405) [07:56:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282850 (https://phabricator.wikimedia.org/T418484) (owner: 10Mszwarc) [07:56:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2205 to s3 primary T424864', diff saved to https://phabricator.wikimedia.org/P92242 and previous config saved to /var/cache/conftool/dbconfig/20260505-075654-marostegui.json [07:56:58] T424864: Switchover s3 master (db2209 -> db2205) - https://phabricator.wikimedia.org/T424864 [07:57:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2209 T424864', diff saved to https://phabricator.wikimedia.org/P92243 and previous config saved to /var/cache/conftool/dbconfig/20260505-075746-marostegui.json [07:58:33] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [07:58:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[07:58:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [07:59:52] (03PS1) 10Marostegui: db2209: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282874 (https://phabricator.wikimedia.org/T424792) [08:00:10] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [08:00:20] (03PS1) 10Jgiannelos: Bump base images to latest [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1282875 (https://phabricator.wikimedia.org/T425304) [08:00:55] (03CR) 10Marostegui: [C:03+2] db2209: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282874 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui) [08:01:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2209.codfw.wmnet with reason: Reimage to Trixie [08:01:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2209: Reimage to Trixie [08:01:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2209: Reimage to Trixie [08:01:33] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ulsfo includes - ayounsi@cumin1003" [08:01:57] (03CR) 10Ayounsi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1282805 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:01:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ulsfo includes - 
ayounsi@cumin1003" [08:01:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:02:39] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2209.codfw.wmnet with OS trixie [08:02:52] (03CR) 10Ayounsi: [C:03+2] ulsfo: update and add missing includes [dns] - 10https://gerrit.wikimedia.org/r/1282805 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:03:00] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:00] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:00] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:00] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:00] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with 
operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:00] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T419961)', diff saved to https://phabricator.wikimedia.org/P92245 and previous config saved to /var/cache/conftool/dbconfig/20260505-080301-fceratto.json [08:03:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are 
NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:08] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:10] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:10] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 32faef3126d9cf6f8d91fa0ce3ecc8c875b039f8, dns.git is 4cb885285fff61ab8b2a53e63c07beec72ab3291) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:03:13] !log ayounsi@dns1004 START - running authdns-update [08:04:36] (03CR) 10Jgiannelos: "It looks like this is building images but I don't have a way to test them with actual maps content locally. Can we try them on staging?" 
[software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1282875 (https://phabricator.wikimedia.org/T425304) (owner: 10Jgiannelos) [08:05:10] (03Merged) 10jenkins-bot: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282723 (https://phabricator.wikimedia.org/T384599) (owner: 10Jgiannelos) [08:05:24] (03CR) 10Jgiannelos: "Sorry image and not images (we reuse the same for pregen and vector server)" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1282875 (https://phabricator.wikimedia.org/T425304) (owner: 10Jgiannelos) [08:05:30] !log ayounsi@dns1004 END - running authdns-update [08:08:00] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:00] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:00] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:00] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:00] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:00] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and 
operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:08] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [08:08:23] (03PS1) 10Majavah: openstack: 
encapi: Always use explicit SQL JOIN syntax [puppet] - 10https://gerrit.wikimedia.org/r/1282878 (https://phabricator.wikimedia.org/T416588) [08:08:25] (03PS1) 10Majavah: openstack: wmcs-enc-cli: Add subcommand to manually delete projects [puppet] - 10https://gerrit.wikimedia.org/r/1282879 (https://phabricator.wikimedia.org/T416588) [08:08:37] !log zabe@deploy1003:~$ foreachwiki refreshImageMetadata --broken-only --mediatype AUDIO --mime audio/flac # T414641 [08:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:39] T414641: Some flac files have duration 0 - https://phabricator.wikimedia.org/T414641 [08:08:41] (03CR) 10JMeybohm: [C:03+1] Move pki.discovery.wmnet's eqiad endpoint to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1282391 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [08:10:43] FIRING: [4x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:13:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P92247 and previous config saved to /var/cache/conftool/dbconfig/20260505-081309-fceratto.json [08:13:38] (03CR) 10Muehlenhoff: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1282391 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [08:13:59] (03PS2) 10Jgiannelos: Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (owner: 10Zabe) [08:13:59] (03PS2) 10Jgiannelos: Temporarily disable some parser tests [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [08:14:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install5004.wikimedia.org [08:14:21] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:15:14] RESOLVED: [4x] BFDdown: BFD session down between cr4-ulsfo and 198.35.26.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:15:19] (03PS1) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) [08:16:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1170: after reimage to trixie [08:16:08] (03CR) 10Jgiannelos: [C:04-1] Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [08:16:23] (03CR) 10Jgiannelos: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [08:16:38] !log zabe@deploy1003:~$ foreachwiki refreshImageMetadata --broken-only --mediatype AUDIO --mime audio/x-flac # T414641 [08:16:39] (03CR) 10Jgiannelos: [C:04-1] Temporarily disable some parser tests [extensions/Scribunto] 
(wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [08:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:41] T414641: Some flac files have duration 0 - https://phabricator.wikimedia.org/T414641 [08:17:31] (03PS1) 10Marostegui: Revert "db2209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1282885 [08:17:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [08:19:45] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis) [08:20:32] jmm@cumin2002 makevm (PID 1617142) is awaiting input [08:20:44] (03CR) 10Fabfur: [C:03+1] Add x_trusted_request and x_wmf_ratelimit_class to webrequest live streams [puppet] - 10https://gerrit.wikimedia.org/r/1279489 (https://phabricator.wikimedia.org/T419736) (owner: 10Aleksandar Mastilovic) [08:22:00] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2209.codfw.wmnet with reason: host reimage [08:22:04] hello! there is a bug in the wmf/1.47.0-wmf.1 branch of WikibaseLexeme and the fix is already in master. since the train deployment hasn't happened yet, can we just cherry-pick the fix onto wmf/1.47.0-wmf.1 or is any further action needed? 
[08:22:09] (03CR) 10Jelto: [C:03+2] miscweb: update wmf-navigator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282871 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [08:22:24] (03CR) 10Daniel Kinzler: [C:04-1] rest gateway: defined anon-mediawiki class (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [08:22:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2222: after reimage to trixie [08:23:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P92250 and previous config saved to /var/cache/conftool/dbconfig/20260505-082318-fceratto.json [08:24:22] (03CR) 10CI reject: [V:04-1] Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (owner: 10Zabe) [08:24:23] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:24:31] jakob_WMDE: wmf.1 already exists on the deployment server, and usually at this time would've been synced to test wikis automatically, but the test wiki sync seems to have failed for this train for some reason. I would wait for the testwikis issue to be sorted and then backport as usual [08:24:34] (03Merged) 10jenkins-bot: miscweb: update wmf-navigator image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282871 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [08:25:40] taavi: thanks! backport as usual, i.e. via one of the backport windows? 
[08:27:36] (03Abandoned) 10Jgiannelos: Temporarily disable some parser tests [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [08:27:41] (03Abandoned) 10Jgiannelos: Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (owner: 10Zabe) [08:28:49] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5004.wikimedia.org - jmm@cumin2002" [08:28:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5004.wikimedia.org - jmm@cumin2002" [08:28:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:28:56] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5004.wikimedia.org on all recursors [08:29:15] (03CR) 10Elukey: [C:03+1] Bump base images to latest [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1282875 (https://phabricator.wikimedia.org/T425304) (owner: 10Jgiannelos) [08:29:16] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) install5004.wikimedia.org on all recursors [08:29:28] (03CR) 10Elukey: [C:03+1] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1282752 (owner: 10Muehlenhoff) [08:29:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2209.codfw.wmnet with reason: host reimage [08:29:49] (03CR) 10Jgiannelos: [C:03+2] Bump base images to latest [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1282875 (https://phabricator.wikimedia.org/T425304) (owner: 10Jgiannelos) [08:30:15] (03CR) 10Daniel Kinzler: [C:04-1] rest gateway: defined anon-mediawiki class (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [08:30:54] (03PS1) 10Daniel Kinzler: rest-gateway: add anon-app ratelimit class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) [08:31:01] (03Merged) 10jenkins-bot: Bump base images to latest [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1282875 (https://phabricator.wikimedia.org/T425304) (owner: 10Jgiannelos) [08:32:16] jmm@cumin2002 makevm (PID 1617142) is awaiting input [08:32:22] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [08:32:53] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [08:32:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:33:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T419961)', diff saved to https://phabricator.wikimedia.org/P92251 and previous config saved to /var/cache/conftool/dbconfig/20260505-083326-fceratto.json [08:33:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:33:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T419961)', diff saved to https://phabricator.wikimedia.org/P92252 and previous config saved to /var/cache/conftool/dbconfig/20260505-083356-fceratto.json [08:34:02] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 13 hosts with reason: switches replacement [08:34:16] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11888629 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bdfd24a0-f5cd-4c3b-945b-36deeb91ba1c) set by ayounsi@cumin1003 for 20:00:00 on 1... 
[08:34:33] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [08:34:52] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [08:35:18] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [08:37:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11888645 (10elukey) @Jclark-ctr okok the error makes more sense, those are X14-based supermicros like kafka-logging*, so we'll need the new provision cookbook for them. [08:37:02] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [08:37:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:37:24] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:37:25] (03CR) 10Volans: "LGTM with one question inline" [puppet] - 10https://gerrit.wikimedia.org/r/1282878 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [08:38:27] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:39:11] (03PS2) 10Majavah: openstack: encapi: Always use explicit SQL JOIN syntax [puppet] - 10https://gerrit.wikimedia.org/r/1282878 (https://phabricator.wikimedia.org/T416588) [08:39:11] (03PS2) 10Majavah: openstack: wmcs-enc-cli: Add subcommand to manually delete projects [puppet] - 10https://gerrit.wikimedia.org/r/1282879 (https://phabricator.wikimedia.org/T416588) [08:40:02] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[08:40:08] (03CR) 10Majavah: openstack: encapi: Always use explicit SQL JOIN syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282878 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [08:41:08] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [08:41:58] (03CR) 10Marostegui: [C:03+2] Revert "db2209: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1282885 (owner: 10Marostegui) [08:41:58] (03PS2) 10Blake: k8s: Remove support for k8s versions before 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) [08:42:17] (03CR) 10Blake: k8s: Remove support for k8s versions before 1.31 (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) (owner: 10Blake) [08:42:31] (03CR) 10CI reject: [V:04-1] k8s: Remove support for k8s versions before 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) (owner: 10Blake) [08:42:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T419961)', diff saved to https://phabricator.wikimedia.org/P92253 and previous config saved to /var/cache/conftool/dbconfig/20260505-084231-fceratto.json [08:42:41] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[08:43:04] jmm@cumin2002 makevm (PID 1617142) is awaiting input [08:44:05] (03PS1) 10Elukey: profile::kafka::mirror::alerts: fix prometheus URLs [puppet] - 10https://gerrit.wikimedia.org/r/1282919 [08:46:09] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [08:46:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2213 (T419635)', diff saved to https://phabricator.wikimedia.org/P92254 and previous config saved to /var/cache/conftool/dbconfig/20260505-084616-fceratto.json [08:46:19] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:47:51] (03PS1) 10Atsuko: service::catalog: Add opensearch-ttmserver-test [puppet] - 10https://gerrit.wikimedia.org/r/1282920 (https://phabricator.wikimedia.org/T424248) [08:48:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM install5004.wikimedia.org - jmm@cumin2002" [08:48:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM install5004.wikimedia.org - jmm@cumin2002" [08:48:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:48:45] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5004.wikimedia.org on all recursors [08:49:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) install5004.wikimedia.org on all recursors [08:49:28] (03CR) 10Muehlenhoff: [C:03+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1282752 (owner: 10Muehlenhoff) [08:50:16] !log installing augeas security updates [08:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:42] !log 
jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install5004.wikimedia.org [08:52:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P92255 and previous config saved to /var/cache/conftool/dbconfig/20260505-085238-fceratto.json [08:52:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [08:52:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2209.codfw.wmnet with OS trixie [08:54:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T419635)', diff saved to https://phabricator.wikimedia.org/P92256 and previous config saved to /var/cache/conftool/dbconfig/20260505-085407-fceratto.json [08:54:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:54:12] (03CR) 10Filippo Giunchedi: "I like this solution as a middle ground to be able to use zookeeper_clusters also for WMCS cloud-private hostnames. Adding Luca and Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/1282372 (https://phabricator.wikimedia.org/T422646) (owner: 10Majavah) [08:57:44] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1282920 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [08:58:02] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2209: after reimage to trixie [08:58:31] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1282878 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [08:58:58] (03CR) 10Majavah: [C:03+2] openstack: encapi: Always use explicit SQL JOIN syntax [puppet] - 10https://gerrit.wikimedia.org/r/1282878 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [08:59:26] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM! thanos.w.o actually has the same problem, I just noticed" [puppet] - 10https://gerrit.wikimedia.org/r/1282730 (https://phabricator.wikimedia.org/T424312) (owner: 10Majavah) [09:00:01] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs: metricsinfra: Set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1282730 (https://phabricator.wikimedia.org/T424312) (owner: 10Majavah) [09:00:55] (03PS1) 10Muehlenhoff: Add library hint for augeas [puppet] - 10https://gerrit.wikimedia.org/r/1282921 [09:01:17] (03PS1) 10AikoChou: ml-services: update production image for RRLA and revscoring model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282922 (https://phabricator.wikimedia.org/T416384) [09:02:34] (03PS1) 10Jelto: miscweb: make sidecar image configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282923 (https://phabricator.wikimedia.org/T414405) [09:02:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P92258 and previous config saved to /var/cache/conftool/dbconfig/20260505-090246-fceratto.json [09:03:46] (03PS3) 10Majavah: openstack: wmcs-enc-cli: Add subcommand to manually delete projects [puppet] - 10https://gerrit.wikimedia.org/r/1282879 (https://phabricator.wikimedia.org/T416588) [09:04:16] !log fceratto@cumin1003 dbctl 
commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P92259 and previous config saved to /var/cache/conftool/dbconfig/20260505-090415-fceratto.json [09:04:29] (03PS1) 10Jgiannelos: ProofreadPageTestCase: Don't write globals in tests, that's not good, m'kay? [extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282924 [09:04:35] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [09:04:57] (03Restored) 10Jgiannelos: Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (owner: 10Zabe) [09:05:12] (03Restored) 10Jgiannelos: Temporarily disable some parser tests [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [09:05:21] (03PS3) 10Jgiannelos: Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (owner: 10Zabe) [09:05:21] (03PS3) 10Jgiannelos: Temporarily disable some parser tests [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [09:06:51] (03CR) 10Dpogorzelski: [C:03+1] ml-services: update production image for RRLA and revscoring model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282922 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [09:06:57] (03CR) 10CI reject: [V:04-1] ProofreadPageTestCase: Don't write globals in tests, that's not good, m'kay? 
[extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282924 (owner: 10Jgiannelos) [09:07:24] (03Abandoned) 10Jgiannelos: ProofreadPageTestCase: Don't write globals in tests, that's not good, m'kay? [extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282924 (owner: 10Jgiannelos) [09:07:31] (03CR) 10AikoChou: [C:03+2] ml-services: update production image for RRLA and revscoring model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282922 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [09:08:15] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for augeas [puppet] - 10https://gerrit.wikimedia.org/r/1282921 (owner: 10Muehlenhoff) [09:08:25] (03PS1) 10Ayounsi: ulsfo: update switch monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1282925 (https://phabricator.wikimedia.org/T425399) [09:08:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[09:08:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [09:10:28] (03Merged) 10jenkins-bot: ml-services: update production image for RRLA and revscoring model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282922 (https://phabricator.wikimedia.org/T416384) (owner: 10AikoChou) [09:12:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T419961)', diff saved to https://phabricator.wikimedia.org/P92260 and previous config saved to /var/cache/conftool/dbconfig/20260505-091254-fceratto.json [09:13:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance [09:14:00] (03PS1) 10Federico Ceratto: common, site, ferm: Remove dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) [09:14:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P92262 and previous config saved to /var/cache/conftool/dbconfig/20260505-091423-fceratto.json [09:14:58] (03CR) 10CI reject: [V:04-1] Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (owner: 10Zabe) [09:15:45] (03CR) 10Jelto: [C:03+2] miscweb: make sidecar image configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282923 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [09:18:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance [09:18:08] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T419961)', diff saved to https://phabricator.wikimedia.org/P92263 and previous config saved to /var/cache/conftool/dbconfig/20260505-091808-fceratto.json [09:18:32] (03Merged) 10jenkins-bot: miscweb: make sidecar image configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282923 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [09:20:31] (03CR) 10Marostegui: [C:04-1] "Missing dborch1001 yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1278452 (https://phabricator.wikimedia.org/T416582) (owner: 10Federico Ceratto) [09:21:36] (03CR) 10Elukey: [C:03+2] profile::kafka::mirror::alerts: fix prometheus URLs [puppet] - 10https://gerrit.wikimedia.org/r/1282919 (owner: 10Elukey) [09:23:58] (03CR) 10Tiziano Fogli: ulsfo: update switch monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1282925 (https://phabricator.wikimedia.org/T425399) (owner: 10Ayounsi) [09:24:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T419635)', diff saved to https://phabricator.wikimedia.org/P92264 and previous config saved to /var/cache/conftool/dbconfig/20260505-092431-fceratto.json [09:24:35] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:25:14] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [09:26:24] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:26:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[09:26:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [09:26:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T419961)', diff saved to https://phabricator.wikimedia.org/P92265 and previous config saved to /var/cache/conftool/dbconfig/20260505-092654-fceratto.json [09:28:04] (03PS1) 10Marostegui: db1174,db2221: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282927 (https://phabricator.wikimedia.org/T425388) [09:28:27] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:28:48] (03CR) 10Marostegui: [C:03+2] db1174,db2221: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1282927 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [09:28:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2221.codfw.wmnet with reason: Reimage to Trixie [09:28:48] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[09:28:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1174.eqiad.wmnet with reason: Reimage to Trixie [09:28:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2221: Reimage to Trixie [09:28:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1174: Reimage to Trixie [09:29:10] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [09:29:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2221: Reimage to Trixie [09:29:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1174: Reimage to Trixie [09:29:45] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [09:29:58] (03PS1) 10Mpostoronca: hCaptcha: Add diagnostic context to script load error logs [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282928 (https://phabricator.wikimedia.org/T424496) [09:30:05] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [09:30:24] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [09:30:25] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2221.codfw.wmnet with OS trixie [09:30:41] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1174.eqiad.wmnet with OS trixie [09:30:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Maintenance [09:31:36] (03CR) 10CI reject: [V:04-1] hCaptcha: Add diagnostic context to script load error logs [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282928 (https://phabricator.wikimedia.org/T424496) (owner: 10Mpostoronca) [09:31:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[09:31:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [09:32:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [09:33:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T419635)', diff saved to https://phabricator.wikimedia.org/P92269 and previous config saved to /var/cache/conftool/dbconfig/20260505-093305-fceratto.json [09:33:08] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:33:50] (PS4) Gehel: feat(sysctl): allow shadowing distro provided configs [puppet] - https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) [09:33:51] (PS5) Gehel: perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) [09:34:43] (CR) Gehel: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: Gehel) [09:34:45] (CR) CI reject: [V:-1] perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: Gehel) [09:35:35] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:36:13] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [09:36:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T419635)', diff saved to https://phabricator.wikimedia.org/P92270 and previous config saved to /var/cache/conftool/dbconfig/20260505-093619-fceratto.json [09:37:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P92271 and previous config saved to /var/cache/conftool/dbconfig/20260505-093703-fceratto.json [09:38:08] (PS2) Ayounsi: ulsfo: update switch monitoring [puppet] - https://gerrit.wikimedia.org/r/1282925 (https://phabricator.wikimedia.org/T425399) [09:38:08] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [09:38:49] (CR) Ayounsi: ulsfo: update switch monitoring (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1282925 (https://phabricator.wikimedia.org/T425399) (owner: Ayounsi) [09:38:55] (Abandoned) Mpostoronca: hCaptcha: Add diagnostic context to script load error logs [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1282928 (https://phabricator.wikimedia.org/T424496) (owner: Mpostoronca) [09:39:13] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [09:41:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1281964 (https://phabricator.wikimedia.org/T333765) (owner: Jon Harald Søby) [09:41:34] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [09:42:04] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:43:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2209: after reimage to trixie [09:45:53] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1174.eqiad.wmnet with reason: host reimage [09:46:00] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:46:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P92273 and previous config saved to /var/cache/conftool/dbconfig/20260505-094627-fceratto.json [09:46:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2221.codfw.wmnet with reason: host reimage [09:47:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P92274 and previous config saved to /var/cache/conftool/dbconfig/20260505-094711-fceratto.json [09:49:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1174.eqiad.wmnet with reason: host reimage [09:49:23] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [09:52:16] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [09:54:42] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:56:20] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[09:56:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P92275 and previous config saved to /var/cache/conftool/dbconfig/20260505-095635-fceratto.json [09:56:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2221.codfw.wmnet with reason: host reimage [09:57:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T419961)', diff saved to https://phabricator.wikimedia.org/P92276 and previous config saved to /var/cache/conftool/dbconfig/20260505-095719-fceratto.json [09:57:42] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance [09:57:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T419961)', diff saved to https://phabricator.wikimedia.org/P92277 and previous config saved to /var/cache/conftool/dbconfig/20260505-095749-fceratto.json [09:59:27] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:04:17] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:06:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T419961)', diff saved to https://phabricator.wikimedia.org/P92278 and previous config saved to /var/cache/conftool/dbconfig/20260505-100607-fceratto.json [10:06:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T419635)', diff saved to https://phabricator.wikimedia.org/P92279 and previous config saved to /var/cache/conftool/dbconfig/20260505-100642-fceratto.json [10:11:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1174.eqiad.wmnet with OS trixie [10:12:24] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:14:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1174: after reimage to trixie [10:14:45] (CR) FNegri: [C:+1] openstack: wmcs-enc-cli: Add subcommand to manually delete projects [puppet] - https://gerrit.wikimedia.org/r/1282879 (https://phabricator.wikimedia.org/T416588) (owner: Majavah) [10:14:57] (CR) Majavah: [C:+2] openstack: wmcs-enc-cli: Add subcommand to manually delete projects [puppet] - https://gerrit.wikimedia.org/r/1282879 (https://phabricator.wikimedia.org/T416588) (owner: Majavah) [10:15:13] (PS1) Muehlenhoff: Rename Hiera file to actually match the second durum VM in eqsin [puppet] - https://gerrit.wikimedia.org/r/1282937 (https://phabricator.wikimedia.org/T421863) [10:16:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P92281 and previous config saved to /var/cache/conftool/dbconfig/20260505-101616-fceratto.json [10:17:31] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:17:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [10:17:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:17:47] (PS1) Elukey: admin_ng: remove overrides for cfssl-issuer's endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/1282938 (https://phabricator.wikimedia.org/T416664) [10:18:37] (CR) Elukey: [C:+2] profile::kafka::mirror::prometheus_alerts: fix Prometheus instances [puppet] - https://gerrit.wikimedia.org/r/1282936 (owner: Elukey) [10:19:48] SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11888996 (MoritzMuehlenhoff) [10:19:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2221.codfw.wmnet with OS trixie [10:21:27] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:22:41] (CR) Muehlenhoff: [C:+2] Rename Hiera file to actually match the second durum VM in eqsin [puppet] - https://gerrit.wikimedia.org/r/1282937 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff) [10:23:26] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [10:23:28] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:23:53] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [10:24:03] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [10:24:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2221: after reimage to trixie [10:24:47] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [10:26:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P92283 and previous config saved to /var/cache/conftool/dbconfig/20260505-102623-fceratto.json [10:27:44] (CR) Cathal Mooney: [C:+1] "Thanks! I was going to open a task on this but perhaps it's as simple as I put it in the wrong place so let's see." [alerts] - https://gerrit.wikimedia.org/r/1282099 (owner: Ayounsi) [10:29:16] !log aikochou@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:32:48] !log aikochou@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:34:12] (PS4) Cathal Mooney: gnmic: add subscriptions to openconfig subinterface path [puppet] - https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) [10:36:19] (PS1) Majavah: openstack: puppet-enc: Exclusively read projects table [puppet] - https://gerrit.wikimedia.org/r/1282943 (https://phabricator.wikimedia.org/T416588) [10:36:21] (PS1) Majavah: openstack: puppet-enc: Stop writing and drop old project column [puppet] - https://gerrit.wikimedia.org/r/1282944 (https://phabricator.wikimedia.org/T416588) [10:36:28] (CR) Elukey: [C:+2] Add Wikifunctions' evaluator ingress endpoints to service.yaml [puppet] - https://gerrit.wikimedia.org/r/1280433 (https://phabricator.wikimedia.org/T424193) (owner: Elukey) [10:36:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T419961)', diff saved to https://phabricator.wikimedia.org/P92285 and previous config saved to /var/cache/conftool/dbconfig/20260505-103632-fceratto.json [10:36:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance [10:37:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T419961)', diff saved to https://phabricator.wikimedia.org/P92286 and previous config saved to /var/cache/conftool/dbconfig/20260505-103702-fceratto.json [10:37:22] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:39:26] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:40:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [10:40:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1227 
(T419635)', diff saved to https://phabricator.wikimedia.org/P92288 and previous config saved to /var/cache/conftool/dbconfig/20260505-104032-fceratto.json [10:40:36] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:41:08] (CR) Muehlenhoff: [C:+2] Assign the hcaptcha::proxy role to hcaptcha-proxy5003/5004 [puppet] - https://gerrit.wikimedia.org/r/1280353 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff) [10:44:08] (PS1) Elukey: profile::prometheus::alerts: rely on kafka mirror's defaults [puppet] - https://gerrit.wikimedia.org/r/1282945 [10:44:40] (CR) CI reject: [V:-1] profile::prometheus::alerts: rely on kafka mirror's defaults [puppet] - https://gerrit.wikimedia.org/r/1282945 (owner: Elukey) [10:45:20] (PS2) Elukey: profile::prometheus::alerts: rely on kafka mirror's defaults [puppet] - https://gerrit.wikimedia.org/r/1282945 [10:45:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T419961)', diff saved to https://phabricator.wikimedia.org/P92290 and previous config saved to /var/cache/conftool/dbconfig/20260505-104521-fceratto.json [10:46:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[10:46:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:48:51] (CR) Elukey: [C:+2] admin_ng: remove overrides for cfssl-issuer's endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/1282938 (https://phabricator.wikimedia.org/T416664) (owner: Elukey) [10:49:41] !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [10:49:46] !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [10:50:06] !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [10:50:09] !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [10:50:31] !log elukey@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:50:34] !log elukey@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:51:37] (CR) Elukey: [C:+2] profile::prometheus::alerts: rely on kafka mirror's defaults [puppet] - https://gerrit.wikimedia.org/r/1282945 (owner: Elukey) [10:54:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P92291 and previous config saved to /var/cache/conftool/dbconfig/20260505-105419-fceratto.json [10:54:25] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:54:56] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 242.08 ms [10:55:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P92293 and previous config saved to /var/cache/conftool/dbconfig/20260505-105529-fceratto.json [10:57:01] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[10:57:06] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:58:12] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [11:00:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1174: after reimage to trixie [11:04:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P92295 and previous config saved to /var/cache/conftool/dbconfig/20260505-110427-fceratto.json [11:04:32] (PS1) Santiago Faci: mw.testKitchen.getExperiment() -> mw.testKitchen.compat.getExperiment() [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1280210 (https://phabricator.wikimedia.org/T419513) (owner: Phuedx) [11:05:23] !log ayounsi@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4009*} and A:liberica [11:05:32] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4009*} and A:liberica [11:05:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P92296 and previous config saved to /var/cache/conftool/dbconfig/20260505-110537-fceratto.json [11:06:38] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11889092 (MoritzMuehlenhoff) [11:07:24] !log installing multipart bugfix updates from bookworm point release [11:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2221: after reimage to trixie [11:09:47] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11889094 (MoritzMuehlenhoff) [11:10:31] RESOLVED: LibericaStaleConfig: Liberica instance lvs4009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4009 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [11:10:33] !log installing ca-certificates updates from bookworm point release [11:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:52] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:12:26] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 5260.23 ms [11:14:07] (CR) Ayounsi: [C:+2] ulsfo: update switch monitoring [puppet] - https://gerrit.wikimedia.org/r/1282925 (https://phabricator.wikimedia.org/T425399) (owner: Ayounsi) [11:14:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P92298 and previous config saved to /var/cache/conftool/dbconfig/20260505-111435-fceratto.json [11:15:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T419961)', diff saved to https://phabricator.wikimedia.org/P92299 and previous config saved to /var/cache/conftool/dbconfig/20260505-111545-fceratto.json [11:16:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance [11:16:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T419961)', diff saved to https://phabricator.wikimedia.org/P92300 and previous config saved to /var/cache/conftool/dbconfig/20260505-111616-fceratto.json [11:18:14] (PS1) Btullis: Update the storage device that cloudelastic is using for opensearch [puppet] - https://gerrit.wikimedia.org/r/1282950 (https://phabricator.wikimedia.org/T425301) [11:18:35] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1282950 (https://phabricator.wikimedia.org/T425301) (owner: Btullis) [11:19:10] (CR) DCausse: [C:+1] Update the storage device that cloudelastic is using for opensearch [puppet] - https://gerrit.wikimedia.org/r/1282950 (https://phabricator.wikimedia.org/T425301) (owner: Btullis) [11:22:41] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 0%, RTA = 1218.12 ms [11:24:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T419961)', diff saved to https://phabricator.wikimedia.org/P92301 and previous config saved to /var/cache/conftool/dbconfig/20260505-112438-fceratto.json [11:24:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P92302 and previous config saved to /var/cache/conftool/dbconfig/20260505-112449-fceratto.json [11:24:53] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:29:43] (PS1) Ladsgroup: Close Norwegian Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282955 (https://phabricator.wikimedia.org/T421796) [11:31:28] (CR) Btullis: [C:+2] Update the storage device that cloudelastic is using for opensearch [puppet] - https://gerrit.wikimedia.org/r/1282950 (https://phabricator.wikimedia.org/T425301) 
(owner: Btullis) [11:31:50] (PS1) Slyngshede: Geo-maps: Update Meta PoPs [dns] - https://gerrit.wikimedia.org/r/1282956 [11:33:19] jouncebot: nowandnext [11:33:19] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [11:33:19] In 0 hour(s) and 26 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1200) [11:33:26] PROBLEM - Wikidough DoT Check -IPv6- on doh4003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [11:33:34] PROBLEM - Wikidough DoH Check -IPv6- on doh4004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [11:33:34] PROBLEM - Wikidough DoT Check -IPv6- on doh4004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [11:33:34] PROBLEM - Wikidough DoH Check -IPv6- on doh4003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [11:34:13] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1282955 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup) [11:34:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P92303 and previous config saved to /var/cache/conftool/dbconfig/20260505-113446-fceratto.json [11:36:02] (PS1) Ayounsi: Bird: use the GUA v6 gateway instead of link-local [puppet] - https://gerrit.wikimedia.org/r/1282958 [11:36:09] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1282958 (owner: Ayounsi) [11:36:31] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 
unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [11:36:53] (PS2) Ayounsi: Bird: use the GUA v6 gateway instead of link-local [puppet] - https://gerrit.wikimedia.org/r/1282958 [11:37:01] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1282958 (owner: Ayounsi) [11:37:06] (Merged) jenkins-bot: Close Norwegian Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282955 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup) [11:38:13] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282955|Close Norwegian Wikinews (T421796)]] [11:38:16] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [11:38:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd2003.codfw.wmnet [11:38:51] SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11889161 (ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet rebooted by jmm@cumin2002 with reason: bump memory [11:39:54] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282955|Close Norwegian Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:40:02] (CR) Nikerabbit: service::catalog: Add opensearch-ttmserver-test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1282920 (https://phabricator.wikimedia.org/T424248) (owner: Atsuko) [11:40:19] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 8182.29 ms [11:40:35] (PS3) Ayounsi: Bird: use the GUA v6 gateway instead of link-local [puppet] - https://gerrit.wikimedia.org/r/1282958 [11:40:53] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1282958 (owner: Ayounsi) [11:41:21] (PS2) Slyngshede: Geo-maps: Update Meta PoPs [dns] - https://gerrit.wikimedia.org/r/1282956 [11:42:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd2003.codfw.wmnet [11:43:17] (PS1) Ladsgroup: Close Shan Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282960 (https://phabricator.wikimedia.org/T421796) [11:43:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd2004.codfw.wmnet [11:43:23] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [11:43:47] SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11889168 (ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet rebooted by jmm@cumin2002 with reason: bump memory [11:44:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P92304 and previous config saved to /var/cache/conftool/dbconfig/20260505-114455-fceratto.json [11:45:22] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 390.40 ms [11:46:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:46:53] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [11:46:58] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:47:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd2004.codfw.wmnet [11:47:29] (PS4) Ayounsi: Bird: use the GUA v6 gateway instead of link-local [puppet] - https://gerrit.wikimedia.org/r/1282958 [11:47:34] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282955|Close Norwegian Wikinews (T421796)]] (duration: 09m 21s) [11:47:37] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [11:47:52] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1282958 (owner: Ayounsi) [11:49:59] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1282960 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup) [11:50:28] (PS2) Elukey: Turn Wikifunctions evaluator endpoints to production state [puppet] - https://gerrit.wikimedia.org/r/1280434 (https://phabricator.wikimedia.org/T424193) [11:51:02] (Merged) jenkins-bot: Close Shan Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1282960 (https://phabricator.wikimedia.org/T421796) (owner: Ladsgroup)
[11:51:18] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282960|Close Shan Wikinews (T421796)]] [11:51:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [11:51:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:52:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd2005.codfw.wmnet [11:52:30] SRE, Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11889189 (ops-monitoring-bot) VM aux-k8s-etcd2005.codfw.wmnet rebooted by jmm@cumin2002 with reason: bump memory [11:52:59] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282960|Close Shan Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:53:02] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [11:53:14] (PS1) Daniel Kinzler: Move make files to standard location [deployment-charts] - https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) [11:53:24] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [11:55:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T419961)', diff saved to https://phabricator.wikimedia.org/P92305 and previous config saved to /var/cache/conftool/dbconfig/20260505-115503-fceratto.json [11:55:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance [11:55:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T419961)', diff saved to https://phabricator.wikimedia.org/P92306 and previous config saved to /var/cache/conftool/dbconfig/20260505-115535-fceratto.json [11:56:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd2005.codfw.wmnet [11:56:11] (PS2) Daniel Kinzler: Move make files to standard location [deployment-charts] - https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) [11:56:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:57:07] (PS3) Daniel Kinzler: Move Makefiles to standard location [deployment-charts] - https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) [11:57:32] !log ladsgroup@deploy1003 
Finished scap sync-world: Backport for [[gerrit:1282960|Close Shan Wikinews (T421796)]] (duration: 06m 13s) [11:59:08] (03PS5) 10Ayounsi: Bird: use the GUA v6 gateway instead of link-local [puppet] - 10https://gerrit.wikimedia.org/r/1282958 [11:59:55] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282958 (owner: 10Ayounsi) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1200) [12:01:45] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11889215 (10MoritzMuehlenhoff) [12:03:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T419961)', diff saved to https://phabricator.wikimedia.org/P92307 and previous config saved to /var/cache/conftool/dbconfig/20260505-120344-fceratto.json [12:04:43] !log installing postgresql-13 security updates [12:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:28] RECOVERY - Wikidough DoT Check -IPv6- on doh4004 is OK: TCP OK - 0.153 second response time on 2620:0:863:3:198:35:26:101 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [12:06:28] RECOVERY - Wikidough DoH Check -IPv6- on doh4004 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [12:10:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:11:20] RECOVERY - Wikidough DoT Check -IPv6- on 
doh4003 is OK: TCP OK - 0.155 second response time on 2620:0:863:3:198:35:26:100 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [12:11:28] RECOVERY - Wikidough DoH Check -IPv6- on doh4003 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.296 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [12:12:22] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=ulsfo - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:13:00] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3176.68 ms [12:13:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P92308 and previous config saved to /var/cache/conftool/dbconfig/20260505-121352-fceratto.json [12:14:21] FIRING: [17x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:15:42] (03CR) 10Cathal Mooney: "Overall lgtm, one suggestion in-line which might make it easier. 
In terms of blast radius I'd not considered we'd the same config everywh" [puppet] - 10https://gerrit.wikimedia.org/r/1282958 (owner: 10Ayounsi) [12:16:42] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282969 [12:19:18] (03PS6) 10Ayounsi: Bird: use the GUA v6 gateway instead of link-local [puppet] - 10https://gerrit.wikimedia.org/r/1282958 [12:19:51] (03CR) 10Filippo Giunchedi: [C:03+1] openstack: puppet-enc: Exclusively read projects table [puppet] - 10https://gerrit.wikimedia.org/r/1282943 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [12:19:55] (03CR) 10Filippo Giunchedi: [C:03+1] openstack: puppet-enc: Stop writing and drop old project column [puppet] - 10https://gerrit.wikimedia.org/r/1282944 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [12:22:35] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282969 (owner: 10Muehlenhoff) [12:23:13] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [12:23:27] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [12:24:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P92309 and previous config saved to /var/cache/conftool/dbconfig/20260505-122404-fceratto.json [12:26:42] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:28:06] (03CR) 10Mszwarc: Add Akan (ak) to wmgExtraLanguageNames by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281964 (https://phabricator.wikimedia.org/T333765) (owner: 10Jon Harald Søby) [12:28:09] (03PS7) 10Ayounsi: Bird: use the GUA v6 gateway instead of link-local [puppet] - 10https://gerrit.wikimedia.org/r/1282958 [12:28:34] (03CR) 10Ayounsi: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1282958 (owner: 10Ayounsi) [12:28:51] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:29:12] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:29:40] (03CR) 10Ayounsi: Bird: use the GUA v6 gateway instead of link-local (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282958 (owner: 10Ayounsi) [12:31:11] (03PS10) 10JMeybohm: tlsproxy::envoy: Support ratelimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [12:31:20] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:31:47] (03CR) 10Gehel: "After discussion with moritzm I'm introducing a $no_priority_prefix parameter to allow shadowing." [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [12:31:52] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8513/console" [puppet] - 10https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [12:32:48] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8514/console" [puppet] - 10https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [12:33:31] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282958 (owner: 10Ayounsi) [12:33:55] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [12:33:58] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [12:34:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 
(T419961)', diff saved to https://phabricator.wikimedia.org/P92310 and previous config saved to /var/cache/conftool/dbconfig/20260505-123411-fceratto.json [12:34:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2239.codfw.wmnet with reason: Maintenance [12:35:55] hi, we (Growth team) have accidentally merged https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1280226 but I had not scheduled it for deployment. I guess this can bother deployers in the next backport window. If there are no objections I'm gonna try to deploy before the window starts [12:36:23] (03CR) 10Muehlenhoff: [C:03+1] feat(sysctl): allow shadowing distro provided configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [12:36:58] !log installing imagemagick security updates [12:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:22] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:39:25] jouncebot: nowandnext [12:39:25] For the next 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1200) [12:39:25] In 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1300) [12:39:52] sergi0: sure, looks ok to me [12:40:04] (SRE) [12:40:19] great, ty!
[12:40:33] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2240.codfw.wmnet with reason: Maintenance [12:40:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T419961)', diff saved to https://phabricator.wikimedia.org/P92311 and previous config saved to /var/cache/conftool/dbconfig/20260505-124041-fceratto.json [12:40:47] !log sgimeno@deploy1003 Started scap sync-world: Backport for [[gerrit:1280226|loggedOutWarning: instrument browser navigation and tab close (T421518)]] [12:40:49] T421518: Investigate unexpectedly low CTR totals in Logged-Out Warning Message experiment - https://phabricator.wikimedia.org/T421518 [12:41:37] !log sgimeno@deploy1003 sgimeno: Backport for [[gerrit:1280226|loggedOutWarning: instrument browser navigation and tab close (T421518)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:41:55] * sergi0 tests [12:42:26] !log installing node-tar security updates [12:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:29] (03PS1) 10Ladsgroup: Close Esperanto Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282970 (https://phabricator.wikimedia.org/T421796) [12:43:09] jouncebot: nowandnext [12:43:09] For the next 0 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1200) [12:43:09] In 0 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1300) [12:43:18] (03PS1) 10Ayounsi: ulsfo: re-add old switch [puppet] - 10https://gerrit.wikimedia.org/r/1282971 (https://phabricator.wikimedia.org/T425399) [12:43:21] !log sgimeno@deploy1003 sgimeno: Continuing with deployment [12:43:40] sergi0: I accidentally clicked on it [12:43:54] ah I was wondering, no worries, lgtm [12:44:01] sorry [12:44:03] :D [12:44:10] (03CR) 
10Tiziano Fogli: [C:03+1] ulsfo: re-add old switch [puppet] - 10https://gerrit.wikimedia.org/r/1282971 (https://phabricator.wikimedia.org/T425399) (owner: 10Ayounsi) [12:44:16] (03PS1) 10Jelto: miscweb: add a second sidecar for wmf-navigator data sync [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282973 (https://phabricator.wikimedia.org/T414405) [12:44:43] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1280226|loggedOutWarning: instrument browser navigation and tab close (T421518)]] (duration: 03m 56s) [12:47:12] sergi0: okay if I do a deployment? [12:47:34] yes, sorry, all yours [12:47:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282970 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [12:47:59] no worries! Thank you and sorry for that 😅 [12:48:26] (03CR) 10Ayounsi: [C:03+2] ulsfo: re-add old switch [puppet] - 10https://gerrit.wikimedia.org/r/1282971 (https://phabricator.wikimedia.org/T425399) (owner: 10Ayounsi) [12:48:58] (03Merged) 10jenkins-bot: Close Esperanto Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282970 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [12:49:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T419961)', diff saved to https://phabricator.wikimedia.org/P92312 and previous config saved to /var/cache/conftool/dbconfig/20260505-124907-fceratto.json [12:49:11] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282970|Close Esperanto Wikinews (T421796)]] [12:49:14] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [12:50:57] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282970|Close Esperanto Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [12:51:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [12:51:17] (03PS1) 10Muehlenhoff: Update WDQS Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1282974 [12:52:26] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [12:52:40] (03CR) 10Cathal Mooney: [C:03+1] Bird: use the GUA v6 gateway instead of link-local [puppet] - 10https://gerrit.wikimedia.org/r/1282958 (owner: 10Ayounsi) [12:53:44] (03CR) 10Gehel: feat(sysctl): allow shadowing distro provided configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [12:53:52] (03PS5) 10Gehel: feat(sysctl): allow shadowing distro provided configs [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) [12:55:10] (03CR) 10Ayounsi: CoreRouterInterfaceDropPercent: fix ping disable (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282099 (owner: 10Ayounsi) [12:55:29] (03CR) 10Elukey: [C:03+2] Turn Wikifunctions evaluator endpoints to production state [puppet] - 10https://gerrit.wikimedia.org/r/1280434 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [12:55:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1282106 (https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [12:56:12] (03CR) 10Gehel: [C:03+2] feat(sysctl): allow shadowing distro provided configs [puppet] - 10https://gerrit.wikimedia.org/r/1282319 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [12:56:13] (03CR) 10Elukey: [C:03+2] aptrepo: add otelcol-contrib thirdparty config for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1282106 
(https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [12:56:35] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282970|Close Esperanto Wikinews (T421796)]] (duration: 07m 23s) [12:56:38] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [12:57:39] (03CR) 10Elukey: [C:03+2] profile::services_proxy::envoy: add wikifunctions eval endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1280435 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [12:59:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P92313 and previous config saved to /var/cache/conftool/dbconfig/20260505-125915-fceratto.json [12:59:35] (03PS3) 10Jon Harald Søby: Add Akan (ak) to wmgExtraLanguageNames by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281964 (https://phabricator.wikimedia.org/T333765) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1300). [13:00:05] MatmaRex, Msz2001, Jhs, jakob_WMDE, and HakanIST: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:12] (03CR) 10Jon Harald Søby: Add Akan (ak) to wmgExtraLanguageNames by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281964 (https://phabricator.wikimedia.org/T333765) (owner: 10Jon Harald Søby) [13:00:16] o/ [13:00:17] I can deploy [13:00:20] o/ [13:00:31] hi [13:00:38] MatmaRex: you're first in the queue. Are you around? 
[13:02:01] If not, I'll start with my patch, not to delay the window [13:02:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282850 (https://phabricator.wikimedia.org/T418484) (owner: 10Mszwarc) [13:02:33] RECOVERY - Check correctness of the icinga configuration on alert1002 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [13:03:18] (03Merged) 10jenkins-bot: Switch 'autoconfirmed' to use APCOND_AGE_FROM_EDIT on certain wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282850 (https://phabricator.wikimedia.org/T418484) (owner: 10Mszwarc) [13:03:35] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1282850|Switch 'autoconfirmed' to use APCOND_AGE_FROM_EDIT on certain wikis (T418484)]] [13:03:38] T418484: Reconfigure autoconfirmed group so that account age is counted from first edit and not registration - https://phabricator.wikimedia.org/T418484 [13:03:39] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (DIFF 73 CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compil" [puppet] - 10https://gerrit.wikimedia.org/r/1228995 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [13:04:22] jouncebot: now [13:04:22] For the next 0 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1300) [13:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [13:04:42] hi, sorry i'm late. 
my changes should be no-ops, i hope someone deploying can ship them out [13:04:50] Sure, I'll do it [13:05:04] (but after my changes finish deploying) [13:05:25] thanks [13:05:41] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1282850|Switch 'autoconfirmed' to use APCOND_AGE_FROM_EDIT on certain wikis (T418484)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:07:17] (03CR) 10Ssingh: "Hi folks: Asking for planning reasons since it is a big change: when we are planning to merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/1282958 (owner: 10Ayounsi) [13:07:19] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [13:07:46] (03CR) 10Muehlenhoff: [C:03+2] Avoid false positive alerts after Ganeti master failover [puppet] - 10https://gerrit.wikimedia.org/r/1272701 (owner: 10Muehlenhoff) [13:09:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P92314 and previous config saved to /var/cache/conftool/dbconfig/20260505-130923-fceratto.json [13:10:03] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 90%, RTA = 1144.19 ms [13:10:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet [13:11:18] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [13:11:24] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [13:11:30] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282850|Switch 'autoconfirmed' to use APCOND_AGE_FROM_EDIT on certain wikis (T418484)]] (duration: 07m 55s) [13:11:33] T418484: Reconfigure autoconfirmed group so that account age is counted from first edit and not registration - https://phabricator.wikimedia.org/T418484 [13:11:35] (03CR) 10Mszwarc: "recheck" 
[extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [13:12:23] Jhs: can I deploy your patch together with MatmaRex's? [13:12:30] Msz2001, sure [13:12:42] (03CR) 10Majavah: [C:03+2] openstack: puppet-enc: Exclusively read projects table [puppet] - 10https://gerrit.wikimedia.org/r/1282943 (https://phabricator.wikimedia.org/T416588) (owner: 10Majavah) [13:13:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690) (owner: 10D3r1ck01) [13:13:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) (owner: 10Bartosz Dziewoński) [13:13:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281964 (https://phabricator.wikimedia.org/T333765) (owner: 10Jon Harald Søby) [13:14:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet [13:14:41] (03Merged) 10jenkins-bot: Remove temporary `wgOAuth2UsePrefixedSub` feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690) (owner: 10D3r1ck01) [13:14:45] (03Merged) 10jenkins-bot: Move privileged global and local group handling to WikimediaCustomizations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) (owner: 10Bartosz Dziewoński) [13:14:49] (03Merged) 10jenkins-bot: Add Akan (ak) to wmgExtraLanguageNames by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281964 (https://phabricator.wikimedia.org/T333765) (owner: 10Jon Harald Søby) [13:15:02] !log mszwarc@deploy1003 Started
scap sync-world: Backport for [[gerrit:1270882|Remove temporary `wgOAuth2UsePrefixedSub` feature flag (T417690)]], [[gerrit:1271969|Move privileged global and local group handling to WikimediaCustomizations (T418507)]], [[gerrit:1281964|Add Akan (ak) to wmgExtraLanguageNames by default (T333765 T425256)]] [13:15:10] T417690: Remove $wgOAuth2UsePrefixedSub - https://phabricator.wikimedia.org/T417690 [13:15:11] T418507: Move wmfGetPrivilegedGroups(), $wmgPrivilegedGroups, $wmgPrivilegedGlobalGroups, GetSecurityLogContext and PasswordPoliciesForUser hook handlers to WikimediaCustomizations - https://phabricator.wikimedia.org/T418507 [13:15:11] T333765: Remove Akan support from MediaWiki, ULS, and Wikimedia servers - https://phabricator.wikimedia.org/T333765 [13:15:11] T425256: Incorrect display text for 'ak' language in selector - https://phabricator.wikimedia.org/T425256 [13:15:32] HakanIST: Your patch is affected by one of Scribunto's tests failing for wmf.26 - and CI won't merge it [13:16:08] Msz2001: okay, should I try later time? [13:16:25] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:37] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: NodeTextfileStale (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T424001#11889488 (10MoritzMuehlenhoff) 05Open→03Resolved The .prom files were stale on hosts which were formerly a Ganeti master, and then fai... [13:16:45] !log mszwarc@deploy1003 mszwarc, jhsoby, matmarex, d3r1ck01: Backport for [[gerrit:1270882|Remove temporary `wgOAuth2UsePrefixedSub` feature flag (T417690)]], [[gerrit:1271969|Move privileged global and local group handling to WikimediaCustomizations (T418507)]], [[gerrit:1281964|Add Akan (ak) to wmgExtraLanguageNames by default (T333765 T425256)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug [13:16:45] ). Changes can now be verified there. 
[13:17:00] You can try after the issue is resolved, or if it's not urgent, just wait for 1.47-wmf.1 to roll out [13:17:12] jhs, MatmaRex: Anything to check for your patches? [13:17:28] Msz2001, yeah, i'm almost done checking, give me 1 minute [13:17:32] (03PS1) 10Elukey: services: add evaluator's listeners to Wikifunction's orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282977 (https://phabricator.wikimedia.org/T424193) [13:17:34] sure [13:17:36] testing [13:18:23] Msz2001, mine seems to be working as intended everywhere 👍 [13:18:55] looks good here as well [13:19:22] ack, continuing [13:19:27] !log mszwarc@deploy1003 mszwarc, jhsoby, matmarex, d3r1ck01: Continuing with deployment [13:19:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T419961)', diff saved to https://phabricator.wikimedia.org/P92315 and previous config saved to /var/cache/conftool/dbconfig/20260505-131931-fceratto.json [13:19:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2245.codfw.wmnet with reason: Maintenance [13:20:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2245 (T419961)', diff saved to https://phabricator.wikimedia.org/P92316 and previous config saved to /var/cache/conftool/dbconfig/20260505-132002-fceratto.json [13:20:15] (03PS1) 10Ssingh: varnish: remove CSP (and Report-Only) from VCL [puppet] - 10https://gerrit.wikimedia.org/r/1282979 (https://phabricator.wikimedia.org/T420604) [13:21:08] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8516/co" [puppet] - 10https://gerrit.wikimedia.org/r/1282979 (https://phabricator.wikimedia.org/T420604) (owner: 10Ssingh) [13:21:51] jakob_WMDE: should I also deploy your patch or would you prefer to deploy it yourself (if you have deploy rights)? 
[13:22:04] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on dborch1002.wikimedia.org with reason: T416582 [13:22:07] T416582: Migrate orchestrator to Trixie - https://phabricator.wikimedia.org/T416582 [13:22:19] Msz2001: would be great if you could deploy it! thanks :) [13:22:40] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:22:43] No problem, just a minute, as the previous deployment is finishing [13:23:24] (03PS2) 10Jforrester: wikifunctions: add evaluators' listeners to the orchestrators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282977 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [13:23:24] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:23:30] (03CR) 10Jforrester: [C:03+1] wikifunctions: add evaluators' listeners to the orchestrators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282977 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [13:23:39] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1270882|Remove temporary `wgOAuth2UsePrefixedSub` feature flag (T417690)]], [[gerrit:1271969|Move privileged global and local group handling to WikimediaCustomizations (T418507)]], [[gerrit:1281964|Add Akan (ak) to wmgExtraLanguageNames by default (T333765 T425256)]] (duration: 08m 37s) [13:23:43] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [13:23:47] T417690: Remove $wgOAuth2UsePrefixedSub - https://phabricator.wikimedia.org/T417690 [13:23:47] T418507: Move wmfGetPrivilegedGroups(), $wmgPrivilegedGroups, $wmgPrivilegedGlobalGroups, GetSecurityLogContext and PasswordPoliciesForUser hook handlers to WikimediaCustomizations - https://phabricator.wikimedia.org/T418507 [13:23:48] T333765: Remove Akan support from MediaWiki, ULS, and Wikimedia servers - https://phabricator.wikimedia.org/T333765 [13:23:48] T425256: Incorrect 
display text for 'ak' language in selector - https://phabricator.wikimedia.org/T425256 [13:24:24] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [13:24:54] (03CR) 10Eevans: [C:03+2] decommission aqs101[0-2,4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1282386 (https://phabricator.wikimedia.org/T425357) (owner: 10Eevans) [13:26:37] jakob_WMDE: scap reports this patch as not deployable to production (1.47-wmf.1 is not yet on any wiki). I can instead CR+2 the change and have it merged to the branch, so it rolls with the train. Is it okay? Or would you prefer to deploy it when it can be tested? [13:27:04] !log eevans@cumin1003 START - Cookbook sre.hosts.decommission for hosts aqs1010.eqiad.wmnet [13:27:34] Msz2001: having it roll out with the train is fine! I checked on beta that it works :) [13:27:49] (03CR) 10Mszwarc: [C:03+2] Fix LemmaLanguageField after core change [extensions/WikibaseLexeme] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282931 (owner: 10Jakob) [13:28:17] Fine, +2'd it [13:28:27] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:50] The backport window is done then [13:28:52] Msz2001: I did ask about this earlier, and I believe taavi told me to deploy it as a backport instead of merging it [13:29:07] Ah, sorry. 
I can remove +2 [13:29:15] (03CR) 10Mszwarc: "per IRC" [extensions/WikibaseLexeme] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282931 (owner: 10Jakob) [13:29:46] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11889565 (10Papaul) 05Open→03Resolved We can close this [13:29:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T419961)', diff saved to https://phabricator.wikimedia.org/P92317 and previous config saved to /var/cache/conftool/dbconfig/20260505-132952-fceratto.json [13:30:08] !log UTC afternoon backport window done [13:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:16] (03CR) 10HakanIST: "recheck" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [13:31:00] (03CR) 10Jforrester: "I think we need to land both this and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/1281895 (with one force-merged)." [extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282924 (owner: 10Jgiannelos) [13:31:00] Msz2001: taavi said earlier "wmf.1 already exists on the deployment server, and usually at this time would've been synced to test wikis automatically, but the test wiki sync seems to have failed for this train some reason. I would wait for the testwikis issue to be sorted and then backport as usual" - do you know if that's still relevant? 
I also would've thought just merging to the 1.47.1 branch would have been [13:31:00] fine [13:32:18] Per https://versions.toolforge.org/ testwikis are still at wmf.26, so it might not be sorted yet [13:33:02] but it seems a bit silly to roll out the faulty branch when we have the fix in master already :/ [13:33:26] !log eevans@cumin1003 START - Cookbook sre.dns.netbox [13:34:01] (03PS3) 10Blake: k8s: Remove support for k8s versions before 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) [13:34:09] (03CR) 10Elukey: [C:03+2] wikifunctions: add evaluators' listeners to the orchestrators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282977 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [13:34:31] (03CR) 10CI reject: [V:04-1] k8s: Remove support for k8s versions before 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) (owner: 10Blake) [13:34:43] (03Restored) 10Jforrester: ProofreadPageTestCase: Don't write globals in tests, that's not good, m'kay? [extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282924 (owner: 10Jgiannelos) [13:34:55] James_F: o/ if you have time we can deploy the mesh changes and see how they look [13:35:23] elukey: I'm in continuous meetings for the next 6 hours (oy), but I can keep an eye out? [13:35:47] James_F: I envy you so much! :D yeah I can take care of those [13:36:01] <3 [13:36:56] jakob_WMDE: Hm, I see the point - if it was the normal state before attempting to roll out to the first group of wikis, merging probably shouldn't have any side effect, but right now, I'd prefer not to add anything additional around fixing the issue with wmf.1 rolling to testwikis.
(maybe it's just precaution, but waiting a few hours also shouldn't cause any disaster IMO) [13:37:00] !log eevans@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1003" [13:37:17] !log eevans@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1003" [13:37:17] !log eevans@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:37:18] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1010.eqiad.wmnet [13:37:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425159#11889621 (10Jclark-ctr) 05Open→03Resolved [13:37:41] (03PS4) 10Blake: k8s: Remove support for k8s versions before 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) [13:37:47] !log eevans@cumin1003 START - Cookbook sre.hosts.decommission for hosts aqs1011.eqiad.wmnet [13:37:55] Msz2001: ok, get. then I'll schedule it for another backport window. thanks! 
[13:37:59] *get it [13:38:07] yw :) [13:38:25] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#11889625 (10Jgreen) If Fundraising is the only remaining nsca user, this service can be shut down, see {T425424} [13:39:12] (03CR) 10Elukey: [C:03+1] tlsproxy::envoy: Bump default now that services have moved [puppet] - 10https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [13:39:33] (03PS1) 10Ladsgroup: Close Portuguese Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282988 (https://phabricator.wikimedia.org/T421796) [13:40:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P92318 and previous config saved to /var/cache/conftool/dbconfig/20260505-134000-fceratto.json [13:41:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikibaseLexeme] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282931 (owner: 10Jakob) [13:42:02] (03PS1) 10Elukey: otelcol: upgrade to Trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1282989 (https://phabricator.wikimedia.org/T416452) [13:42:04] (03PS2) 10Ladsgroup: Close Portuguese Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282988 (https://phabricator.wikimedia.org/T421796) [13:42:50] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [13:42:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T419635)', diff saved to https://phabricator.wikimedia.org/P92319 and previous config saved to /var/cache/conftool/dbconfig/20260505-134257-fceratto.json [13:43:01] T419635: Drop il_to column from 
imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:43:04] (03CR) 10Elukey: "~/Wikimedia/production-images$ docker-pkg build images/ --select *otelcol*" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1282989 (https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [13:43:07] jouncebot: nowandnext [13:43:08] For the next 0 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1300) [13:43:08] In 0 hour(s) and 16 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1400) [13:43:46] !log jasmine@cumin2002 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-main-codfw cluster: Change Confluent distribution. [13:43:54] o/ [13:44:01] sorry I lost track of time, what’s the status of jakob_WMDE’s change now? [13:44:05] !log eevans@cumin1003 START - Cookbook sre.dns.netbox [13:44:12] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1227: Repooling [13:44:16] It's not merged [13:44:17] Lucas_WMDE: I scheduled it for another backport window tomorrow [13:44:28] is there a reason no to deploy it now? 
[13:44:30] (I can deploy if needed) [13:44:33] *not to [13:44:37] (03CR) 10Jasmine: [C:03+2] kafka-main: set main-codfw cluster brokers to Confluent distro 77 (3.7) [puppet] - 10https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [13:44:43] scap says it's undeployable (wmf.1 is not in prod yet) [13:45:04] o_O [13:45:12] ok let me take a look then [13:45:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [13:45:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T419635)', diff saved to https://phabricator.wikimedia.org/P92321 and previous config saved to /var/cache/conftool/dbconfig/20260505-134522-fceratto.json [13:45:23] assuming my SSH will let me [13:45:23] if there is nothing being deployed right now. May I close two more wikinews wikis? [13:45:36] Go ahead [13:45:47] Thanks [13:45:55] well, I think we do have something to deploy… [13:45:58] but I guess you can do wikinews anyway [13:46:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282988 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [13:46:29] Ah, sorry, I misunderstood your comment, Lucas [13:46:52] The scap's output that I saw was: [13:46:53] 13:25:09 Checking whether requested changes are valid for backport... 
[13:46:53] Change '1282931', project: 'mediawiki/extensions/WikibaseLexeme', branch: 'wmf/1.47.0-wmf.1' is not deployable to production [13:47:16] Msz2001: I think what we want to do is merge the backport on Gerrit and then just `git pull` on deployment.eqiad.wmnet [13:47:19] and then it’ll roll out with the train [13:47:22] but I’d like to SSH in first [13:47:29] might need to reboot, not sure why I can’t SSH [13:47:32] (03Merged) 10jenkins-bot: Close Portuguese Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1282988 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [13:47:35] (anything up with bast3007? lemme try another…) [13:47:39] !log eevans@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1011.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1003" [13:47:47] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1282988|Close Portuguese Wikinews (T421796)]] [13:47:50] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [13:47:56] !log eevans@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1011.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1003" [13:47:56] !log eevans@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:47:57] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1011.eqiad.wmnet [13:48:42] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11889662 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:48:43] bast3007 works for me [13:49:11] okay let me reboot then [13:49:12] !log jmm@puppetserver1001 conftool action : set/pooled=false; selector: 
dnsdisc=config-master,name=codfw [13:49:24] (good idea anyway, pull in the copy.fail fix and all that) [13:49:29] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1282988|Close Portuguese Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:49:44] thanks for confirming ^^ [13:49:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [13:49:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:50:00] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [13:50:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P92323 and previous config saved to /var/cache/conftool/dbconfig/20260505-135008-fceratto.json [13:50:31] Amir1: can you ping me when you’re done? [13:50:34] !log eevans@cumin1003 START - Cookbook sre.hosts.decommission for hosts aqs1014.eqiad.wmnet [13:50:46] sure [13:50:48] thx [13:50:58] almost done [13:51:23] (03CR) 10SBassett: [C:03+1] varnish: remove CSP (and Report-Only) from VCL [puppet] - 10https://gerrit.wikimedia.org/r/1282979 (https://phabricator.wikimedia.org/T420604) (owner: 10Ssingh) [13:53:11] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:39] hmmmm. 
I think I can IPv4 to bast3007, but not IPv6 😱 [13:54:01] and unfortunately, `ssh -4` doesn’t seem to apply to the ProxyJump, it’s still trying to use IPv6 for the bastion /o\ [13:54:05] (my eyeballs are not happy) [13:54:09] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282988|Close Portuguese Wikinews (T421796)]] (duration: 06m 22s) [13:54:12] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [13:54:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [13:54:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:55:04] okay, `AddressFamily inet` in `.ssh/config` works [13:55:10] I’ll report the network issue later [13:55:28] !log eevans@cumin1003 START - Cookbook sre.dns.netbox [13:56:04] Lucas_WMDE: I'm done [13:56:08] thanks! 
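A minimal sketch of the workaround described above, assuming a setup where internal hosts are reached through a bastion with ProxyJump. The likely reason `ssh -4` did not help is that ProxyJump starts a separate ssh process for the bastion hop, and command-line flags from the outer invocation are not passed on to it, whereas `AddressFamily inet` in `.ssh/config` is read by both. The host patterns below are placeholders and are not taken from the log.

```
# ~/.ssh/config (sketch; host names are hypothetical)

# Force IPv4 for the bastion hop itself.
Host bastion.example.org
    AddressFamily inet

# Internal hosts are reached through the bastion.
Host *.internal.example
    ProxyJump bastion.example.org
```

Scoping `AddressFamily inet` to the bastion entry keeps IPv6 available for every other host; putting it under `Host *` would force IPv4 everywhere, which is the blunter option.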
[13:56:48] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "merging – we can’t scap-merge this (scap refuses because the branch isn’t deployed yet), but merging and then pulling it to the deployment" [extensions/WikibaseLexeme] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282931 (owner: 10Jakob) [13:56:52] ^ fyi jakob_WMDE Msz2001 [13:57:16] ack [13:57:31] ack [13:58:19] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [13:58:23] (03Merged) 10jenkins-bot: Fix LemmaLanguageField after core change [extensions/WikibaseLexeme] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282931 (owner: 10Jakob) [13:58:48] “I’ll report the network issue later” – nevermind, seems to be a general IPv6 issue on my end, so not worth reporting on the Wikimedia side [13:58:48] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [13:58:49] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:59:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM config-master2001.codfw.wmnet [13:59:17] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:59:27] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1227: Repooling [13:59:40] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11889729 (10ops-monitoring-bot) VM config-master2001.codfw.wmnet rebooted by jmm@cumin2002 with reason: bump memory [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1400) [14:00:08] (03PS1) 10Cathal Mooney: Netops-fundraising: add ignore rule for missing series [alerts] - 10https://gerrit.wikimedia.org/r/1282995 [14:00:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T419961)', diff saved to 
https://phabricator.wikimedia.org/P92325 and previous config saved to /var/cache/conftool/dbconfig/20260505-140016-fceratto.json [14:00:23] hmm, /srv/mediawiki-staging/php-1.47.0-wmf.1 is actually *three* commits behind upstream [14:00:40] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2246.codfw.wmnet with reason: Maintenance [14:00:41] “Branch commit for wmf/1.47.0-wmf.1”, update Cite, and update WikibaseLexeme (the change we want to deploy) [14:00:46] feels like I shouldn’t git pull then [14:00:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2246 (T419961)', diff saved to https://phabricator.wikimedia.org/P92326 and previous config saved to /var/cache/conftool/dbconfig/20260505-140047-fceratto.json [14:00:49] * Lucas_WMDE looks up the train conductors [14:01:01] !log eevans@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1003" [14:01:17] !log eevans@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1003" [14:01:17] !log eevans@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:01:18] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1014.eqiad.wmnet [14:01:32] !log eevans@cumin1003 START - Cookbook sre.hosts.decommission for hosts aqs1015.eqiad.wmnet [14:02:32] brennen, jeena: FYI, we just merged a 1.47.0-wmf.1 backport (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexeme/+/1282931); scap won’t deploy it yet because this branch isn’t deployed anywhere, and I also didn’t git pull it on the deployment host because that was already behind Gerrit by a few other commits. 
hopefully it’ll [14:02:32] all be good to go by the time the train is ready to roll out [14:02:49] !log jasmine@cumin2002 END (FAIL) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=99) Change Confluent distribution for Kafka A:kafka-main-codfw cluster: Change Confluent distribution. [14:02:52] I think I would leave it at that. any objections jakob_WMDE Msz2001? [14:03:10] fine with me, thanks! [14:03:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM config-master2001.codfw.wmnet [14:03:25] no concerns from me [14:03:34] alright [14:03:43] !log UTC afternoon backport+config window done [14:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:47] thanks a lot for deploying Msz2001! [14:03:50] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [14:03:52] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [14:03:53] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [14:03:53] * Lucas_WMDE hopes the train will go alright [14:03:57] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [14:04:01] yw :) [14:04:35] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync [14:04:42] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [14:04:48] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "Update: I didn’t pull this on the deployment host after all, because `php-1.47.0-wmf.1` was already two (now three) commits behind Gerrit." 
[extensions/WikibaseLexeme] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1282931 (owner: 10Jakob) [14:05:01] !log jmm@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=config-master,name=codfw [14:05:10] !log jmm@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=config-master,name=eqiad [14:05:55] !log eevans@cumin1003 START - Cookbook sre.dns.netbox [14:06:11] (03PS2) 10Muehlenhoff: tlsproxy::envoy: Bump default now that services have moved [puppet] - 10https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993) [14:06:58] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [14:07:00] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [14:07:01] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [14:07:05] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [14:07:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [14:07:43] FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:07:54] Lucas_WMDE: i see you +2'ed a backport. are you deploying? 
[14:08:03] * urbanecm has a would-be train blocker fix [14:08:13] urbanecm: I’m not deploying anything [14:08:28] the backport is +2ed but on a non-deployed branch so there’s nothing to do beyond the merge [14:08:29] go ahead [14:08:44] ah, makes sense [14:09:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet [14:09:26] * urbanecm sees the pa.ging alert [14:09:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T419961)', diff saved to https://phabricator.wikimedia.org/P92327 and previous config saved to /var/cache/conftool/dbconfig/20260505-140933-fceratto.json [14:09:34] !log eevans@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1003" [14:09:50] !log eevans@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aqs1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1003" [14:09:50] !log eevans@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:09:51] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aqs1015.eqiad.wmnet [14:10:12] elukey: jhathaway: should i wait for you to check the alert? 
[14:10:39] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS trixie [14:10:50] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission aqs10[10-12,14-15].eqiad.wmnet - https://phabricator.wikimedia.org/T425357#11889763 (10Eevans) a:05Eevans→03None [14:10:59] !log herron@cumin1003 START - Cookbook sre.hosts.move-vlan for host kafka-logging1004 [14:10:59] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-logging1004 [14:12:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:12:11] urbanecm: checking, it didn't page my phone. I think it is ulsfo downtime expired, the dc is depooled [14:12:30] ty, waiting for confirmation [14:13:14] (03PS4) 10Jforrester: Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:13:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet [14:13:32] (03PS2) 10Jforrester: ProofreadPageTestCase: Don't write globals in tests, that's not good, m'kay? 
[extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282924 (https://phabricator.wikimedia.org/T425401) (owner: 10Jgiannelos) [14:13:39] (03CR) 10CI reject: [V:04-1] Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:13:45] urbanecm: yeah go ahead [14:13:51] proceeding, ty [14:14:04] urbanecm: I have a different(?) train blocker fix with T425401 BTW. [14:14:05] T425401: Backports currently impossible on the wmf/1.46 branch - https://phabricator.wikimedia.org/T425401 [14:14:34] (03CR) 10Jforrester: "recheck" [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:15:09] James_F: i can just +2 both in one go? [14:15:17] seems like for yours, there's not much to test anyway [14:15:25] or you can start and i'll continue then [14:15:30] urbanecm: No, I think we'll need to force-merge one and then see if CI passes on the other. [14:15:38] (03PS1) 10Herron: kafka-logging1004: prep for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1282997 (https://phabricator.wikimedia.org/T417001) [14:15:51] ah, ok. then please start and ping me when i can take over? [14:16:17] (03CR) 10Jforrester: [V:03+2 C:03+2] "Don't do this at home, folks." [extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282924 (https://phabricator.wikimedia.org/T425401) (owner: 10Jgiannelos) [14:16:23] (03CR) 10CI reject: [V:04-1] ProofreadPageTestCase: Don't write globals in tests, that's not good, m'kay? 
[extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282924 (https://phabricator.wikimedia.org/T425401) (owner: 10Jgiannelos) [14:17:52] (03CR) 10Jforrester: [C:03+2] Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:19:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P92328 and previous config saved to /var/cache/conftool/dbconfig/20260505-141941-fceratto.json [14:20:05] (03PS5) 10Jforrester: Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:20:06] (03PS1) 10Jasmine: kafka-main: add eqiad-main cluster brokers to Confluent distro 77 (3.7) [puppet] - 10https://gerrit.wikimedia.org/r/1282999 (https://phabricator.wikimedia.org/T419216) [14:20:08] (03CR) 10Jforrester: "…" [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:22:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T419635)', diff saved to https://phabricator.wikimedia.org/P92329 and previous config saved to /var/cache/conftool/dbconfig/20260505-142206-fceratto.json [14:22:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:22:10] (03PS1) 10Andrew Bogott: Designate: move zookeeper config into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) [14:22:25] OK, I've killed off all the REL* patches so now the Scribunto patch might even actually run CI. 
[14:22:33] poor patches :( [14:22:40] (03CR) 10CI reject: [V:04-1] Designate: move zookeeper config into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:22:45] (03PS2) 10Jelto: miscweb: add a second sidecar for wmf-navigator data sync [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282973 (https://phabricator.wikimedia.org/T414405) [14:22:54] The depends-on directive shouldn't have triggered on them but did. :-( [14:23:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:23:11] RESOLVED: [8x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM config-master1001.eqiad.wmnet [14:24:56] (03PS1) 10Elukey: wikifunctions: fix staging configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283001 [14:25:08] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11889821 (10ops-monitoring-bot) VM config-master1001.eqiad.wmnet rebooted by jmm@cumin2002 with reason: bump memory [14:25:31] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1004.eqiad.wmnet with reason: host reimage [14:25:44] (03PS1) 10Urbanecm: fix: wrong property name action_data [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283002 (https://phabricator.wikimedia.org/T425425) [14:26:16] 
(03PS1) 10Ladsgroup: Close Tamil Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283003 (https://phabricator.wikimedia.org/T421796) [14:26:26] (03PS2) 10Andrew Bogott: Designate: move zookeeper config into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) [14:26:41] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 771346592 and 61 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:27:14] (03PS1) 10Urbanecm: fix: wrong property name action_data [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283004 (https://phabricator.wikimedia.org/T425425) [14:27:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:28:01] (03CR) 10Jgiannelos: Temporarily disable some parser tests [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [14:28:14] (03CR) 10Elukey: [C:03+2] wikifunctions: fix staging configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283001 (owner: 10Elukey) [14:28:41] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 130216 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:28:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM config-master1001.eqiad.wmnet [14:28:53] (03CR) 10Jelto: [C:03+2] miscweb: add a second sidecar for wmf-navigator data sync [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282973 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:28:55] (03CR) 10Jforrester: [C:04-2] "Please do not merge to production when we're fixing things." 
[extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [14:28:57] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1004.eqiad.wmnet with reason: host reimage [14:29:27] (03CR) 10Ladsgroup: "And it won't have any impact any way. The core patch is not merged" [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [14:29:47] (03PS1) 10Jforrester: IndexAndPageLibrarySandboxTest: Disable for now, broken [extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283005 [14:29:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P92331 and previous config saved to /var/cache/conftool/dbconfig/20260505-142949-fceratto.json [14:29:58] (03PS2) 10Jforrester: IndexAndPageLibrarySandboxTest: Disable for now, broken [extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283005 (https://phabricator.wikimedia.org/T425401) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1430) [14:30:05] (03CR) 10Jforrester: [V:03+2 C:03+2] IndexAndPageLibrarySandboxTest: Disable for now, broken [extensions/ProofreadPage] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283005 (https://phabricator.wikimedia.org/T425401) (owner: 10Jforrester) [14:30:57] !log jmm@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=config-master,name=eqiad [14:31:13] (03Abandoned) 10Jforrester: Temporarily disable some parser tests [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282810 (owner: 10Awight) [14:31:32] urbanecm: Should T425425 be marked as a train blocker? 
[14:31:33] T425425: mediawiki.product_metrics.contributors.experiments should NOT have additional properties - https://phabricator.wikimedia.org/T425425 [14:31:39] (03Merged) 10jenkins-bot: miscweb: add a second sidecar for wmf-navigator data sync [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282973 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:32:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P92332 and previous config saved to /var/cache/conftool/dbconfig/20260505-143214-fceratto.json [14:32:20] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11889864 (10MoritzMuehlenhoff) [14:32:28] (03CR) 10CI reject: [V:04-1] Sandbox*Test: Fix CI issues for now, so that we're unblocked [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:33:08] James_F: probably. although the faulty patch is backported to .26, so... [14:33:25] urbanecm: Ack, I'll sling it into the task tree. [14:33:31] ❤️ [14:34:05] (03CR) 10Jforrester: [C:03+2] "…" [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:34:44] (03CR) 10Jforrester: [V:03+2 C:03+2] "Eh, let's just land this." [extensions/Scribunto] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281895 (https://phabricator.wikimedia.org/T425401) (owner: 10Zabe) [14:34:58] urbanecm: OK, you should now be unblocked. [14:35:04] fingers crossed! 
[14:35:06] (03CR) 10Jforrester: "recheck" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [14:35:28] (03CR) 10Urbanecm: [C:03+2] fix: wrong property name action_data [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283004 (https://phabricator.wikimedia.org/T425425) (owner: 10Urbanecm) [14:35:29] (03CR) 10Urbanecm: [C:03+2] fix: wrong property name action_data [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283002 (https://phabricator.wikimedia.org/T425425) (owner: 10Urbanecm) [14:35:34] (03PS1) 10Muehlenhoff: Record LDAP NDA status for HakanIST [puppet] - 10https://gerrit.wikimedia.org/r/1283012 (https://phabricator.wikimedia.org/T424812) [14:35:35] (03CR) 10Jforrester: "recheck" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) (owner: 10Zabe) [14:35:39] (03CR) 10Jforrester: "recheck" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [14:35:49] (03CR) 10Jforrester: "recheck" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [14:35:58] (03CR) 10Jforrester: "recheck" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282930 (https://phabricator.wikimedia.org/T424496) (owner: 10Mpostoronca) [14:37:02] 1- let me know if I can help on anything 2- once you're done, I'd be grateful if you ping me. 
I need to close more wikinews wikis [14:37:22] (10 done, 20 more to go) [14:37:43] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 167316800 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:39:43] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2589504 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:39:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T419961)', diff saved to https://phabricator.wikimedia.org/P92333 and previous config saved to /var/cache/conftool/dbconfig/20260505-143958-fceratto.json [14:40:00] Amir1: for some reason i thought that'd be closed all in one goal [14:40:03] i'll ping you once done [14:40:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2247.codfw.wmnet with reason: Maintenance [14:40:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2247 (T419961)', diff saved to https://phabricator.wikimedia.org/P92334 and previous config saved to /var/cache/conftool/dbconfig/20260505-144029-fceratto.json [14:40:48] I decided against it for two reasons: 1- Out of courtesy. Each project was years of work. 2- A lot of them have specific needs. Like clean up of DPL, etc. 
[14:41:02] Amir1: <3 [14:41:03] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:41:45] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:42:06] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [14:42:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P92335 and previous config saved to /var/cache/conftool/dbconfig/20260505-144223-fceratto.json [14:42:59] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [14:43:01] (03CR) 10Jgiannelos: "recheck" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [14:43:36] (03PS2) 10Zabe: Correctly support new file tables in RevisionDeleteUser [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281894 (https://phabricator.wikimedia.org/T424553) [14:43:43] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 793787808 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:44:28] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1004.eqiad.wmnet with OS trixie [14:44:32] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP NDA status for HakanIST [puppet] - 10https://gerrit.wikimedia.org/r/1283012 (https://phabricator.wikimedia.org/T424812) (owner: 10Muehlenhoff) [14:45:43] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3235448 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:46:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: 10Santiago Faci) [14:46:27] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting logstash-access LDAP group access for HakanIST - https://phabricator.wikimedia.org/T424812#11889930 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03MoritzMuehlenhoff The Logstash access has been enabled via Wikimedia IDM. Marking as r... [14:47:08] (03Merged) 10jenkins-bot: fix: wrong property name action_data [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283004 (https://phabricator.wikimedia.org/T425425) (owner: 10Urbanecm) [14:47:36] 07sre-alert-triage, 07Essential-Work, 06Machine-Learning-Team (Q4 FY2025-26): Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T414971#11889939 (10klausman) 05Open→03Resolved a:03klausman This originated with our changes to kserve... [14:47:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283002 (https://phabricator.wikimedia.org/T425425) (owner: 10Urbanecm) [14:48:10] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11889944 (10MoritzMuehlenhoff) [14:48:56] (03PS1) 10SBassett: Set CSP to enforce with currently-allow-listed domains in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283020 (https://phabricator.wikimedia.org/T419612) [14:49:00] (03Merged) 10jenkins-bot: fix: wrong property name action_data [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283002 (https://phabricator.wikimedia.org/T425425) (owner: 10Urbanecm) [14:49:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T419961)', diff saved to 
https://phabricator.wikimedia.org/P92336 and previous config saved to /var/cache/conftool/dbconfig/20260505-144905-fceratto.json [14:49:47] (03CR) 10CI reject: [V:04-1] Set CSP to enforce with currently-allow-listed domains in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283020 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [14:50:26] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Timeouts on puppetserver1002 past reboot - https://phabricator.wikimedia.org/T423282#11889951 (10MoritzMuehlenhoff) 05Open→03Resolved This is stable since a week, boldly resolving. It's not really clear whether the root cause was the jdk/j... [14:51:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [14:51:44] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1283002|fix: wrong property name action_data (T425425)]] [14:51:49] T425425: mediawiki.product_metrics.contributors.experiments should NOT have additional properties - https://phabricator.wikimedia.org/T425425 [14:52:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T419635)', diff saved to https://phabricator.wikimedia.org/P92337 and previous config saved to /var/cache/conftool/dbconfig/20260505-145231-fceratto.json [14:52:35] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:52:46] (03PS3) 10Andrew Bogott: P:zookeeper: Allow WMCS to use cloud-private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1282372 (https://phabricator.wikimedia.org/T422646) (owner: 10Majavah) [14:52:46] (03PS3) 10Andrew Bogott: Designate: move zookeeper config into hiera [puppet] - 
10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) [14:53:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:53:23] (03PS1) 10Muehlenhoff: profile::mail::mx: Mark the SMTP as intentionally open to the world [puppet] - 10https://gerrit.wikimedia.org/r/1283021 (https://phabricator.wikimedia.org/T149804) [14:53:25] (03PS2) 10SBassett: Set CSP to enforce with allow-listed domains in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283020 (https://phabricator.wikimedia.org/T419612) [14:53:28] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1283002|fix: wrong property name action_data (T425425)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:54:31] (03PS1) 10Muehlenhoff: Switch install7002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283022 [14:55:23] !log urbanecm@deploy1003 urbanecm: Continuing with deployment [14:55:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission aqs10[10-12,14-15].eqiad.wmnet - https://phabricator.wikimedia.org/T425357#11889972 (10Jclark-ctr) a:03Jclark-ctr [14:56:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283021 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [14:57:09] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [14:57:28] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [14:57:41] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:57:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission aqs10[10-12,14-15].eqiad.wmnet - https://phabricator.wikimedia.org/T425357#11889976 (10Jclark-ctr) Name Rack Position aqs1010 A1 5 aqs1011 B1 24 aqs1014 D1 8 aqs1015 
D8 27 [14:58:04] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:58:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission aqs10[10-12,14-15].eqiad.wmnet - https://phabricator.wikimedia.org/T425357#11889979 (10Jclark-ctr) [14:59:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P92338 and previous config saved to /var/cache/conftool/dbconfig/20260505-145913-fceratto.json [14:59:33] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283002|fix: wrong property name action_data (T425425)]] (duration: 07m 48s) [14:59:36] T425425: mediawiki.product_metrics.contributors.experiments should NOT have additional properties - https://phabricator.wikimedia.org/T425425 [14:59:39] Amir1: i am done [14:59:50] Awesome. Thanks! [15:00:04] jelto, arnoldokoth, mutante, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1500). 
[15:00:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283003 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:01:14] (03Merged) 10jenkins-bot: Close Tamil Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283003 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:01:28] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283003|Close Tamil Wikinews (T421796)]] [15:01:31] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:02:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [15:03:20] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283003|Close Tamil Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:03:38] (03Abandoned) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [15:04:26] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [15:07:31] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:08:34] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283003|Close Tamil Wikinews (T421796)]] (duration: 07m 06s) [15:08:37] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:09:09] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734#11890060 (10LSobanski) [15:09:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P92339 and previous config saved to /var/cache/conftool/dbconfig/20260505-150921-fceratto.json [15:10:09] (03CR) 10Ayounsi: [C:03+1] Switch install7002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283022 (owner: 10Muehlenhoff) [15:10:58] (03PS1) 10Ladsgroup: Close Czech Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283024 (https://phabricator.wikimedia.org/T421796) [15:10:58] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:12:09] (03PS1) 10Muehlenhoff: Remove role::mail::mx and related Puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T352394) [15:13:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283024 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:13:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T352394) (owner: 10Muehlenhoff) [15:14:02] 
(03Merged) 10jenkins-bot: Close Czech Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283024 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:14:17] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283024|Close Czech Wikinews (T421796)]] [15:14:20] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:14:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:48] (03CR) 10Ayounsi: Netops-fundraising: add ignore rule for missing series (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282995 (owner: 10Cathal Mooney) [15:15:42] (03CR) 10Cathal Mooney: Netops-fundraising: add ignore rule for missing series (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282995 (owner: 10Cathal Mooney) [15:16:00] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283024|Close Czech Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:16:08] (03Abandoned) 10Cathal Mooney: Netops-fundraising: add ignore rule for missing series [alerts] - 10https://gerrit.wikimedia.org/r/1282995 (owner: 10Cathal Mooney) [15:16:24] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [15:16:30] (03CR) 10Cathal Mooney: Netops-fundraising: add ignore rule for missing series (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1282995 (owner: 10Cathal Mooney) [15:18:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276814 (https://phabricator.wikimedia.org/T422403) (owner: 10Aaron Schulz) [15:19:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T419961)', diff saved to https://phabricator.wikimedia.org/P92340 and previous config saved to /var/cache/conftool/dbconfig/20260505-151930-fceratto.json [15:19:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[15:19:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:19:48] (03PS1) 10Trueg: openjdk-25-jre/openjdk-25-jdk [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1283027 [15:20:34] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283024|Close Czech Wikinews (T421796)]] (duration: 06m 17s) [15:20:37] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:20:40] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:20:44] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:20:50] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:21:06] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:22:48] (03PS1) 10Alex.sanford: Add messages related to mandatory 2FA for more groups [extensions/WikimediaMessages] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283028 (https://phabricator.wikimedia.org/T423119) [15:24:34] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:24:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaMessages] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283028 (https://phabricator.wikimedia.org/T423119) (owner: 10Alex.sanford) [15:24:43] !log dcausse@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [15:28:07] (03CR) 10Scott French: [C:03+1] Revert "envoyproxy: Provide support for UDS upstreams" [puppet] - 10https://gerrit.wikimedia.org/r/1282345 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [15:30:19] (03CR) 10Scott French: [C:03+1] Revert "envoyproxy: Add STEK configuration support" [puppet] - 10https://gerrit.wikimedia.org/r/1282346 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [15:31:19] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 387.57 ms [15:33:29] (03PS1) 10Svantje Lilienthal: VE: Avoid counting all refs when listIndex is undefined [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283033 (https://phabricator.wikimedia.org/T425433) [15:35:03] (03CR) 10Dzahn: [C:03+2] delete mwmaint.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1282430 (owner: 10Dzahn) [15:35:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283033 (https://phabricator.wikimedia.org/T425433) (owner: 10Svantje Lilienthal) [15:36:46] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [15:36:56] uh? 
[15:36:58] oh ulsfo [15:37:24] (03PS1) 10Mstyles: Set $wgReauthenticateTime editsitejs to one hour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283036 (https://phabricator.wikimedia.org/T197137) [15:37:38] !log dzahn@dns1005 START - running authdns-update [15:38:08] !log deleting mwmaint.discovery.wmnet DNS entry - the hosts behind it dont exist anymore [15:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:58] (03CR) 10Scott French: [C:03+1] Revert "envoyproxy: global_tlsparams" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282347 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [15:39:02] !log dzahn@dns1005 END - running authdns-update [15:39:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [15:39:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:41:16] (03PS1) 10DCausse: search: fix alt. completion indices to test keyword tokenizer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283037 (https://phabricator.wikimedia.org/T420427) [15:42:14] (03PS1) 10Ladsgroup: Close Finnish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283038 (https://phabricator.wikimedia.org/T421796) [15:42:16] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync [15:42:43] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [15:42:47] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:19] (03PS3) 10DCausse: search: add alt. 
completion indices to test keyword tokenizer (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) [15:44:41] jouncebot: nowandnext [15:44:41] For the next 0 hour(s) and 15 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1500) [15:44:41] In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1600) [15:44:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283038 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:45:50] (03Merged) 10jenkins-bot: Close Finnish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283038 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:46:07] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283038|Close Finnish Wikinews (T421796)]] [15:46:10] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:47:49] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 225.96 ms [15:47:50] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283038|Close Finnish Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:48:10] (03PS1) 10Ladsgroup: Close Korean Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283039 (https://phabricator.wikimedia.org/T421796) [15:48:11] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [15:49:42] jouncebot: nowandnext [15:49:42] For the next 0 hour(s) and 10 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1500) [15:49:42] In 0 hour(s) and 10 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1600) [15:49:50] ACKNOWLEDGEMENT - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2026-05-05 00:37:50 is 3 MiB, but the previous one was 3 MiB, a change of -19.9 % Jcrespo not worrying - The acknowledgement expires at: 2026-05-12 11:49:03. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:49:50] ACKNOWLEDGEMENT - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1215) taken on 2026-05-05 00:38:41 is 3 MiB, but the previous one was 3 MiB, a change of -19.9 % Jcrespo not worrying - The acknowledgement expires at: 2026-05-12 11:49:03. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:49:52] sbassett: which window are we using? [15:51:48] (03CR) 10Scott French: [C:03+1] "Thanks, Janis!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1282348 (https://phabricator.wikimedia.org/T271421) (owner: 10JMeybohm) [15:52:20] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283038|Close Finnish Wikinews (T421796)]] (duration: 06m 12s) [15:52:23] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [15:52:28] (03CR) 10Pppery: Remove role::mail::mx and related Puppet code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T352394) (owner: 10Muehlenhoff) [15:52:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283039 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:53:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:53:32] (03Merged) 10jenkins-bot: Close Korean Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283039 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [15:53:50] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283039|Close Korean Wikinews (T421796)]] [15:54:40] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [15:55:07] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [15:55:22] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [15:55:35] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283039|Close Korean Wikinews (T421796)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:55:51] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [15:56:59] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 2 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11890364 (10Epidosis) As a note: I have started a new import through QS 3.0 a few hours ago - cf. https://www.wikidata.org/wiki/Property_talk:P227#Massive_import_of_data... [15:57:30] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [15:59:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282930 (https://phabricator.wikimedia.org/T424496) (owner: 10Mpostoronca) [16:00:05] jhathaway and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:12] (03PS1) 10Ladsgroup: Close Japanese Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283040 (https://phabricator.wikimedia.org/T421796) [16:01:43] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283039|Close Korean Wikinews (T421796)]] (duration: 07m 53s) [16:01:47] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [16:02:02] (03PS1) 10DCausse: search: enable Latin-to-Devanagari transliteration second-chance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) [16:04:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283040 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [16:05:25] (03Merged) 10jenkins-bot: Close Japanese Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283040 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [16:05:32] (03PS3) 10Dzahn: tcpproxy: add support for gitlab-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) [16:05:41] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283040|Close Japanese Wikinews (T421796)]] [16:06:27] (03PS2) 10Dzahn: add load balancer IPs for gitlab to geo DNS [dns] - 10https://gerrit.wikimedia.org/r/1282436 (https://phabricator.wikimedia.org/T425441) [16:07:26] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283040|Close Japanese Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:07:29] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [16:07:47] (03PS1) 10Elukey: profile::service_proxy::envoy: fix wikifunction's configs [puppet] - 10https://gerrit.wikimedia.org/r/1283042 [16:07:50] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [16:08:12] (03PS2) 10Dzahn: add discovery names for gitlab [dns] - 10https://gerrit.wikimedia.org/r/1282437 (https://phabricator.wikimedia.org/T425441) [16:08:38] (03CR) 10Jasmine: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282999 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [16:09:21] FIRING: [18x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:55] (03PS2) 10Muehlenhoff: Remove role::mail::mx and related Puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T325394) [16:10:00] (03CR) 10Neriah: [C:03+1] search: add alt. 
completion indices to test keyword tokenizer (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [16:10:12] (03CR) 10Muehlenhoff: Remove role::mail::mx and related Puppet code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T325394) (owner: 10Muehlenhoff) [16:10:35] (03CR) 10Elukey: [C:03+2] profile::service_proxy::envoy: fix wikifunction's configs [puppet] - 10https://gerrit.wikimedia.org/r/1283042 (owner: 10Elukey) [16:10:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:11:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[16:11:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [16:11:57] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283040|Close Japanese Wikinews (T421796)]] (duration: 06m 16s) [16:12:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:12:45] (03PS1) 10Muehlenhoff: profile::postfix::mx: Mark the SMTP port as intentionally open [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) [16:15:11] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 2101.74 ms [16:15:27] (03PS4) 10CDanis: deployment_server: add kubectl wait-job plugin [puppet] - 10https://gerrit.wikimedia.org/r/1273926 [16:15:35] (03CR) 10CDanis: [C:03+2] deployment_server: add kubectl wait-job plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273926 (owner: 10CDanis) [16:16:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[16:16:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [16:18:40] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [16:19:01] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [16:19:12] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [16:19:42] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [16:20:15] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 0%, RTA = 777.03 ms [16:21:59] James_F: mesh works! [16:23:04] will send a patch tomorrow to update the orchestrator's config [16:23:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [16:23:53] (03CR) 10Alex.sanford: Set $wgReauthenticateTime editsitejs to one hour (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283036 (https://phabricator.wikimedia.org/T197137) (owner: 10Mstyles) [16:24:21] FIRING: [19x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:53] elukey: Awesome, thank you! [16:29:53] (03CR) 10Muehlenhoff: profile::postfix::mx: Mark the SMTP port as intentionally open (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [16:32:01] (03CR) 10Tjones: [C:03+1] "Looks good!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283041 (https://phabricator.wikimedia.org/T425018) (owner: 10DCausse) [16:32:56] (03CR) 10SBassett: Set $wgReauthenticateTime editsitejs to one hour (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283036 (https://phabricator.wikimedia.org/T197137) (owner: 10Mstyles) [16:33:45] (03CR) 10SBassett: [C:03+1] Set $wgReauthenticateTime editsitejs to one hour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283036 (https://phabricator.wikimedia.org/T197137) (owner: 10Mstyles) [16:34:21] FIRING: [19x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:31] jouncebot: nowandnext [16:34:31] For the next 0 hour(s) and 25 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1600) [16:34:31] In 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1700) [16:34:35] (03CR) 10SBassett: [C:03+1] Set CSP to enforce with allow-listed domains in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283020 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [16:35:14] Any objections to backporting a wmf.26 patch now? 
[16:35:27] (03PS1) 10Andrew Bogott: Revert "setup_capi.sh.erb: temporarily use external packages" [puppet] - 10https://gerrit.wikimedia.org/r/1283046 [16:35:27] (03PS1) 10Andrew Bogott: magnum: complete rename from capihelm to clusterapi [puppet] - 10https://gerrit.wikimedia.org/r/1283047 (https://phabricator.wikimedia.org/T393782) [16:35:27] (03PS1) 10Danielyepezgarces: Enabling RSS extension for cowikimedia chapter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283048 (https://phabricator.wikimedia.org/T425440) [16:36:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283036 (https://phabricator.wikimedia.org/T197137) (owner: 10Mstyles) [16:36:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283020 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [16:36:18] (03CR) 10CI reject: [V:04-1] magnum: complete rename from capihelm to clusterapi [puppet] - 10https://gerrit.wikimedia.org/r/1283047 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [16:36:24] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283047 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [16:37:16] Ah I see sbassett is deploying now [16:37:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:09] (03CR) 10Tjones: [C:03+1] search: add alt. 
completion indices to test keyword tokenizer (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [16:38:32] (03Merged) 10jenkins-bot: Set $wgReauthenticateTime editsitejs to one hour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283036 (https://phabricator.wikimedia.org/T197137) (owner: 10Mstyles) [16:38:36] (03Merged) 10jenkins-bot: Set CSP to enforce with allow-listed domains in Wikimedia production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283020 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [16:38:53] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1283036|Set $wgReauthenticateTime editsitejs to one hour (T197137)]], [[gerrit:1283020|Set CSP to enforce with allow-listed domains in Wikimedia production (T419612 T420604 T420607)]] [16:39:00] T197137: Editing sitewide JS/CSS pages should require elevated security - https://phabricator.wikimedia.org/T197137 [16:39:01] T420604: Deduplicate CSP between VCL and MediaWiki - https://phabricator.wikimedia.org/T420604 [16:39:01] T420607: Stop setting a report-only CSP - https://phabricator.wikimedia.org/T420607 [16:39:21] FIRING: [19x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:41:31] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2926.68 ms [16:42:00] (03CR) 10CDanis: [C:03+2] haproxy: webrequest: capture ratelimiting headers [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis) [16:44:28] (03PS1) 10SBassett: Remove undefined variable $wmgUseCSPReportOnlyHasSession [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283049 
(https://phabricator.wikimedia.org/T419612) [16:45:11] (03CR) 10CDanis: [C:03+2] Add x_trusted_request and x_wmf_ratelimit_class to webrequest live streams [puppet] - 10https://gerrit.wikimedia.org/r/1279489 (https://phabricator.wikimedia.org/T419736) (owner: 10Aleksandar Mastilovic) [16:46:33] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 260.85 ms [16:46:42] (03CR) 10SBassett: [C:03+1] Remove undefined variable $wmgUseCSPReportOnlyHasSession [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283049 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [16:47:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:48:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283049 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [16:48:34] (03PS1) 10Kosta Harlan: Add user_groups to editAttemptStep schema [extensions/WikimediaEvents] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283050 (https://phabricator.wikimedia.org/T424010) [16:48:44] (03PS2) 10Andrew Bogott: magnum: complete rename from capihelm to clusterapi [puppet] - 10https://gerrit.wikimedia.org/r/1283047 (https://phabricator.wikimedia.org/T393782) [16:48:44] (03PS1) 10Andrew Bogott: magnum: container-api to version 1.12.7 -- the 1.13.x doesn't work with capo [puppet] - 10https://gerrit.wikimedia.org/r/1283051 (https://phabricator.wikimedia.org/T393782) [16:50:22] (03Merged) 10jenkins-bot: Remove undefined variable $wmgUseCSPReportOnlyHasSession 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283049 (https://phabricator.wikimedia.org/T419612) (owner: 10SBassett) [16:50:36] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1283036|Set $wgReauthenticateTime editsitejs to one hour (T197137)]], [[gerrit:1283020|Set CSP to enforce with allow-listed domains in Wikimedia production (T419612 T420604 T420607)]], [[gerrit:1283049|Remove undefined variable $wmgUseCSPReportOnlyHasSession (T419612 T420604 T420607)]] [16:50:45] T197137: Editing sitewide JS/CSS pages should require elevated security - https://phabricator.wikimedia.org/T197137 [16:50:46] T420604: Deduplicate CSP between VCL and MediaWiki - https://phabricator.wikimedia.org/T420604 [16:50:47] T420607: Stop setting a report-only CSP - https://phabricator.wikimedia.org/T420607 [16:52:20] !log sbassett@deploy1003 mstyles, sbassett: Backport for [[gerrit:1283036|Set $wgReauthenticateTime editsitejs to one hour (T197137)]], [[gerrit:1283020|Set CSP to enforce with allow-listed domains in Wikimedia production (T419612 T420604 T420607)]], [[gerrit:1283049|Remove undefined variable $wmgUseCSPReportOnlyHasSession (T419612 T420604 T420607)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:53:02] (03CR) 10Andrew Bogott: [C:03+2] Revert "setup_capi.sh.erb: temporarily use external packages" [puppet] - 10https://gerrit.wikimedia.org/r/1283046 (owner: 10Andrew Bogott) [16:53:53] !log sbassett@deploy1003 mstyles, sbassett: Continuing with deployment [16:55:22] (03PS3) 10Andrew Bogott: magnum: complete rename from capihelm to clusterapi [puppet] - 10https://gerrit.wikimedia.org/r/1283047 (https://phabricator.wikimedia.org/T393782) [16:55:22] (03PS2) 10Andrew Bogott: magnum: container-api to version 1.12.7 -- the 1.13.x doesn't work with capo [puppet] - 10https://gerrit.wikimedia.org/r/1283051 (https://phabricator.wikimedia.org/T393782) [16:55:41] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283047 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [16:57:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:02] !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283036|Set $wgReauthenticateTime editsitejs to one hour (T197137)]], [[gerrit:1283020|Set CSP to enforce with allow-listed domains in Wikimedia production (T419612 T420604 T420607)]], [[gerrit:1283049|Remove undefined variable $wmgUseCSPReportOnlyHasSession (T419612 T420604 T420607)]] (duration: 07m 25s) [16:58:08] T197137: Editing sitewide JS/CSS pages should require elevated security - https://phabricator.wikimedia.org/T197137 [16:58:08] T420604: Deduplicate CSP between VCL and MediaWiki - https://phabricator.wikimedia.org/T420604 [16:58:09] T420607: Stop setting a report-only CSP - https://phabricator.wikimedia.org/T420607 [16:59:03] (03CR) 10Ssingh: [V:03+1 C:03+2] varnish: remove CSP (and Report-Only) from VCL [puppet] - 
10https://gerrit.wikimedia.org/r/1282979 (https://phabricator.wikimedia.org/T420604) (owner: 10Ssingh) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1700) [17:00:58] sbassett: rolling out [17:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [17:05:58] !log sudo cumin -b11 "A:cp and not P{cp2041* or cp2042*} and not A:ulsfo" "run-puppet-agent --enable 'merging CR 1282979'" [17:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:26] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#11890740 (10Dzahn) T425441 will continue the task of moving GitLab behind the CDN. 
[17:08:40] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1003.eqiad.wmnet with OS trixie [17:09:24] !log herron@cumin1003 START - Cookbook sre.hosts.move-vlan for host kafka-logging1003 [17:11:48] (03PS1) 10Herron: kafka-logging1003: update IP and prep for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1283055 (https://phabricator.wikimedia.org/T417001) [17:12:06] !log herron@cumin1003 START - Cookbook sre.dns.netbox [17:12:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [17:13:11] (03PS1) 10Jdlrobson: Legacy parser no longer varies by user thumbnail size. [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283056 (https://phabricator.wikimedia.org/T417513) [17:15:42] !log herron@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-logging1003 - herron@cumin1003" [17:15:48] !log herron@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-logging1003 - herron@cumin1003" [17:15:48] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:15:48] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache kafka-logging1003.eqiad.wmnet 66.48.64.10.in-addr.arpa 6.6.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:16:08] !log herron@cumin1003 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) kafka-logging1003.eqiad.wmnet 66.48.64.10.in-addr.arpa 
6.6.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:17:07] (03PS3) 10Tiziano Fogli: o11y/global: disable seasonality checks for small prom instances [alerts] - 10https://gerrit.wikimedia.org/r/1282935 (https://phabricator.wikimedia.org/T415317) [17:19:08] herron@cumin1003 reimage (PID 369380) is awaiting input [17:19:41] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache kafka-logging1003.eqiad.wmnet 66.48.64.10.in-addr.arpa 6.6.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:20:01] !log herron@cumin1003 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) kafka-logging1003.eqiad.wmnet 66.48.64.10.in-addr.arpa 6.6.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:21:14] !log herron@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging1003 [17:21:50] !log herron@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging1003 [17:21:50] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-logging1003 [17:23:18] (03CR) 10CI reject: [V:04-1] Legacy parser no longer varies by user thumbnail size. [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283056 (https://phabricator.wikimedia.org/T417513) (owner: 10Jdlrobson) [17:23:38] !log root@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'. [17:23:54] !log root@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[17:25:32] (03PS4) 10Tiziano Fogli: o11y/global: disable seasonality checks for small prom instances [alerts] - 10https://gerrit.wikimedia.org/r/1282935 (https://phabricator.wikimedia.org/T415317) [17:27:26] sbassett: VCL CSP rollout complete [17:27:32] rather, VCL CSP removal [17:27:50] (03CR) 10Herron: [C:03+1] o11y/global: disable seasonality checks for small prom instances [alerts] - 10https://gerrit.wikimedia.org/r/1282935 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [17:30:40] tx, sukhe. I think we’re looking good. [17:32:35] thanks for helping remove the duplication of the policy! [17:32:47] !log herron@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [17:33:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [17:33:20] !log herron@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[17:37:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [17:37:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:02] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1003.eqiad.wmnet with reason: host reimage [17:38:57] (03PS1) 10Dzahn: gerrit: add public key and fingerprint for ed25529 ssh host key [puppet] - 10https://gerrit.wikimedia.org/r/1283058 (https://phabricator.wikimedia.org/T240266) [17:39:36] (03CR) 10CI reject: [V:04-1] gerrit: add public key and fingerprint for ed25529 ssh host key [puppet] - 10https://gerrit.wikimedia.org/r/1283058 (https://phabricator.wikimedia.org/T240266) (owner: 10Dzahn) [17:39:44] (03PS2) 10Dzahn: gerrit: add public key and fingerprint for ed25529 ssh host key [puppet] - 10https://gerrit.wikimedia.org/r/1283058 (https://phabricator.wikimedia.org/T240266) [17:40:23] (03CR) 10CI reject: [V:04-1] gerrit: add public key and fingerprint for ed25529 ssh host key [puppet] - 10https://gerrit.wikimedia.org/r/1283058 (https://phabricator.wikimedia.org/T240266) (owner: 10Dzahn) [17:41:14] (03PS3) 10Dzahn: gerrit: add public key and fingerprint for ed25529 ssh host key [puppet] - 10https://gerrit.wikimedia.org/r/1283058 (https://phabricator.wikimedia.org/T240266) [17:41:28] PROBLEM - Host kafka-logging1003 is DOWN: PING CRITICAL - Packet 
loss = 100% [17:42:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:44:28] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1003.eqiad.wmnet with reason: host reimage [17:46:30] RECOVERY - Host kafka-logging1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [17:55:02] (03PS1) 10Ladsgroup: Close Dutch Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) [18:00:05] brennen and jeena: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T1800) [18:00:12] o/ [18:04:17] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1003.eqiad.wmnet with OS trixie [18:06:17] !log 1.47.0-wmf.1 train status (T423910): no current blockers, rolling to group0 [18:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:20] T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910 [18:06:31] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [18:08:19] (03CR) 10Ottomata: [C:03+1] airflow-main: remove obsolete hosts (from commented entry) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281587 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:09:49] (03CR) 10Andrew Bogott: [C:03+2] magnum: complete rename from capihelm to clusterapi [puppet] - 10https://gerrit.wikimedia.org/r/1283047 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [18:09:49] (03CR) 
10Andrew Bogott: [C:03+2] magnum: container-api to version 1.12.7 -- the 1.13.x doesn't work with capo [puppet] - 10https://gerrit.wikimedia.org/r/1283051 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [18:09:49] (03PS1) 10TrainBranchBot: testwikis to 1.47.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283062 (https://phabricator.wikimedia.org/T423910) [18:09:49] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283062 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [18:09:49] (03CR) 10Ottomata: [C:03+1] "@snwachukwu@wikimedia.org" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281589 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:10:02] (03CR) 10Ottomata: [C:03+2] airflow-main: remove obsolete hosts (from commented entry) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281587 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:10:15] (03Merged) 10jenkins-bot: testwikis to 1.47.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283062 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [18:10:37] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device pfw1a-eqiad [18:10:47] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device pfw1a-eqiad [18:11:31] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [18:11:32] !log brennen@deploy1003 Started scap sync-world: testwikis to 1.47.0-wmf.1 refs T423910 [18:11:35] T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910 [18:12:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:12:15] (03Merged) 10jenkins-bot: airflow-main: remove obsolete hosts (from commented entry) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281587 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:12:31] (03PS1) 10Ladsgroup: Close Swedish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283063 (https://phabricator.wikimedia.org/T421796) [18:12:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [18:13:20] !log pt1979@cumin1003 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS trixie [18:13:34] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie [18:13:51] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device pfw1a-codfw [18:14:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device pfw1a-codfw [18:15:10] (03PS3) 10Eevans: _aqs2-common_: updated aqs node list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281589 (https://phabricator.wikimedia.org/T412830) [18:17:34] brennen: if it's okay (and it won't interfere) can I push closing of a couple of wikinews wikis? [18:18:34] Amir1: remind me what that entails? train's chugging along and may be a bit. 
[18:18:41] (03CR) 10Jforrester: [C:03+1] "Let's land this after the new release tomorrow." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1275996 (https://phabricator.wikimedia.org/T404652) (owner: 10David Martin) [18:19:11] it should be quite harmless, it basically changes the wiki's config to be editable only by stewards basically [18:19:24] (example https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1283063) [18:19:31] example config change https://integration.wikimedia.org/ci/job/operations-mw-config-php83-composer-diffConfig/1626/console [18:19:33] (03PS1) 10Andrew Bogott: profile::openstack::capi: don't pass in helm repo [puppet] - 10https://gerrit.wikimedia.org/r/1283064 [18:19:34] I can wait [18:21:59] (03CR) 10Andrew Bogott: [C:03+2] profile::openstack::capi: don't pass in helm repo [puppet] - 10https://gerrit.wikimedia.org/r/1283064 (owner: 10Andrew Bogott) [18:22:00] Amir1: no objections once scap finishes, i will probably let .1 bake in on testwikis for a bit. [18:22:18] sounds good to me [18:24:21] FIRING: [17x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:25:48] !log pt1979@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS trixie [18:26:00] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with err... 
[18:30:24] !log pt1979@cumin1003 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS trixie [18:30:35] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891359 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie [18:40:00] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:43:43] (03CR) 10Neriah: [C:03+1] Close Dutch Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283061 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:44:06] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cp4038 ip address - pt1979@cumin2002" [18:44:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cp4038 ip address - pt1979@cumin2002" [18:44:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:22] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:44:31] !log pt1979@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS trixie [18:44:42] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891449 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with err... 
[18:45:33] (03CR) 10Neriah: [C:03+1] Close Swedish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283063 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:46:00] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 3 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11891450 (10Rosalie_WMDE) [18:46:05] (03CR) 10Codename Noreste: "On the description, please add the following:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (owner: 10Nvdtn19) [18:47:36] !log brennen@deploy1003 Finished scap sync-world: testwikis to 1.47.0-wmf.1 refs T423910 (duration: 36m 04s) [18:47:39] T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910 [18:48:27] Amir1: over to you if you'd like to roll out those config changes; going to let this sit at testwikis while i get a bite of lunch. [18:48:38] Thanks! [18:48:45] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cp4038 ip address - pt1979@cumin2002" [18:48:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update cp4038 ip address - pt1979@cumin2002" [18:48:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283063 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:49:25] !log pt1979@cumin1003 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS trixie [18:49:38] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891474 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for 
host cp4038.ulsfo.wmnet with OS trixie [18:51:28] (03Merged) 10jenkins-bot: Close Swedish Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283063 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [18:52:00] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283063|Close Swedish Wikinews (T421796)]] [18:52:03] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [18:55:11] 06SRE, 06Infrastructure-Foundations, 10netops: Packet loss on eqsin OOB CCT via IPv6 - https://phabricator.wikimedia.org/T425471 (10cmooney) 03NEW p:05Triage→03Medium [18:55:56] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283063|Close Swedish Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:56:14] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [19:02:59] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283063|Close Swedish Wikinews (T421796)]] (duration: 10m 59s) [19:03:02] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [19:03:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [19:04:19] !log pt1979@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4038.ulsfo.wmnet with OS trixie [19:04:32] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with err... [19:05:20] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set correct vlan group in netbox for new ulsfo vlans - cmooney@cumin1003 - T408892" [19:05:23] T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892 [19:05:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set correct vlan group in netbox for new ulsfo vlans - cmooney@cumin1003 - T408892" [19:06:29] Amir1: good, or you have another one to do? [19:07:04] I don't have anything right now [19:07:13] the clean up scripts are still running on those wikis [19:07:17] cool, resuming train stuff. [19:07:58] (03PS1) 10TrainBranchBot: group0 to 1.47.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283065 (https://phabricator.wikimedia.org/T423910) [19:08:00] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283065 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [19:08:51] (03Merged) 10jenkins-bot: group0 to 1.47.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283065 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [19:09:01] (03PS1) 10Ladsgroup: Close Persian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283066 (https://phabricator.wikimedia.org/T421796) [19:11:07] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:11:15] (03PS1) 10Ayounsi: network/data.yaml: add new ulsfo ranges 
[puppet] - 10https://gerrit.wikimedia.org/r/1283070 (https://phabricator.wikimedia.org/T408892) [19:11:43] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283070 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [19:14:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:56] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.47.0-wmf.1 refs T423910 [19:14:59] T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910 [19:15:21] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: rebooting firewall in desperation [19:15:32] 06SRE, 06Infrastructure-Foundations, 10netops: Packet loss on eqsin OOB CCT via IPv6 - https://phabricator.wikimedia.org/T425471#11891576 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e811f42a-7bc8-4cee-b558-794852157c2b) set by cmooney@cumin1003 for 0:30:00 on 6 host(s) and their servi... [19:15:40] (03PS1) 10DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283074 (https://phabricator.wikimedia.org/T329991) [19:15:53] !log dancy@deploy1003 Installing scap version "4.262.0" for 2 host(s) [19:17:45] !log dancy@deploy1003 Installation of scap version "4.262.0" completed for 2 hosts [19:17:51] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1283070 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [19:18:05] (03CR) 10Ayounsi: [C:03+2] network/data.yaml: add new ulsfo ranges [puppet] - 10https://gerrit.wikimedia.org/r/1283070 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [19:20:17] (03CR) 10DDesouza: [C:03+2] miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283074 (https://phabricator.wikimedia.org/T329991) (owner: 10DDesouza) [19:20:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:22:03] (03PS1) 10Andrew Bogott: magnum-container-api: Install ORC CRD before setting up clusterctl [puppet] - 10https://gerrit.wikimedia.org/r/1283079 (https://phabricator.wikimedia.org/T393782) [19:22:15] PROBLEM - Host ps1-604-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [19:22:15] PROBLEM - Host ps1-603-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [19:22:23] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [19:22:41] (03Merged) 10jenkins-bot: miscweb(design-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283074 (https://phabricator.wikimedia.org/T329991) (owner: 10DDesouza) [19:24:35] (03CR) 10Andrew Bogott: [C:03+2] magnum-container-api: Install ORC CRD before setting up clusterctl [puppet] - 10https://gerrit.wikimedia.org/r/1283079 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [19:27:45] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1002.eqiad.wmnet with OS trixie [19:27:57] RECOVERY - Host ps1-604-eqsin is UP: PING OK - Packet loss = 0%, RTA = 224.19 ms [19:27:57] RECOVERY - Host ps1-603-eqsin is UP: PING OK - Packet loss = 0%, RTA = 224.59 ms [19:27:59] 
RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 223.59 ms [19:28:20] !log herron@cumin1003 START - Cookbook sre.hosts.move-vlan for host kafka-logging1002 [19:29:21] FIRING: [10x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:29:25] FIRING: [3x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:30:59] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:31:12] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:31:13] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:31:23] herron@cumin1003 reimage (PID 387447) is awaiting input [19:31:25] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:31:26] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:31:43] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:32:16] (03PS1) 10Herron: kafka-logging1002: update IP and prep for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1283081 (https://phabricator.wikimedia.org/T417001) [19:32:42] !log herron@cumin1003 START - Cookbook sre.dns.netbox [19:33:14] (03PS1) 10Neriah: Enable WikiLove on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283082 (https://phabricator.wikimedia.org/T424891) [19:33:34] (03PS1) 10Bking: data-platform: Add alerts for cirrus memory or I/O stalls [alerts] - 10https://gerrit.wikimedia.org/r/1283083 (https://phabricator.wikimedia.org/T424852) [19:34:04] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Tuesday, May 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283082 (https://phabricator.wikimedia.org/T424891) (owner: 10Neriah) [19:35:01] (03PS1) 10DDesouza: miscweb(design-strategy): remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283084 (https://phabricator.wikimedia.org/T329991) [19:35:14] FIRING: [2x] ProbeDown: Service people1005:30443 has failed probes (http_design_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:36:46] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [19:36:48] (03PS1) 10CDanis: Add a maintenance script cron secret [labs/private] - 10https://gerrit.wikimedia.org/r/1283085 [19:37:32] (03CR) 10CDanis: [V:03+2 C:03+2] Add a maintenance script cron secret [labs/private] - 10https://gerrit.wikimedia.org/r/1283085 (owner: 10CDanis) [19:37:38] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283084 (https://phabricator.wikimedia.org/T329991) (owner: 10DDesouza) [19:38:21] (03PS11) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [19:38:37] !log herron@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-logging1002 - herron@cumin1003" [19:38:48] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1271028 
(https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [19:39:06] !log herron@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-logging1002 - herron@cumin1003" [19:39:06] !log herron@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:39:06] !log herron@cumin1003 START - Cookbook sre.dns.wipe-cache kafka-logging1002.eqiad.wmnet 142.32.64.10.in-addr.arpa 2.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:39:27] !log herron@cumin1003 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) kafka-logging1002.eqiad.wmnet 142.32.64.10.in-addr.arpa 2.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:39:40] !log herron@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging1002 [19:39:59] (03Merged) 10jenkins-bot: miscweb(design-strategy): remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283084 (https://phabricator.wikimedia.org/T329991) (owner: 10DDesouza) [19:40:36] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:40:37] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:40:53] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:40:55] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:40:57] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:40:59] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:41:00] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:41:03] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:41:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:41:27] !log herron@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging1002 [19:41:27] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-logging1002 [19:45:08] !log pt1979@cumin1003 START - Cookbook sre.hosts.reimage for host cp4038.ulsfo.wmnet with OS trixie [19:45:20] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie [19:45:43] FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-eqsin and mr1-eqsin (103.102.166.143) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:46:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:51:31] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [19:53:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [19:53:55] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 204796024 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:54:34] !log root@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'. [19:54:49] !log root@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [19:54:55] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3422752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:55:20] !log herron@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [19:55:54] !log herron@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. [19:56:30] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [19:57:35] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: host reimage [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T2000). [20:00:04] manfredi, arlolra, AaronSchulz, Mpostoronca, HakanIST, and Neriah: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] Hi :), I am around [20:00:10] (03PS1) 10DDesouza: miscweb(design-landing-page): bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283090 (https://phabricator.wikimedia.org/T329991) [20:00:13] hi [20:00:19] I am around [20:00:22] o/ [20:00:30] Hello [20:02:50] o/ [20:03:27] manfredi: you are first but it looks like you have i18n changes that could take a while [20:03:49] I think so, are you the deployer ? [20:04:05] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: host reimage [20:04:05] not "the" deployer but a deployer [20:04:14] Ok thanks [20:04:26] maybe you should go last? [20:04:39] I think you can start merging mine because i had a problem with the CI [20:04:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission aqs10[10-12,14-15].eqiad.wmnet - https://phabricator.wikimedia.org/T425357#11891755 (10Jclark-ctr) [20:04:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission aqs10[10-12,14-15].eqiad.wmnet - https://phabricator.wikimedia.org/T425357#11891756 (10Jclark-ctr) 05Open→03Resolved [20:05:38] sorry, I was asking if you mind going at the end [20:05:55] (03CR) 10DDesouza: [C:03+2] miscweb(design-landing-page): bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283090 (https://phabricator.wikimedia.org/T329991) (owner: 10DDesouza) [20:06:40] Ok [20:06:49] thanks, then I will get started [20:06:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281526 (https://phabricator.wikimedia.org/T336703) (owner: 10SomeRandomDeveloper) [20:07:38] !log pt1979@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage [20:08:00] arlolra: mine can be combined with others [20:08:05] * AaronSchulz can 
also go alone too [20:08:15] (03Merged) 10jenkins-bot: miscweb(design-landing-page): bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283090 (https://phabricator.wikimedia.org/T329991) (owner: 10DDesouza) [20:09:05] ugh, there's an issue with my patch so lets go to AaronSchulz [20:09:07] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [20:09:12] arlolra: ok [20:09:22] maybe combine with Neriah [20:09:22] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [20:09:24] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [20:09:35] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [20:09:37] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [20:09:51] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [20:10:13] arlolra: I don't object 😉 [20:10:21] great [20:10:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283082 (https://phabricator.wikimedia.org/T424891) (owner: 10Neriah) [20:10:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276814 (https://phabricator.wikimedia.org/T422403) (owner: 10Aaron Schulz) [20:11:03] you can combine the hCaptcha log changes with any other deploy if it helps [20:11:29] (03Merged) 10jenkins-bot: Enable WikiLove on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283082 (https://phabricator.wikimedia.org/T424891) (owner: 10Neriah) [20:11:32] (03Merged) 10jenkins-bot: Add wikibase.v1 module to the sandbox were it is present [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276814 (https://phabricator.wikimedia.org/T422403) (owner: 10Aaron Schulz) [20:11:59] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1283082|Enable WikiLove 
on shwiki (T424891)]], [[gerrit:1276814|Add wikibase.v1 module to the sandbox were it is present (T422403)]] [20:12:04] T424891: Enable WikiLove on shwiki - https://phabricator.wikimedia.org/T424891 [20:12:05] T422403: Create Wikibase v1 REST API Module - https://phabricator.wikimedia.org/T422403 [20:13:56] !log arlolra@deploy1003 aaron, neriah, arlolra: Backport for [[gerrit:1283082|Enable WikiLove on shwiki (T424891)]], [[gerrit:1276814|Add wikibase.v1 module to the sandbox were it is present (T422403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:14:29] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4038.ulsfo.wmnet with reason: host reimage [20:14:54] (03PS4) 10Arlolra: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [20:15:34] AaronSchulz: Neriah: testing? [20:16:04] it's fine [20:16:47] I don't see the extension on shwiki [20:16:55] did I miss something? [20:17:42] Are you using WikimediaDebug? [20:17:47] I see it there [20:18:16] I'll proceed [20:18:19] !log arlolra@deploy1003 aaron, neriah, arlolra: Continuing with deployment [20:18:54] I see now that I had a problem with it, and now it's fine. you can continue [20:19:01] thanks [20:19:13] great. Thank you! 
[20:20:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:20:26] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1002.eqiad.wmnet with OS trixie [20:20:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [20:21:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:22:02] 06SRE, 10DNS, 06Traffic: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Anthropic - https://phabricator.wikimedia.org/T424785#11891832 (10CDobbins) 05In progress→03Resolved [20:22:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:22:30] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283082|Enable WikiLove on shwiki (T424891)]], [[gerrit:1276814|Add wikibase.v1 module to the sandbox were it is present (T422403)]] (duration: 10m 30s) [20:22:35] T424891: Enable WikiLove on shwiki - https://phabricator.wikimedia.org/T424891 [20:22:35] T422403: Create Wikibase v1 REST API Module - 
https://phabricator.wikimedia.org/T422403 [20:23:18] HakanIST: can I combine yours with Mpostoronca? [20:23:22] sure [20:24:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282930 (https://phabricator.wikimedia.org/T424496) (owner: 10Mpostoronca) [20:24:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [20:27:29] (03CR) 10CI reject: [V:04-1] hCaptcha: Add diagnostic context to script load error logs [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282930 (https://phabricator.wikimedia.org/T424496) (owner: 10Mpostoronca) [20:28:12] what does that mean ? [20:28:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:28:28] The change '1282930' failed build tests and could not be merged [20:29:14] The error looks transient [20:29:16] stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/WikimediaMessages/': GnuTLS recv error (-54): Error in the pull function.' 
[20:29:20] https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/32473/console [20:29:21] FIRING: [14x] JobUnavailable: Reduced availability for job fifo_log_demux in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:29:24] Let's try again [20:29:46] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:30:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282930 (https://phabricator.wikimedia.org/T424496) (owner: 10Mpostoronca) [20:30:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [20:30:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [20:30:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [20:31:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools 
are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:34:21] FIRING: [14x] JobUnavailable: Reduced availability for job fifo_log_demux in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:34:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [20:34:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:34:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:35:24] (03Merged) 10jenkins-bot: sectionCollapsing: Scroll to fragment target on init [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282397 (https://phabricator.wikimedia.org/T425290) (owner: 10HakanIST) [20:36:46] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:37:24] RECOVERY 
- PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:39:21] FIRING: [11x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:39:58] is this bad? [20:40:27] (03Merged) 10jenkins-bot: hCaptcha: Add diagnostic context to script load error logs [extensions/ConfirmEdit] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282930 (https://phabricator.wikimedia.org/T424496) (owner: 10Mpostoronca) [20:41:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:41:24] !log pt1979@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4038.ulsfo.wmnet with OS trixie [20:41:37] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11891874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie completed: - cp40... 
[20:41:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:44:21] FIRING: [11x] JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:44:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [20:44:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [20:45:40] (03Merged) 10jenkins-bot: Errors added below ref list dirty when not responsive [extensions/Cite] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1282804 (https://phabricator.wikimedia.org/T384599) (owner: 10Awight) [20:46:12] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1282930|hCaptcha: Add diagnostic context to script load error logs (T424496)]], [[gerrit:1282397|sectionCollapsing: Scroll to fragment target on init (T425290)]], [[gerrit:1282804|Errors added below ref list dirty 
when not responsive (T384599)]] [20:46:19] T424496: Unable to load hCaptcha script - https://phabricator.wikimedia.org/T424496 [20:46:19] T425290: Section links (HTML #anchors) don't work on mobile + Parsoid + minerva + specific screen sizes - https://phabricator.wikimedia.org/T425290 [20:46:19] T384599: Errors in refs defined in a references tag are not reported correctly in the Parsoid implementation of Cite - https://phabricator.wikimedia.org/T384599 [20:46:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:46:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:48:09] !log arlolra@deploy1003 mpostoronca, h2o, awight, arlolra: Backport for [[gerrit:1282930|hCaptcha: Add diagnostic context to script load error logs (T424496)]], [[gerrit:1282397|sectionCollapsing: Scroll to fragment target on init (T425290)]], [[gerrit:1282804|Errors added below ref list dirty when not responsive (T384599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:48:30] Mpostoronca: HakanIST: please test [20:49:11] arlolra: am I next? [20:49:33] manfredi: yes [20:50:08] Hopefully Readers don't need their window because we'll bleed into that [20:51:10] looks good, thank you [20:52:24] Mpostoronca: can I proceed? [20:52:55] I assume so since it's just diagnostic [20:52:59] !log arlolra@deploy1003 mpostoronca, h2o, awight, arlolra: Continuing with deployment [20:53:46] yes, can't see any logs, probably we'll have to wait for some online error, so it's okay [20:53:58] I'll check again tomorrow morning [20:54:17] thanks [20:54:21] thank you [20:54:59] arlolra: Can I slip in a scap update after your current deployment (before you move on to the next)? It's a fix for the spiderpig job log viewer.
It takes about 2 minutes [20:55:12] sure [20:55:16] thx [20:55:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:55:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:57:11] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1282930|hCaptcha: Add diagnostic context to script load error logs (T424496)]], [[gerrit:1282397|sectionCollapsing: Scroll to fragment target on init (T425290)]], [[gerrit:1282804|Errors added below ref list dirty when not responsive (T384599)]] (duration: 10m 59s) [20:57:19] T424496: Unable to load hCaptcha script - https://phabricator.wikimedia.org/T424496 [20:57:19] T425290: Section links (HTML #anchors) don't work on mobile + Parsoid + minerva + specific screen sizes - https://phabricator.wikimedia.org/T425290 [20:57:20] T384599: Errors in refs defined in a references tag are not reported correctly in the Parsoid implementation of Cite - https://phabricator.wikimedia.org/T384599 [20:57:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:57:28] dancy: just ping when you're done [20:57:42] !log dancy@deploy1003 Installing scap version "4.262.1" for 2 host(s) [20:57:49] Will do. 
[20:59:34] !log dancy@deploy1003 Installation of scap version "4.262.1" completed for 2 hosts [20:59:45] arlolra: Back to you [20:59:52] thanks [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260505T2100) [21:00:20] Is anyone here from Readers to use their window? [21:00:21] Hey arlolra let me know when you are done :) [21:00:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:00:41] Oh, Jdlrobson the last backport has some i18n so it could be a while [21:00:44] arlolra: im actually backporting for content transformers today :) https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1283056 [21:00:48] ^^ checking on the WDQS stuff now [21:00:59] (03CR) 10Jdlrobson: "recheck" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283056 (https://phabricator.wikimedia.org/T417513) (owner: 10Jdlrobson) [21:01:25] (i'd rather this went out 2 days early in case we want to revert it) [21:01:46] can I combine that with the last backport?
[21:03:27] arlolra: sure thing [21:03:32] great [21:04:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [21:04:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283056 (https://phabricator.wikimedia.org/T417513) (owner: 10Jdlrobson) [21:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [21:06:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:06:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:08:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [21:11:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:12:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:15:27] (03Merged) 10jenkins-bot: Email confirmation banner: Remove obsolete 
arm_b variant [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1281501 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [21:15:47] (03Merged) 10jenkins-bot: Legacy parser no longer varies by user thumbnail size. [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1283056 (https://phabricator.wikimedia.org/T417513) (owner: 10Jdlrobson) [21:16:13] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1281501|Email confirmation banner: Remove obsolete arm_b variant (T421366)]], [[gerrit:1283056|Legacy parser no longer varies by user thumbnail size. (T417513)]] [21:16:18] T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366 [21:16:18] T417513: Switch to CSS/JS solution for thumbnail size in legacy parser - https://phabricator.wikimedia.org/T417513 [21:16:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:18:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [21:19:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:22:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:22:24] 
RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:23:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [21:23:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:25:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [21:26:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:28:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[21:28:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:30:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:30:44] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-eqiad and (185.15.58.139) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:31:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:33:52] !log arlolra@deploy1003 jdlrobson, mmartorana, arlolra: Backport for [[gerrit:1281501|Email confirmation banner: Remove obsolete arm_b variant (T421366)]], [[gerrit:1283056|Legacy parser no longer varies by user thumbnail size. (T417513)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:33:57] T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366 [21:33:57] T417513: Switch to CSS/JS solution for thumbnail size in legacy parser - https://phabricator.wikimedia.org/T417513 [21:34:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [21:34:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:34:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [21:35:28] Jdlrobson: manfredi: please test [21:35:34] ok [21:35:43] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-eqiad and (185.15.58.139) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:36:00] arlolra: LGTM [21:36:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:36:24] RECOVERY - 
PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:36:26] All good! Thanks [21:36:35] !log arlolra@deploy1003 jdlrobson, mmartorana, arlolra: Continuing with deployment [21:36:38] thanks [21:38:07] (03CR) 10JHathaway: [C:03+2] Remove role::mail::mx and related Puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T325394) (owner: 10Muehlenhoff) [21:38:28] (03CR) 10JHathaway: [C:03+1] Remove role::mail::mx and related Puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T325394) (owner: 10Muehlenhoff) [21:39:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:40:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:41:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:43:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:44:02] arlolra: are we done? 
[21:44:22] Still sync'ing [21:44:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:44:41] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488 (10phaultfinder) 03NEW [21:46:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:48:05] (03PS1) 10Dzahn: microsites: update regex for content check of design.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1283096 (https://phabricator.wikimedia.org/T425476) [21:48:34] please ping me once you're done. Thanks! [21:49:08] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281501|Email confirmation banner: Remove obsolete arm_b variant (T421366)]], [[gerrit:1283056|Legacy parser no longer varies by user thumbnail size. (T417513)]] (duration: 32m 55s) [21:49:13] T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366 [21:49:13] T417513: Switch to CSS/JS solution for thumbnail size in legacy parser - https://phabricator.wikimedia.org/T417513 [21:49:30] manfredi: done. Sorry that took so long, thank you for your patience [21:49:46] Thank you! Appreciate it [21:50:41] Amir1: I'm not sure who the you is but I'm done with Spiderpig now [21:51:43] Thanks! 
[21:53:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283066 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [21:54:19] (03Merged) 10jenkins-bot: Close Persian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283066 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [21:54:50] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283066|Close Persian Wikinews (T421796)]] [21:54:54] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [21:55:50] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [21:55:55] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:58:55] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283066|Close Persian Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:59:15] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [22:02:06] (03CR) 10JHathaway: [C:03+1] sre.hardware.upgrade-firmware: remove unused code [cookbooks] - 10https://gerrit.wikimedia.org/r/1282356 (https://phabricator.wikimedia.org/T425327) (owner: 10Elukey) [22:04:59] (03PS1) 10Ladsgroup: Close Serbian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283099 (https://phabricator.wikimedia.org/T421796) [22:05:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[22:05:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [22:05:58] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283066|Close Persian Wikinews (T421796)]] (duration: 11m 07s) [22:06:01] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [22:06:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:06:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:08:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283099 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:08:36] (03CR) 10JHathaway: [C:03+1] d-i: Remove dhcpcd-base after installation completed [puppet] - 10https://gerrit.wikimedia.org/r/1280082 (https://phabricator.wikimedia.org/T414341) (owner: 10Muehlenhoff) [22:09:06] (03Merged) 10jenkins-bot: Close Serbian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283099 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:09:31] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283099|Close Serbian Wikinews (T421796)]] [22:09:34] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11892173 (10phaultfinder) [22:11:28] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283099|Close Serbian Wikinews (T421796)]] synced to the testservers 
(see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:11:31] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [22:12:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:12:08] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [22:12:09] (03CR) 10Dzahn: [C:03+2] microsites: update regex for content check of design.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1283096 (https://phabricator.wikimedia.org/T425476) (owner: 10Dzahn) [22:13:06] (03PS1) 10Ladsgroup: Close Romanian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283100 (https://phabricator.wikimedia.org/T421796) [22:13:30] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [22:16:16] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283099|Close Serbian Wikinews (T421796)]] (duration: 06m 45s) [22:16:46] (03PS1) 10WMDE-Fisch: VE: Avoid counting all refs when listIndex is undefined [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283101 (https://phabricator.wikimedia.org/T425433) [22:17:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283100 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:17:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 06 UTC morning backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Cite] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1283101 (https://phabricator.wikimedia.org/T425433) (owner: 10WMDE-Fisch) [22:18:28] (03Merged) 10jenkins-bot: Close Romanian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283100 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:18:56] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283100|Close Romanian Wikinews (T421796)]] [22:19:03] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [22:20:51] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283100|Close Romanian Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:22:42] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [22:25:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:25:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:26:12] (03CR) 10Dzahn: [C:03+2] gerrit: add public key and fingerprint for ed25529 ssh host key [puppet] - 10https://gerrit.wikimedia.org/r/1283058 (https://phabricator.wikimedia.org/T240266) (owner: 10Dzahn) 
[22:26:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:26:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:26:52] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283100|Close Romanian Wikinews (T421796)]] (duration: 07m 56s) [22:26:55] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [22:30:17] (03PS1) 10Ladsgroup: Close Ukrainian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283103 (https://phabricator.wikimedia.org/T421796) [22:35:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283103 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:36:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:36:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:36:47] (03Merged) 10jenkins-bot: Close Ukrainian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283103 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:37:13] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283103|Close Ukrainian Wikinews (T421796)]] [22:37:16] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [22:37:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - 
All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:37:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:39:08] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283103|Close Ukrainian Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:39:34] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [22:41:43] (03PS1) 10Ladsgroup: Close Arabic Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283104 (https://phabricator.wikimedia.org/T421796) [22:43:41] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283103|Close Ukrainian Wikinews (T421796)]] (duration: 06m 28s) [22:43:45] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [22:46:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283104 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:47:27] (03Merged) 10jenkins-bot: Close Arabic Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1283104 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [22:47:51] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1283104|Close Arabic Wikinews (T421796)]] [22:48:39] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11892268 (10Papaul) @RobH see below the list of node still on 10G DAC that We will need to move to 25G DAC. Can you please order 7x2m 25G DAC? Thank you A:p... [22:49:46] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1283104|Close Arabic Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there.
[22:49:49] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:50:15] FIRING: [2x] ProbeDown: Service people1005:30443 has failed probes (http_design_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:50:36] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[22:54:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:54:49] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1283104|Close Arabic Wikinews (T421796)]] (duration: 06m 58s)
[22:54:52] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[22:55:14] RESOLVED: [2x] ProbeDown: Service people1005:30443 has failed probes (http_design_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:55:18] re: "FIRING: [2x] ProbeDown: Service people1005:30443" this is about design.wikimedia.org and something that went wrong with the latest deploy.
[22:55:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:55:32] just found out it must be https://phabricator.wikimedia.org/T329991
[22:56:17] ... there is a certain irony that the issue presents itself as design.wikimedia.org loading without CSS
[22:56:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:56:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:57:02] (PS1) Bartosz Dziewoński: Remove unused 'writeapi' right [mediawiki-config] - https://gerrit.wikimedia.org/r/1283106
[22:57:11] it gives https://motherfuckingwebsite.com
[22:59:04] perryprog: the good part is how our monitoring actually caught this because it checks content and not status code. sophisticated, eh :)
[22:59:26] https://phabricator.wikimedia.org/T425476 is https://phabricator.wikimedia.org/T329991#11891867
[22:59:54] Yeah that's quite nice! It didn't page though, right? (Though, should it?)
[23:00:20] oh, I see
[23:00:33] no, it shouldn't, because it made a whole phab ticket.
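Note on the 22:59:04 remark that the probe "checks content and not status code": a minimal sketch of that idea, assuming a hypothetical `probe_ok` helper and a made-up body pattern. It is not the configuration of the real `http_design_wikimedia_org_ip4` probe, which is not shown in this log.

```python
import re
import urllib.request
from typing import Callable, Tuple


def fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Return (status_code, body_text) for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def probe_ok(
    url: str,
    body_regex: str,
    fetcher: Callable[[str], Tuple[int, str]] = fetch,
) -> bool:
    """Pass only if the response is 200 AND the body matches body_regex.

    A page that loads with status 200 but is missing expected content
    (for example its stylesheet link) still fails the probe.
    """
    try:
        status, body = fetcher(url)
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
    return status == 200 and re.search(body_regex, body) is not None
```

The `fetcher` parameter exists only so the check can be exercised without network access; the pattern passed as `body_regex` is whatever string the healthy page is expected to contain.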
[23:00:35] that's sick
[23:00:40] nah, but it creates automatic tickets
[23:00:46] yea:)
[23:11:07] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:12:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:13:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:13:26] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11892315 (Jclark-ctr) a:Jclark-ctr
[23:14:45] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11892321 (Jclark-ctr) T425159 Came back will check balance in morning
[23:16:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:16:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:20:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:20:24] RECOVERY - PyBal backends health check on
lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:23:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:23:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:26:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[23:29:40] FIRING: [3x] SystemdUnitFailed: opensearch_2@.service.d.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:30:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:30:30] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update ip addresses for nodes in rack 23 - pt1979@cumin2002"
[23:30:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update ip addresses for nodes in rack 23 - pt1979@cumin2002"
[23:30:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:34:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet,
wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:36:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:36:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:36:46] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs4009:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[23:36:56] (PS1) Papaul: Mode OOB on mr1-ulsfo to port 7 [homer/public] - https://gerrit.wikimedia.org/r/1283111 (https://phabricator.wikimedia.org/T421674)
[23:38:36] (PS2) Papaul: Move OOB on mr1-ulsfo to port 7 [homer/public] - https://gerrit.wikimedia.org/r/1283111 (https://phabricator.wikimedia.org/T421674)
[23:40:17] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1283112
[23:40:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1283112 (owner: TrainBranchBot)
[23:41:56] (CR) Papaul: [C:+2] Move OOB on mr1-ulsfo to port 7 [homer/public] - https://gerrit.wikimedia.org/r/1283111 (https://phabricator.wikimedia.org/T421674) (owner: Papaul)
[23:45:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:45:24] PROBLEM - PyBal backends health check
on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:46:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:46:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:49:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:49:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:51:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:51:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:53:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1283112 (owner: TrainBranchBot)