[00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:59] (CR) Zabe: [C:+2] Undeploy GoogleNewsSitemap [mediawiki-config] - https://gerrit.wikimedia.org/r/1277783 (https://phabricator.wikimedia.org/T421798) (owner: Zabe)
[00:08:02] (Merged) jenkins-bot: Undeploy GoogleNewsSitemap [mediawiki-config] - https://gerrit.wikimedia.org/r/1277783 (https://phabricator.wikimedia.org/T421798) (owner: Zabe)
[00:10:02] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1277783|Undeploy GoogleNewsSitemap (T421798)]]
[00:10:06] T421798: Undeploy GoogleNewsSitemap after 2026-05-04 - https://phabricator.wikimedia.org/T421798
[00:23:55] FIRING: MaxConntrack: Elevated conntrack usage on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:24:06] (PS3) Zabe: Drop some unneeded wikinews configs [mediawiki-config] - https://gerrit.wikimedia.org/r/1281485 (https://phabricator.wikimedia.org/T421796)
[00:26:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:28:55] RESOLVED: MaxConntrack: Elevated conntrack usage on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:29:05] !log zabe@deploy1003 zabe: Backport for [[gerrit:1277783|Undeploy GoogleNewsSitemap (T421798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:29:08] T421798: Undeploy GoogleNewsSitemap after 2026-05-04 - https://phabricator.wikimedia.org/T421798
[00:31:25] !log zabe@deploy1003 zabe: Continuing with deployment
[00:33:54] ops-esams, SRE, Commons, DC-Ops, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11896751 (ssingh) Yeah, no, I was wrong and I misunderstood the problem. I misread that the actual image also differs across the CDN but it clearly...
[00:39:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:43:57] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277783|Undeploy GoogleNewsSitemap (T421798)]] (duration: 33m 54s)
[00:44:00] T421798: Undeploy GoogleNewsSitemap after 2026-05-04 - https://phabricator.wikimedia.org/T421798
[00:51:33] (PS2) Zabe: Remove custom user groups from Wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/1281491 (https://phabricator.wikimedia.org/T423578)
[01:01:29] (CR) Zabe: [C:+2] Drop some unneeded wikinews configs [mediawiki-config] - https://gerrit.wikimedia.org/r/1281485 (https://phabricator.wikimedia.org/T421796) (owner: Zabe)
[01:01:38] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1021.eqiad.wmnet with OS trixie
[01:01:47] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11896764 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie executed with errors: - pc1021 (**FAIL**...
[01:02:26] (Merged) jenkins-bot: Drop some unneeded wikinews configs [mediawiki-config] - https://gerrit.wikimedia.org/r/1281485 (https://phabricator.wikimedia.org/T421796) (owner: Zabe)
[01:02:57] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1281485|Drop some unneeded wikinews configs (T421796)]]
[01:03:00] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[01:03:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[01:09:17] !log zabe@deploy1003 zabe: Backport for [[gerrit:1281485|Drop some unneeded wikinews configs (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:09:21] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[01:09:37] !log zabe@deploy1003 zabe: Continuing with deployment
[01:09:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1284073
[01:09:56] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1284073 (owner: TrainBranchBot)
[01:10:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:11:57] SRE: wiki.openstreetmap.org Commons thumbs rate limit allowance - https://phabricator.wikimedia.org/T423570#11896774 (BBlack) a:jcrespo
[01:15:55] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1281485|Drop some unneeded wikinews configs (T421796)]] (duration: 12m 57s)
[01:15:58] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796
[01:19:23] SRE, Patch-For-Review: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11896780 (BBlack) a:elukey
[01:25:02] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1284073 (owner: TrainBranchBot)
[01:27:26] (PS8) Cwhite: opensearch: move pki::get_cert call into profile module [puppet] - https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204)
[01:27:26] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/1284024/8533/" [puppet] - https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204) (owner: Cwhite)
[01:50:03] SRE, Pageviews-Anomaly, Traffic: "Nahui Ollin" is enwiki's #1 article. Never heard of it? That's the problem - https://phabricator.wikimedia.org/T425600#11896822 (Nemoralis) Mobile percentage is 0.5%
[02:01:09] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:44] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 35s)
[02:09:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:17] SRE, Pageviews-Anomaly, Traffic: "Nahui Ollin" is enwiki's #1 article. Never heard of it? That's the problem - https://phabricator.wikimedia.org/T425600#11896825 (AlexisJazz) I wonder if T420833 was a test run for this.
[02:14:38] ops-esams, SRE, Commons, DC-Ops, and 3 others: ESAMS serving an older revision of some overwritten files - https://phabricator.wikimedia.org/T425216#11896827 (AlexisJazz) >>! In T425216#11896751, @ssingh wrote: > Yeah, no, I was wrong and I misunderstood the problem. I misread that the actual ima...
[02:15:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:20:07] ops-esams, SRE, Commons, DC-Ops, and 3 others: ESAMS and others serving older revisions of overwritten files - https://phabricator.wikimedia.org/T425216#11896828 (AlexisJazz)
[02:21:32] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: No response from remote host 185.15.59.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:24:24] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 185.15.59.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:31:16] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11896831 (Papaul)
[02:34:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:55] FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:17:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ...
[04:17:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[04:28:16] !incidents
[04:28:16] 7913 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw)
[04:31:02] (PS1) Snwachukwu: Edit Analytics base image bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284179 (https://phabricator.wikimedia.org/T425310)
[04:45:31] ops-codfw, SRE, DBA, DC-Ops: db2208 Backplane 0 error - https://phabricator.wikimedia.org/T425516#11896963 (Marostegui) Thank you! I can access the host again.
[04:45:55] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2208.codfw.wmnet with OS trixie
[04:46:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2204 with weight 0 T424848', diff saved to https://phabricator.wikimedia.org/P92381 and previous config saved to /var/cache/conftool/dbconfig/20260507-044651-marostegui.json
[04:46:54] T424848: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T424848
[04:47:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T424848
[04:48:27] (CR) Marostegui: [C:+2] mariadb: Promote db2204 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1279346 (https://phabricator.wikimedia.org/T424848) (owner: Gerrit maintenance bot)
[04:51:01] !log Starting s2 codfw failover from db2207 to db2204 - T424848
[04:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2204 to s2 primary T424848', diff saved to https://phabricator.wikimedia.org/P92382 and previous config saved to /var/cache/conftool/dbconfig/20260507-045141-marostegui.json
[04:52:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2207 T424848', diff saved to https://phabricator.wikimedia.org/P92383 and previous config saved to /var/cache/conftool/dbconfig/20260507-045219-marostegui.json
[04:52:22] T424848: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T424848
[04:52:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ...
[04:52:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[04:54:16] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[04:55:16] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[04:56:10] (PS1) Marostegui: db2207: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1284189 (https://phabricator.wikimedia.org/T424615)
[04:58:16] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:01:08] (CR) Marostegui: [C:+2] db2207: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1284189 (https://phabricator.wikimedia.org/T424615) (owner: Marostegui)
[05:01:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2207.codfw.wmnet with reason: Reimage to Trixie
[05:01:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2207: Reimage to Trixie
[05:01:37] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2207: Reimage to Trixie
[05:03:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2207.codfw.wmnet with OS trixie
[05:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[05:04:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2208.codfw.wmnet with reason: host reimage
[05:06:23] PROBLEM - SSH on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:06:23] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:41] PROBLEM - Exim SMTP on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Exim
[05:09:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2208.codfw.wmnet with reason: host reimage
[05:10:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:11:14] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 04 Aug 2026 03:33:57 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:11:14] RECOVERY - SSH on lists1004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:11:36] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 04 Aug 2026 03:33:57 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[05:23:01] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2207.codfw.wmnet with reason: host reimage
[05:25:55] (PS1) Marostegui: Revert "db2207: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1284210
[05:28:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2207.codfw.wmnet with reason: host reimage
[05:33:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2208.codfw.wmnet with OS trixie
[05:33:49] SRE, Pageviews-Anomaly, Traffic: "Nahui Ollin" is enwiki's #1 article. Never heard of it? That's the problem - https://phabricator.wikimedia.org/T425600#11896993 (Bugreporter2) Is it a hoax?
[05:38:01] (PS1) Marostegui: db2208: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1284330 (https://phabricator.wikimedia.org/T425388)
[05:38:11] (CR) Marostegui: [C:+2] Revert "db2207: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1284210 (owner: Marostegui)
[05:47:10] (CR) Ayounsi: "If we add the `pint disable promql/series` can we simplify the `expr` back? I worry about maintainability of such query." [alerts] - https://gerrit.wikimedia.org/r/1283993 (owner: Cathal Mooney)
[05:51:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2207.codfw.wmnet with OS trixie
[05:52:20] SRE, Pageviews-Anomaly, Traffic: "Nahui Ollin" is enwiki's #1 article. Never heard of it? That's the problem - https://phabricator.wikimedia.org/T425600#11897035 (AlexisJazz) >>! In T425600#11896993, @Bugreporter2 wrote: > Is it a hoax? Doesn't look like a hoax. One of the sources is "An Inglorious...
[05:54:11] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2207: after reimage to trixie
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600)
[06:00:05] marostegui, Amir1, and federico3: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0600).
[06:15:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:37:19] (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204) (owner: Cwhite)
[06:39:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2207: after reimage to trixie
[06:40:55] FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:41:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo02 and group 01
[06:42:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo02 and group 01
[06:46:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir4003.ulsfo.wmnet to drbd
[06:46:29] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.changedisk (exit_code=99) for changing disk type of ncredir4003.ulsfo.wmnet to drbd
[06:48:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir4003.ulsfo.wmnet to drbd
[07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:54] (CR) Cathal Mooney: "No I don’t think so, we’d still get the “label mismatch” lint alerts in that case (if the total_pkts series is there but either of the dro" [alerts] - https://gerrit.wikimedia.org/r/1283993 (owner: Cathal Mooney)
[07:02:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir4003.ulsfo.wmnet to drbd
[07:02:40] PROBLEM - Host ncredir4003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:03:02] RECOVERY - Host ncredir4003 is UP: PING OK - Packet loss = 0%, RTA = 71.53 ms
[07:23:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2005.codfw.wmnet
[07:26:55] (PS1) Ayounsi: ganeti.addnode: fix netbox PuppetDB debug logging [cookbooks] - https://gerrit.wikimedia.org/r/1284440
[07:27:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2005.codfw.wmnet
[07:30:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow4003.ulsfo.wmnet to drbd
[07:32:17] !log installing apache2 security updates
[07:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:35:44] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:36:21] jouncebot: nowandnext
[07:36:21] For the next 0 hour(s) and 23 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T0700)
[07:36:22] In 2 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1000)
[07:36:47] (PS4) DCausse: search: add alt. completion indices to test keyword tokenizer (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427)
[07:37:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[07:39:21] FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:39:53] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[07:43:28] (Merged) jenkins-bot: search: add alt. completion indices to test keyword tokenizer (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) (owner: DCausse)
[07:43:35] (CR) Muehlenhoff: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1284440 (owner: Ayounsi)
[07:44:20] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1269465|search: add alt. completion indices to test keyword tokenizer (2/2) (T420427)]]
[07:44:24] T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427
[07:44:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow4003.ulsfo.wmnet to drbd
[07:45:47] (CR) Ayounsi: [C:+2] ganeti.addnode: fix netbox PuppetDB debug logging [cookbooks] - https://gerrit.wikimedia.org/r/1284440 (owner: Ayounsi)
[07:46:22] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1269465|search: add alt. completion indices to test keyword tokenizer (2/2) (T420427)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:48:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:49:21] RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:49:51] (Merged) jenkins-bot: ganeti.addnode: fix netbox PuppetDB debug logging [cookbooks] - https://gerrit.wikimedia.org/r/1284440 (owner: Ayounsi)
[07:49:52] !log dcausse@deploy1003 dcausse: Continuing with deployment
[07:50:12] (CR) Tiziano Fogli: [C:+1] "The alert rule itself works fine, but the additional Pint lint check performs extra validations to ensure that all referenced metrics exis" [alerts] - https://gerrit.wikimedia.org/r/1283993 (owner: Cathal Mooney)
[07:50:44] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:53:16] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[07:53:29] (CR) Nikerabbit: translate: add opensearch-ttmserver-test (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[07:54:06] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269465|search: add alt. completion indices to test keyword tokenizer (2/2) (T420427)]] (duration: 09m 46s)
[07:54:10] T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427
[07:55:26] (CR) Elukey: [C:+2] profile::kafka::mirror::alerts: fix max lag's group label [puppet] - https://gerrit.wikimedia.org/r/1283837 (owner: Elukey)
[07:59:08] (PS1) TheDJ: Remove the progress bar [extensions/3D] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1284547
[08:03:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[08:03:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[08:04:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
[08:04:55] sre-alert-triage, Data-Platform-SRE, ServiceOps new: Alert in need of triage: Kafka MirrorMaker main-codfw_to_main-eqiad dropped message count in last 30m (instance alert1002) - https://phabricator.wikimedia.org/T425339#11897212 (elukey) All resolved now!
[08:05:05] (PS1) Elukey: wikifunctions: move evaluator calls to mesh in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1284551 (https://phabricator.wikimedia.org/T424193)
[08:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:45] (CR) Elukey: profile::cache::haproxy: add webrequest-based ip reputation data (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: Elukey)
[08:07:38] (CR) Elukey: [C:+2] wikifunctions: move evaluator calls to mesh in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1284551 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[08:10:49] (CR) Cathal Mooney: [C:+2] team-netops: CoreRouterInterfaceDropPercent - ingore missing series [alerts] - https://gerrit.wikimedia.org/r/1283993 (owner: Cathal Mooney)
[08:12:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
[08:12:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[08:12:47] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync
[08:12:54] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync
[08:12:56] (Merged) jenkins-bot: team-netops: CoreRouterInterfaceDropPercent - ingore missing series [alerts] - https://gerrit.wikimedia.org/r/1283993 (owner: Cathal Mooney)
[08:13:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[08:14:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir4004.ulsfo.wmnet to drbd
[08:17:14] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:17:35] (PS1) STran: Enable staggered rollout for IRS on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1284553 (https://phabricator.wikimedia.org/T424008)
[08:20:18] (CR) Atsuko: translate: add opensearch-ttmserver-test (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[08:21:38] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: drmrs v6 gateway IPs change - ayounsi@cumin1003"
[08:21:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[08:21:49] (CR) Marostegui: [C:+2] db2208: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1284330 (https://phabricator.wikimedia.org/T425388) (owner: Marostegui)
[08:22:17] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2208: After reimage
[08:22:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: drmrs v6 gateway IPs change - ayounsi@cumin1003"
[08:22:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:22:30] (PS8) Atsuko: translate: add opensearch-ttmserver-test [mediawiki-config] - https://gerrit.wikimedia.org/r/1283711 (https://phabricator.wikimedia.org/T425377)
[08:23:02] !log drmrs remove old v6 gateway IP
[08:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:07] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2208: After reimage
[08:23:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2208: After reimage
[08:24:23] (PS1) Marostegui: db2144: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1284554
[08:25:04] (CR) CI reject: [V:-1] db2144: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1284554 (owner: Marostegui)
[08:26:12] (PS2) Marostegui: db2144: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1284554 (https://phabricator.wikimedia.org/T425522)
[08:26:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[08:27:20] (CR) Marostegui: [C:+2] db2144: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1284554 (https://phabricator.wikimedia.org/T425522) (owner: Marostegui)
[08:28:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove db2144 T425522', diff saved to https://phabricator.wikimedia.org/P92389 and previous config saved to /var/cache/conftool/dbconfig/20260507-082822-marostegui.json
[08:28:27] T425522: decommission db2144.codfw.wmnet - https://phabricator.wikimedia.org/T425522
[08:28:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir4004.ulsfo.wmnet to drbd
[08:28:57] PROBLEM - Host ncredir4004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:13] RECOVERY - Host ncredir4004 is UP: PING OK - Packet loss = 0%, RTA = 71.50 ms
[08:29:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus4003.ulsfo.wmnet to drbd
[08:30:41] (PS1) Marostegui: mariadb: Decommission db2144 [puppet] - https://gerrit.wikimedia.org/r/1284555 (https://phabricator.wikimedia.org/T418979)
[08:31:50] (PS1) Elukey: wikifunctions: move orchestrator -> eval calls to mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1284556 (https://phabricator.wikimedia.org/T424193)
[08:32:08] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2144.codfw.wmnet
[08:32:17] (CR) Marostegui: [C:+2] mariadb: Decommission db2144 [puppet] - https://gerrit.wikimedia.org/r/1284555 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[08:33:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[08:34:32] (CR) Elukey: [C:+1] kafka-main: set codfw brokers inter-broker protocol to 3.7 [puppet] - https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419216) (owner: Jasmine)
[08:35:44] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:37:19] SRE: Move Kafka mirror monitors and alerts to the alerts repo - https://phabricator.wikimedia.org/T425621 (elukey) NEW
[08:37:31] !log marostegui@cumin1003 START -
Cookbook sre.dns.netbox [08:42:26] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2144.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [08:42:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2144.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [08:42:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:42:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2144.codfw.wmnet [08:44:08] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2144.codfw.wmnet - https://phabricator.wikimedia.org/T425522#11897346 (10Marostegui) a:05Marostegui→03Jhancock.wm [08:44:08] (03PS1) 10Ayounsi: Remove asw2-ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1284558 (https://phabricator.wikimedia.org/T408892) [08:44:16] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2144.codfw.wmnet - https://phabricator.wikimedia.org/T425522#11897352 (10Marostegui) Ready for DC-Ops [08:46:37] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1284559 (https://phabricator.wikimedia.org/T425622) [08:46:43] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1284560 (https://phabricator.wikimedia.org/T425622) [08:46:58] (03CR) 10Giuseppe Lavagetto: [C:04-1] "LGTM but there's a potential race condition when you switch the hiera key on." 
[puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [08:47:59] (03PS1) 10Ayounsi: asw2-ulsfo: remove from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284561 (https://phabricator.wikimedia.org/T408892) [08:48:40] (03PS2) 10Ayounsi: asw2-ulsfo: remove from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284561 (https://phabricator.wikimedia.org/T408892) [08:49:41] (03PS1) 10Marostegui: db1202,db2182: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1284563 (https://phabricator.wikimedia.org/T425388) [08:50:20] (03CR) 10Marostegui: [C:03+2] db1202,db2182: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1284563 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [08:50:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1202.eqiad.wmnet with reason: Reimage to Trixie [08:50:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1202: Reimage to Trixie [08:51:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2182.codfw.wmnet with reason: Reimage to Trixie [08:51:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2182: Reimage to Trixie [08:51:10] (03CR) 10Cathal Mooney: [C:03+1] mr1-ulsfo: remove device specific security_zones definition [homer/public] - 10https://gerrit.wikimedia.org/r/1283503 (https://phabricator.wikimedia.org/T421674) (owner: 10Ayounsi) [08:51:16] (03CR) 10Cathal Mooney: [C:03+1] Remove asw2-ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1284558 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:51:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2182: Reimage to Trixie [08:51:32] (03CR) 10Ayounsi: [C:03+2] mr1-ulsfo: remove device specific security_zones definition [homer/public] - 
10https://gerrit.wikimedia.org/r/1283503 (https://phabricator.wikimedia.org/T421674) (owner: 10Ayounsi) [08:51:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1202: Reimage to Trixie [08:51:43] (03CR) 10Cathal Mooney: [C:03+1] asw2-ulsfo: remove from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284561 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:51:44] (03CR) 10Ayounsi: [C:03+2] Remove asw2-ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1284558 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:52:41] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1202.eqiad.wmnet with OS trixie [08:52:47] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2182.codfw.wmnet with OS trixie [08:52:52] (03Merged) 10jenkins-bot: mr1-ulsfo: remove device specific security_zones definition [homer/public] - 10https://gerrit.wikimedia.org/r/1283503 (https://phabricator.wikimedia.org/T421674) (owner: 10Ayounsi) [08:53:00] (03Merged) 10jenkins-bot: Remove asw2-ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1284558 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:54:27] (03CR) 10Ayounsi: [C:03+2] asw2-ulsfo: remove from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284561 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:56:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [09:04:18] (03PS2) 10STran: Enable staggered rollout for IRS on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284553 
(https://phabricator.wikimedia.org/T424008) [09:04:31] PROBLEM - Check correctness of the icinga configuration on alert1002 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [09:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [09:06:30] (03PS7) 10Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) [09:07:01] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1202.eqiad.wmnet with reason: host reimage [09:08:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2208: After reimage [09:11:05] (03PS11) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [09:11:07] (03CR) 10Mszwarc: [C:03+1] Enable staggered rollout for IRS on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284553 (https://phabricator.wikimedia.org/T424008) (owner: 10STran) [09:11:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast4006.wikimedia.org [09:11:30] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11897447 (10ops-monitoring-bot) VM bast4006.wikimedia.org rebooted by jmm@cumin2002 with reason: None [09:11:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2182.codfw.wmnet with reason: host reimage [09:11:58] (03PS12) 10Tiziano Fogli: 
logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [09:14:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1202.eqiad.wmnet with reason: host reimage [09:15:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast4006.wikimedia.org [09:17:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284553 (https://phabricator.wikimedia.org/T424008) (owner: 10STran) [09:17:39] (03PS1) 10STran: Fix when user is considered exposed to the feature in the experiment [extensions/ReportIncident] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284569 (https://phabricator.wikimedia.org/T424075) [09:18:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ReportIncident] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284569 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [09:18:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2182.codfw.wmnet with reason: host reimage [09:19:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast4006.wikimedia.org with OS trixie [09:21:53] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11897467 (10MoritzMuehlenhoff) [09:22:06] (03Abandoned) 10Elukey: Test pki1002 on ganeti-test [puppet] - 10https://gerrit.wikimedia.org/r/1254242 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [09:24:32] !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts sretest1006.eqiad.wmnet [09:25:21] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1006.eqiad.wmnet [09:25:33] (03CR) 10Elukey: [C:03+2] sre.hardware.upgrade-firmware: remove unused code [cookbooks] - 10https://gerrit.wikimedia.org/r/1282356 (https://phabricator.wikimedia.org/T425327) (owner: 10Elukey) [09:26:26] (03CR) 10Daniel Kinzler: rest-gateway: generalize class overrides (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) (owner: 10Daniel Kinzler) [09:26:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [09:28:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus4003.ulsfo.wmnet to drbd [09:28:27] PROBLEM - Host prometheus4003 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:51] RECOVERY - Host prometheus4003 is UP: PING OK - Packet loss = 0%, RTA = 71.41 ms [09:29:41] (03CR) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [09:30:26] (03CR) 10Mszwarc: [C:03+1] Fix when user is considered exposed to the feature in the experiment [extensions/ReportIncident] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284569 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [09:31:08] (03PS1) 10Esanders: Revert "Enable mobile editor abandonment survey on enwiki" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1284575 (https://phabricator.wikimedia.org/T424102) [09:32:16] (03PS8) 10Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) [09:32:16] (03PS13) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [09:32:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of hcaptcha-proxy4003.wikimedia.org to drbd [09:32:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:32:55] (03CR) 10Daniel Kinzler: [C:04-1] "CR-1, holding back from deployment. This is part of the test system refactor, it can wait. That will give me time polish the rough edges." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [09:33:31] (03CR) 10Daniel Kinzler: [C:04-1] "Thanks, I'll push this back to give me some time to polish it a bit more." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [09:34:16] (03CR) 10Daniel Kinzler: "But you didn't +1 :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [09:35:44] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:35:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [09:36:27] (03PS1) 10Marostegui: Revert "db1202,db2182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1284577 [09:37:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1202.eqiad.wmnet with OS trixie [09:38:30] (03PS4) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) [09:39:21] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:39:44] (03CR) 10Marostegui: [C:03+2] Revert "db1202,db2182: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1284577 (owner: 10Marostegui) [09:39:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1202: after reimage to trixie [09:40:15] (03PS3) 10Daniel Kinzler: rest-gateway: add anon-app 
ratelimit class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) [09:41:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2182.codfw.wmnet with OS trixie [09:42:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of hcaptcha-proxy4003.wikimedia.org to drbd [09:42:47] (03PS2) 10Elukey: sre.network: handle dry-run outputs in run_junos_commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1269011 [09:43:01] PROBLEM - Host hcaptcha-proxy4003 is DOWN: PING CRITICAL - Packet loss = 100% [09:43:19] RECOVERY - Host hcaptcha-proxy4003 is UP: PING OK - Packet loss = 0%, RTA = 71.71 ms [09:44:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job mtail in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:44:50] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2182: after reimage to trixie [09:45:01] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:46:01] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [09:49:36] (03CR) 10Elukey: sre.network: handle dry-run outputs in run_junos_commands (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1269011 (owner: 10Elukey) [09:49:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast4006.wikimedia.org with reason: host reimage [09:52:26] (03PS13) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 
(https://phabricator.wikimedia.org/T413448) [09:53:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11897551 (10MoritzMuehlenhoff) [09:53:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast4006.wikimedia.org with reason: host reimage [09:53:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11897552 (10MoritzMuehlenhoff) [09:54:14] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy5001.wikimedia.org [09:54:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [09:58:16] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 (owner: 10Daniel Kinzler) [09:58:21] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: generalize class overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) (owner: 10Daniel Kinzler) [09:59:03] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1000) [10:00:40] FIRING: [10x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:00:43] (03Merged) 10jenkins-bot: rest-gateway: 
generalize class overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278376 (https://phabricator.wikimedia.org/T424828) (owner: 10Daniel Kinzler) [10:00:46] (03Merged) 10jenkins-bot: rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 (owner: 10Daniel Kinzler) [10:01:59] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:02:59] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:03:28] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:03:54] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:04:09] (03Abandoned) 10Elukey: envoyproxy: rebuild envoy.yaml when the placeholder is created [puppet] - 10https://gerrit.wikimedia.org/r/1275827 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [10:04:32] (03CR) 10Elukey: [V:03+2 C:03+2] otelcol: upgrade to Trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1282989 (https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [10:04:51] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:06:05] 07sre-alert-triage, 06Data-Platform-SRE, 06ServiceOps new: Alert in need of triage: Kafka MirrorMaker main-codfw_to_main-eqiad dropped message count in last 30m (instance alert1002) - https://phabricator.wikimedia.org/T425339#11897577 (10JMeybohm) 05Open→03Resolved a:03JMeybohm [10:07:00] jmm@cumin2002 decommission (PID 3572474) is 
awaiting input [10:07:16] (03CR) 10JMeybohm: [C:03+1] kafka-main: set codfw brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [10:09:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:10:23] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:10:49] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:10:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast4006.wikimedia.org with OS trixie [10:11:23] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:11:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:11:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:11:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy5001.wikimedia.org [10:11:41] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11897610 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `hcaptcha-proxy5001.wikimedia.org` - hcaptcha-proxy5001.wikim... 
[10:12:15] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 04 Aug 2026 03:33:57 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:13:58] !log rebalance ganti cluster in ulsfo following host reimages T424686 [10:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:01] T424686: ulsfo switch work May 2026: Host reimaging - https://phabricator.wikimedia.org/T424686 [10:14:37] !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: New switch configuration, T408892] [10:14:40] T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892 [10:14:45] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo [reason: New switch configuration, T408892] [10:16:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11897624 (10MoritzMuehlenhoff) [10:16:49] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy5002.wikimedia.org [10:17:24] (03PS4) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) [10:18:28] (03CR) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:20:38] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:20:40] FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:21:02] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:21:39] !log 
jmm@cumin2002 START - Cookbook sre.dns.netbox [10:23:06] (03PS1) 10Marostegui: db1227,db2168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1284589 (https://phabricator.wikimedia.org/T425388) [10:23:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=ulsfo - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:24:21] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:25:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1202: after reimage to trixie [10:25:49] (03CR) 10Marostegui: [C:03+2] db1227,db2168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1284589 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [10:25:52] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: add anon-app ratelimit class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) (owner: 10Daniel Kinzler) [10:26:03] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [10:26:11] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: add more known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 (owner: 10Daniel Kinzler) [10:26:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy5002.wikimedia.org decommissioned, removing all IPs except the 
asset tag one - jmm@cumin2002" [10:26:23] (03CR) 10CI reject: [V:04-1] rest gateway: add more known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 (owner: 10Daniel Kinzler) [10:26:24] (03CR) 10CI reject: [V:04-1] rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [10:26:25] (03CR) 10CI reject: [V:04-1] rest-gateway: add anon-app ratelimit class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) (owner: 10Daniel Kinzler) [10:26:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1227.eqiad.wmnet with reason: Reimage to Trixie [10:26:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1227: Reimage to Trixie [10:27:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1227: Reimage to Trixie [10:28:24] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1227.eqiad.wmnet with OS trixie [10:28:45] FIRING: CirrusStreamingUpdaterUnknownErrors: CirrusSearch consumer-cloudelastic@eqiad is failing write requests because of unknown errors - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterUnknownErrors [10:29:21] jmm@cumin2002 decommission (PID 3589892) is awaiting input [10:30:11] (03PS5) 10Daniel Kinzler: rest gateway: add more known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 [10:30:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2182: after reimage to trixie [10:30:44] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-eqiad and (185.15.58.139) - group Confed_drmrs - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:30:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2168.codfw.wmnet with reason: Reimage to Trixie [10:30:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2168: Reimage to Trixie [10:31:00] (03PS5) 10Daniel Kinzler: rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) [10:31:08] (03PS4) 10Daniel Kinzler: rest-gateway: add anon-app ratelimit class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) [10:31:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2168: Reimage to Trixie [10:31:45] (03CR) 10Daniel Kinzler: [C:03+2] "try again" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 (owner: 10Daniel Kinzler) [10:32:00] (03CR) 10Daniel Kinzler: [C:03+2] "try again" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [10:32:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:32:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:32:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy5002.wikimedia.org [10:32:23] (03CR) 10Daniel Kinzler: [C:03+2] "try again" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) (owner: 10Daniel Kinzler) [10:32:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed 
Ganeti - https://phabricator.wikimedia.org/T421863#11897668 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `hcaptcha-proxy5002.wikimedia.org` - hcaptcha-proxy5002.wikim... [10:32:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2168.codfw.wmnet with OS trixie [10:33:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1048.eqiad.wmnet with reason: Maintenance [10:33:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling es1048 (T419961)', diff saved to https://phabricator.wikimedia.org/P92405 and previous config saved to /var/cache/conftool/dbconfig/20260507-103349-fceratto.json [10:33:51] (03Merged) 10jenkins-bot: rest gateway: add more known cg-nat addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273510 (owner: 10Daniel Kinzler) [10:34:02] (03Merged) 10jenkins-bot: rest gateway: defined anon-mediawiki class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282884 (https://phabricator.wikimedia.org/T425390) (owner: 10Daniel Kinzler) [10:34:48] (03CR) 10Cathal Mooney: [C:03+2] QoS: Map packets marked with DSCP CS1 into low-prirority class [homer/public] - 10https://gerrit.wikimedia.org/r/1279334 (https://phabricator.wikimedia.org/T424640) (owner: 10Cathal Mooney) [10:35:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install5004.wikimedia.org [10:35:03] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:35:47] (03Merged) 10jenkins-bot: rest-gateway: add anon-app ratelimit class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282908 (https://phabricator.wikimedia.org/T425391) (owner: 10Daniel Kinzler) [10:36:13] (03Merged) 10jenkins-bot: QoS: Map packets marked with DSCP CS1 into low-prirority class [homer/public] - 10https://gerrit.wikimedia.org/r/1279334 (https://phabricator.wikimedia.org/T424640) (owner: 10Cathal Mooney) [10:37:45] FIRING: CirrusConsumerFetchErrorRate: 
cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:38:45] RESOLVED: CirrusStreamingUpdaterUnknownErrors: CirrusSearch consumer-cloudelastic@eqiad is failing write requests because of unknown errors - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterUnknownErrors [10:39:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11897709 (10MoritzMuehlenhoff) [10:39:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5004.wikimedia.org - jmm@cumin2002" [10:39:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5004.wikimedia.org - jmm@cumin2002" [10:39:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:39:17] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5004.wikimedia.org on all recursors [10:39:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install5004.wikimedia.org on all recursors [10:39:40] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:40:19] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:40:40] RESOLVED: [12x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:40:54] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:40:56] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2184.codfw.wmnet with reason: reimage [10:41:07] (03CR) 10Gmodena: [C:03+1] openjdk-25-jre/openjdk-25-jdk (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1283027 (owner: 10Trueg) [10:42:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1227.eqiad.wmnet with reason: host reimage [10:42:50] (03CR) 10Trueg: openjdk-25-jre/openjdk-25-jdk (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1283027 (owner: 10Trueg) [10:44:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048 (T419961)', diff saved to https://phabricator.wikimedia.org/P92406 and previous config saved to /var/cache/conftool/dbconfig/20260507-104359-fceratto.json [10:44:20] (03PS1) 10Ladsgroup: Close Russian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284592 (https://phabricator.wikimedia.org/T421796) [10:44:33] jouncebot: nowandnext [10:44:33] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1000) [10:44:33] In 1 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1200) [10:45:25] jmm@cumin2002 makevm (PID 3602538) is awaiting input [10:45:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284592 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [10:46:19] (03PS2) 10Trueg: openjdk-25-jre/openjdk-25-jdk [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) [10:46:58] (03Merged) 10jenkins-bot: Close Russian Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284592 (https://phabricator.wikimedia.org/T421796) (owner: 10Ladsgroup) [10:47:26] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1284592|Close Russian Wikinews (T421796)]] [10:47:34] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [10:47:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:47:43] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:48:08] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:48:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1227.eqiad.wmnet with reason: host reimage [10:48:54] !log root@cumin1003 START - Cookbook sre.hosts.reimage for host db2184.codfw.wmnet with OS trixie [10:49:11] !log root@cumin1003 START - Cookbook sre.hosts.move-vlan for host db2184 [10:49:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:49:23] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1284592|Close Russian Wikinews (T421796)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:49:56] (03PS14) 10Daniel Kinzler: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) [10:51:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2168.codfw.wmnet with reason: host reimage [10:51:54] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [10:54:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048', diff saved to https://phabricator.wikimedia.org/P92407 and previous config saved to /var/cache/conftool/dbconfig/20260507-105407-fceratto.json [10:54:57] !log root@cumin1003 START - Cookbook sre.dns.netbox [10:55:02] jmm@cumin2002 makevm (PID 3602538) is awaiting input [10:55:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2168.codfw.wmnet with reason: host reimage [10:56:02] (03CR) 10Muehlenhoff: openjdk-25-jre/openjdk-25-jdk (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: 10Trueg) [10:56:06] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284592|Close Russian Wikinews (T421796)]] (duration: 08m 40s) [10:56:09] T421796: Close 31 editions of Wikinews on 2026-05-04 (make them read-only) - https://phabricator.wikimedia.org/T421796 [10:56:25] (03CR) 10Muehlenhoff: openjdk-25-jre/openjdk-25-jdk (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: 10Trueg) [10:57:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM install5004.wikimedia.org - jmm@cumin2002" [10:57:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove 
records for VM install5004.wikimedia.org - jmm@cumin2002" [10:57:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:57:26] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5004.wikimedia.org on all recursors [10:57:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install5004.wikimedia.org on all recursors [10:57:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install5004.wikimedia.org [10:57:44] !log root@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:57:44] !log root@cumin1003 START - Cookbook sre.dns.wipe-cache db2184.codfw.wmnet 129.32.192.10.in-addr.arpa 9.2.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:57:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:57:48] !log root@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2184.codfw.wmnet 129.32.192.10.in-addr.arpa 9.2.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:57:49] !log root@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db2184 [10:57:53] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled: ml-staging-ctrl_6443: Servers ml-staging-ctrl2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:58:07] !log root@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for 
host db2184 [10:58:07] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host db2184 [10:58:47] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:58:53] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:59:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install5004.wikimedia.org [10:59:06] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:59:10] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:01:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [11:01:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:03:25] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5004.wikimedia.org - jmm@cumin2002" [11:03:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install5004.wikimedia.org - jmm@cumin2002" [11:03:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:03:31] (03PS1) 10Marostegui: Revert "db1227,db2168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1284594 [11:03:31] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install5004.wikimedia.org on all recursors [11:03:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install5004.wikimedia.org on all recursors 
[11:03:54] (03CR) 10Daniel Kinzler: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [11:04:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install5004.wikimedia.org - jmm@cumin2002" [11:04:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install5004.wikimedia.org - jmm@cumin2002" [11:04:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048', diff saved to https://phabricator.wikimedia.org/P92409 and previous config saved to /var/cache/conftool/dbconfig/20260507-110415-fceratto.json [11:04:33] RECOVERY - Check correctness of the icinga configuration on alert1002 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [11:04:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [11:05:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11897933 (10MoritzMuehlenhoff) [11:06:42] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for an-presto1006.mgmt:22 - https://phabricator.wikimedia.org/T425590#11897946 (10Jclark-ctr) a:03Jclark-ctr [11:07:02] <_joe_> !ack [11:07:03] 7916 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [11:07:08] jmm@cumin2002 makevm (PID 3619268) is awaiting input [11:07:24] looking [11:07:33] federico3: I'm here too [11:07:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install5004.wikimedia.org with OS bookworm [11:08:07] 06SRE, 
10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11897962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install5004.wikimedia.org with OS bookworm [11:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11897967 (10phaultfinder) [11:09:56] (03CR) 10Muehlenhoff: [C:03+2] Switch install7002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283022 (owner: 10Muehlenhoff) [11:09:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [11:10:11] <_joe_> uh well [11:10:14] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [11:10:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1227.eqiad.wmnet with OS trixie [11:11:24] (03CR) 10Marostegui: [C:03+2] Revert "db1227,db2168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1284594 (owner: 10Marostegui) [11:11:26] !log instaling modsecurity-apache security updates [11:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:17] (03Merged) 10jenkins-bot: rest gateway: rate limits for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [11:13:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1227: after reimage to trixie [11:14:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance es1048 
(T419961)', diff saved to https://phabricator.wikimedia.org/P92412 and previous config saved to /var/cache/conftool/dbconfig/20260507-111424-fceratto.json [11:15:52] !log root@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2184.codfw.wmnet with reason: host reimage [11:15:53] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:16:14] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:17:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2168.codfw.wmnet with OS trixie [11:19:51] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2184.codfw.wmnet with reason: host reimage [11:20:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2168: after reimage to trixie [11:21:22] !incidents [11:21:22] 7916 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [11:21:22] 7913 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [11:21:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[11:21:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:22:31] (03PS1) 10Muehlenhoff: Switch install6003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1284600 [11:31:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/3D] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284547 (owner: 10TheDJ) [11:32:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install7002.wikimedia.org [11:33:19] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for an-presto1006.mgmt:22 - https://phabricator.wikimedia.org/T425590#11898084 (10Jclark-ctr) 05Open→03Resolved relocated cable to different port [11:33:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:35:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275467 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [11:35:59] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:36:19] !log daniel@deploy1003 helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply [11:39:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install7002.wikimedia.org [11:39:49] ACKNOWLEDGEMENT - MariaDB Replica Lag: backup1-codfw on db2184 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2975.87 seconds Jcrespo reimage - The acknowledgement expires at: 2026-05-11 11:39:32. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:40:45] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2184.codfw.wmnet with OS trixie [11:43:15] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:43:46] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:46:19] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1205.eqiad.wmnet with reason: reimage [11:47:54] !log root@cumin1003 START - Cookbook sre.hosts.reimage for host db1205.eqiad.wmnet with OS trixie [11:48:44] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11898179 (10MoritzMuehlenhoff) [11:49:17] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#11898183 (10MoritzMuehlenhoff) [11:49:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#11898185 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done for quite a while now. [11:50:02] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11898188 (10Jclark-ctr) Rebalanced pdu again found a psu not power cable not secured in pdu causing imbalance Sensor: Line, AA:L3, Current Value: 12.64 A (curre... 
[11:50:14] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T425488#11898189 (10Jclark-ctr) 05Open→03Resolved [11:55:57] (03CR) 10Ayounsi: [C:03+1] Switch install6003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1284600 (owner: 10Muehlenhoff) [11:57:31] (03CR) 10Ayounsi: [C:03+1] "thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1269011 (owner: 10Elukey) [11:58:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1227: after reimage to trixie [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1200) [12:00:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on install5004.wikimedia.org with reason: host reimage [12:02:15] !log root@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1205.eqiad.wmnet with reason: host reimage [12:02:32] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:02:48] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:03:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [12:05:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install5004.wikimedia.org with reason: host reimage [12:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2168: after reimage to trixie [12:07:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [12:07:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:08:13] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1205.eqiad.wmnet with reason: host reimage [12:11:03] !log slyngshede@cumin1003 conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=ulsfo,service=authdns-update [reason: ulsfo switch refresh T408892] [12:11:07] T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892 [12:11:10] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:11:32] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:11:34] !log slyngshede@dns1004 START - running authdns-update [12:12:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[12:12:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:12:55] !log slyngshede@dns1004 FAIL - running authdns-update [12:17:58] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add lswtest back as being planned won't work - cmooney@cumin1003" [12:18:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add lswtest back as being planned won't work - cmooney@cumin1003" [12:18:31] !log installing init-system-helpers bugfix updates from Bookworm point release [12:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:31] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [12:21:43] !log slyngshede@dns1004 START - running authdns-update [12:22:15] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1284616 (owner: 10L10n-bot) [12:23:04] !log slyngshede@dns1004 FAIL - running authdns-update [12:24:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install5004.wikimedia.org with OS bookworm [12:24:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host install5004.wikimedia.org [12:24:40] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11898378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install5004.wikimedia.org with OS bookworm completed: - inst... [12:25:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [12:26:31] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [12:30:11] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1205.eqiad.wmnet with OS trixie [12:35:32] (03PS1) 10Muehlenhoff: Assign the installserver role to install5004 [puppet] - 10https://gerrit.wikimedia.org/r/1284626 (https://phabricator.wikimedia.org/T421863) [12:37:45] (03PS1) 10DCausse: cirrus: use a keywork tokenizer for the plain field for autocomplete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284628 (https://phabricator.wikimedia.org/T420427) [12:39:27] (03CR) 10Ayounsi: [C:03+1] Assign the installserver role to install5004 [puppet] - 
10https://gerrit.wikimedia.org/r/1284626 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [12:40:38] (03PS1) 10Ayounsi: eqsin: update install server IP [homer/public] - 10https://gerrit.wikimedia.org/r/1284629 (https://phabricator.wikimedia.org/T421863) [12:43:58] (03CR) 10Muehlenhoff: [C:03+2] Assign the installserver role to install5004 [puppet] - 10https://gerrit.wikimedia.org/r/1284626 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [12:44:20] !log sukhe@dns1004 START - running authdns-update [12:45:39] !log sukhe@dns1004 FAIL - running authdns-update [12:46:08] (03CR) 10Jforrester: [C:03+2] wikifunctions: move orchestrator -> eval calls to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284556 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [12:47:39] (03CR) 10Jforrester: [C:03+1] Remove custom user groups from Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281491 (https://phabricator.wikimedia.org/T423578) (owner: 10Zabe) [12:48:16] (03Merged) 10jenkins-bot: wikifunctions: move orchestrator -> eval calls to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284556 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [12:50:38] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [12:51:16] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [12:51:21] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [12:51:56] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [12:54:45] FIRING: CirrusStreamingUpdaterUnknownErrors: CirrusSearch consumer-cloudelastic@eqiad is failing write requests because of unknown errors - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterUnknownErrors [12:55:28] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [12:56:27] (03PS1) 10Kosta Harlan: EventStreamConfig: Register special_user_login event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284633 (https://phabricator.wikimedia.org/T425631) [12:58:12] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:11] (03CR) 10Trueg: openjdk-25-jre/openjdk-25-jdk (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: 10Trueg) [12:59:47] (03PS1) 10Jelto: miscweb: use url-downloader proxy for wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284634 (https://phabricator.wikimedia.org/T333143) [13:00:05] Urbanecm and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1300). [13:00:05] Tran and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:33] My two are a trivial config change and a train-sort-of-un-blocker. [13:00:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:00:47] But both can go out with Tran's stuff if they're here. [13:00:52] Or I can just ship it now? 
[13:01:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/3D] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284547 (owner: 10TheDJ) [13:01:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275467 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [13:01:42] Shipping. [13:02:51] !log slyngshede@cumin1003 conftool action : set/pooled=yes; selector: name=dns4004.wikimedia.org [reason: ulsfo switch refresh T408892] [13:02:55] T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892 [13:03:14] (03Merged) 10jenkins-bot: mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1275467 (https://phabricator.wikimedia.org/T423311) (owner: 10RLazarus) [13:03:18] (03Merged) 10jenkins-bot: Remove the progress bar [extensions/3D] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284547 (owner: 10TheDJ) [13:03:46] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1284547|Remove the progress bar]], [[gerrit:1275467|mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches (T423311)]] [13:03:49] T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?) 
- https://phabricator.wikimedia.org/T423311 [13:04:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [13:04:45] RESOLVED: CirrusStreamingUpdaterUnknownErrors: CirrusSearch consumer-cloudelastic@eqiad is failing write requests because of unknown errors - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterUnknownErrors [13:05:43] !log jforrester@deploy1003 rzl, jforrester, hartman: Backport for [[gerrit:1284547|Remove the progress bar]], [[gerrit:1275467|mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches (T423311)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:06:29] !log jforrester@deploy1003 rzl, jforrester, hartman: Continuing with deployment [13:08:42] Msz2001: Do you know if Tran is around to deploy their patches? [13:10:17] Should be [13:10:41] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284547|Remove the progress bar]], [[gerrit:1275467|mc: Set server, instead of host and port, for wgWikiLambdaObjectCaches (T423311)]] (duration: 06m 55s) [13:10:44] T423311: Writes to /*/wf-wan/ failing with CONNECTION FAILURE or SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY (mcrouter not being reached?) - https://phabricator.wikimedia.org/T423311 [13:11:16] It's not an area I'm familiar with, so I'd be hesitant to deploy. [13:11:29] I pinged Tran on Slack [13:11:36] Ack. Thanks. 
[13:14:21] FIRING: [2x] JobUnavailable: Reduced availability for job probes/swagger in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:15:07] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -3d 23h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [13:18:54] (03PS6) 10Bking: perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [13:19:15] o/ hello hello sorry I'm late; was distracted with some coding [13:19:21] FIRING: [2x] JobUnavailable: Reduced availability for job probes/swagger in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:19:26] (03CR) 10CI reject: [V:04-1] perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [13:19:54] There's no deployment going on – James's patch has been already deployed [13:20:38] Oh cool I'm going to go ahead and deploy mine then [13:21:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284553 (https://phabricator.wikimedia.org/T424008) (owner: 10STran) [13:21:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284569 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:22:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks 
good" [homer/public] - 10https://gerrit.wikimedia.org/r/1284629 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [13:22:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:22:48] (03Merged) 10jenkins-bot: Enable staggered rollout for IRS on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284553 (https://phabricator.wikimedia.org/T424008) (owner: 10STran) [13:23:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:23:35] (03PS1) 10Muehlenhoff: Update DHCP server for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1284636 (https://phabricator.wikimedia.org/T421863) [13:24:15] (03CR) 10Ayounsi: [C:03+1] Update DHCP server for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1284636 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [13:24:24] (03CR) 10Ayounsi: [C:03+2] eqsin: update install server IP [homer/public] - 10https://gerrit.wikimedia.org/r/1284629 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [13:24:56] (03Merged) 10jenkins-bot: Fix when user is considered exposed to the feature in the experiment [extensions/ReportIncident] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284569 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:25:07] (03PS4) 10Audrey Penven: Enable and configure WikiProjects prototype on WikiData beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) [13:25:26] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1284553|Enable 
staggered rollout for IRS on enwiki (T424008)]], [[gerrit:1284569|Fix when user is considered exposed to the feature in the experiment (T424075)]] [13:25:31] T424008: Enable enwiki trial at 5% of eligible users - https://phabricator.wikimedia.org/T424008 [13:25:31] T424075: Update instrumentation MVP for enwiki 5% rollout - https://phabricator.wikimedia.org/T424075 [13:26:20] (03Merged) 10jenkins-bot: eqsin: update install server IP [homer/public] - 10https://gerrit.wikimedia.org/r/1284629 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [13:27:24] !log stran@deploy1003 stran: Backport for [[gerrit:1284553|Enable staggered rollout for IRS on enwiki (T424008)]], [[gerrit:1284569|Fix when user is considered exposed to the feature in the experiment (T424075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:28:22] testing now [13:30:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270482 (https://phabricator.wikimedia.org/T421850) (owner: 10Audrey Penven) [13:30:15] (03CR) 10Muehlenhoff: [C:03+2] Remove role::mail::mx and related Puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1283025 (https://phabricator.wikimedia.org/T325394) (owner: 10Muehlenhoff) [13:30:17] looks good, continuing [13:30:21] !log stran@deploy1003 stran: Continuing with deployment [13:30:46] (03PS1) 10Ayounsi: ulsfo: remove VRRP checks on CR, add mgmt switch monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284640 (https://phabricator.wikimedia.org/T408892) [13:31:02] (03Abandoned) 10Muehlenhoff: profile::mail::mx: Mark the SMTP as intentionally open to the world [puppet] - 10https://gerrit.wikimedia.org/r/1283021 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [13:33:06] FIRING: SwitchCoreInterfaceDown: 
Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:31] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284553|Enable staggered rollout for IRS on enwiki (T424008)]], [[gerrit:1284569|Fix when user is considered exposed to the feature in the experiment (T424075)]] (duration: 09m 05s) [13:34:36] T424008: Enable enwiki trial at 5% of eligible users - https://phabricator.wikimedia.org/T424008 [13:34:37] T424075: Update instrumentation MVP for enwiki 5% rollout - https://phabricator.wikimedia.org/T424075 [13:35:07] (03PS2) 10Jelto: miscweb: use url-downloader proxy for wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284634 (https://phabricator.wikimedia.org/T333143) [13:36:05] done [13:39:46] (03PS2) 10Ayounsi: ulsfo: remove VRRP checks on CR, add mgmt switch monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284640 (https://phabricator.wikimedia.org/T408892) [13:40:04] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1284640 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [13:42:27] (03CR) 10Jasmine: [C:03+2] kafka-main: set codfw brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1283988 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [13:43:54] (03CR) 10Muehlenhoff: [C:03+2] Update DHCP server for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1284636 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [13:44:04] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! 
Or more to the point it will look good to me in Icinga :P" [puppet] - 10https://gerrit.wikimedia.org/r/1284640 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [13:44:44] on-callers (federico3, fabfur): for awareness, doing some work on the kafka clusters related to the 3.7 upgrade [13:47:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:48:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [13:48:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:49:04] (03PS3) 10Jelto: miscweb: use url-downloader proxy for wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284634 (https://phabricator.wikimedia.org/T333143) [13:51:06] thanks jasmine_ [13:51:42] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:52:02] jasmine@cumin2002 roll-restart-reboot-brokers (PID 3733189) is awaiting input [13:52:40] (03CR) 10Trueg: [C:03+1] blazegraph: group alerts by instance [alerts] - 10https://gerrit.wikimedia.org/r/1278493 (https://phabricator.wikimedia.org/T418708) (owner: 10Gmodena) [13:53:29] (03CR) 10Trueg: [C:03+2] 
blazegraph: group alerts by instance [alerts] - 10https://gerrit.wikimedia.org/r/1278493 (https://phabricator.wikimedia.org/T418708) (owner: 10Gmodena) [13:53:31] !log jasmine@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw [13:53:56] (03CR) 10Ayounsi: [C:03+2] ulsfo: remove VRRP checks on CR, add mgmt switch monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284640 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [13:54:09] (03PS4) 10Jelto: miscweb: use url-downloader proxy for wmf-navigator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284634 (https://phabricator.wikimedia.org/T333143) [13:54:35] (03CR) 10Ayounsi: [C:03+1] ulsfo: remove VRRP checks on CR, add mgmt switch monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284640 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [13:54:45] (03CR) 10Ayounsi: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1284640 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [13:55:28] (03Merged) 10jenkins-bot: blazegraph: group alerts by instance [alerts] - 10https://gerrit.wikimedia.org/r/1278493 (https://phabricator.wikimedia.org/T418708) (owner: 10Gmodena) [13:56:37] (03PS1) 10Jasmine: kafka-main: set eqiad (all) brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1284646 (https://phabricator.wikimedia.org/T419216) [13:57:52] (03CR) 10Jasmine: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1284646 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [13:57:57] FIRING: [2x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:15] (03CR) 10Ayounsi: [C:03+2] ulsfo: 
remove VRRP checks on CR, add mgmt switch monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1284640 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [13:58:41] (03PS1) 10Majavah: P:openstack: neutron: Remove unused nic_rename_mac setting [puppet] - 10https://gerrit.wikimedia.org/r/1284647 [13:58:41] (03PS1) 10Majavah: interface: Remove unused rename define [puppet] - 10https://gerrit.wikimedia.org/r/1284648 [13:58:41] (03PS1) 10Majavah: P:openstack: neutron: Remove unused l3_agent_bridge settings [puppet] - 10https://gerrit.wikimedia.org/r/1284649 [13:58:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [13:58:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:08:33] PROBLEM - MariaDB Replica Lag: backup1-eqiad on db1205 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:08:33] PROBLEM - mysqld processes on db1205 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:08:33] PROBLEM - MariaDB Replica SQL: backup1-eqiad on db1205 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:08:37] PROBLEM - MariaDB Replica IO: backup1-eqiad on db1205 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:08:37] PROBLEM - MariaDB read only backup1-eqiad on db1205 is CRITICAL: Could not connect to localhost:3306 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:08:56] (03PS7) 10Bking: perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [14:10:19] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply [14:10:27] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [14:11:34] FIRING: DiskSpace: Disk space build2001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:12:05] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [14:12:20] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [14:12:22] !log jasmine@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [14:12:24] (03CR) 10Bking: [C:03+2] perf(opensearch): increase 'vm.max_map_count' to 1048576 [puppet] - 10https://gerrit.wikimedia.org/r/1282320 (https://phabricator.wikimedia.org/T425301) (owner: 10Gehel) [14:15:07] (03PS1) 10Ayounsi: Icinga: display Nokia image for network devices [puppet] - 10https://gerrit.wikimedia.org/r/1284656 [14:15:14] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1284656 (owner: 10Ayounsi) [14:15:50] (03PS4) 10Daniel Kinzler: Move Makefiles to standard location [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) [14:15:52] (03CR) 10Daniel Kinzler: Move Makefiles to standard location (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) 
[14:16:34] RESOLVED: DiskSpace: Disk space build2001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:17:21] (03PS5) 10Jelto: miscweb: use url-downloader proxy for wmf-navigator, add service listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284634 (https://phabricator.wikimedia.org/T333143) [14:18:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:19:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [14:21:12] (03PS1) 10Muehlenhoff: Point webproxy in eqsin to install5004 [dns] - 10https://gerrit.wikimedia.org/r/1284657 (https://phabricator.wikimedia.org/T421863) [14:22:44] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1284656 (owner: 10Ayounsi) [14:23:02] (03CR) 10Ayounsi: [C:03+1] Point webproxy in eqsin to install5004 [dns] - 10https://gerrit.wikimedia.org/r/1284657 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [14:23:13] (03CR) 10Ayounsi: [C:03+2] Icinga: display Nokia image for network devices [puppet] - 10https://gerrit.wikimedia.org/r/1284656 (owner: 10Ayounsi) [14:23:51] FIRING: SwaggerProbeHasFailures: Not all 
openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=ulsfo - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:23:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:24:23] (03CR) 10JMeybohm: [C:03+1] "this is the way" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284634 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [14:24:36] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [14:25:10] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/device-analytics: apply [14:25:13] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [14:26:04] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply [14:26:20] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [14:27:07] (03CR) 10Jelto: [C:03+2] miscweb: use url-downloader proxy for wmf-navigator, add service listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284634 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [14:28:25] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating DNS snippets - slyngshede@cumin1003" [14:28:25] (03CR) 10JMeybohm: [C:03+1] kafka-main: set eqiad (all) brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1284646 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [14:28:26] (03CR) 10Muehlenhoff: [C:03+2] Point webproxy in 
eqsin to install5004 [dns] - 10https://gerrit.wikimedia.org/r/1284657 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [14:28:30] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating DNS snippets - slyngshede@cumin1003" [14:28:30] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:47] !log jmm@dns1004 START - running authdns-update [14:28:51] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [14:29:06] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [14:29:34] FIRING: DiskSpace: Disk space build2001:9100:/ 3.433% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [14:29:36] (03Merged) 10jenkins-bot: miscweb: use url-downloader proxy for wmf-navigator, add service listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284634 (https://phabricator.wikimedia.org/T333143) (owner: 10Jelto) [14:29:42] moritzm: let us know how that goes. 
slyngs and I were having some fun with that, should hopefully work for you [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1430) [14:30:11] !log jmm@dns1004 END - running authdns-update [14:30:25] !log Deploying Refinery at 4734c67 for weekly deployment train [14:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:36] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [14:30:39] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [14:30:52] !log slyngshede@dns1004 START - running authdns-update [14:30:58] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply [14:30:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:31:07] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [14:31:23] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:31:57] (03CR) 10Jasmine: [C:03+2] kafka-main: set eqiad (all) brokers inter-broker protocol to 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1284646 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [14:31:58] !log akhatun@deploy1003 Started deploy [analytics/refinery@4734c67] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4734c67c] [14:32:01] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:32:12] !log slyngshede@dns1004 END - running authdns-update 
[14:32:41] sukhe: the DNS deploy for the proxy change worked just fine [14:32:47] !log slyngshede@cumin1003 conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=ulsfo [reason: ulsfo switch refresh T408892] [14:32:51] T408892: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892 [14:33:15] moritzm: nice, thanks to slyngs :) [14:33:52] !log akhatun@deploy1003 Finished deploy [analytics/refinery@4734c67] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4734c67c] (duration: 01m 54s) [14:34:09] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1282372 (https://phabricator.wikimedia.org/T422646) (owner: 10Majavah) [14:34:58] (03CR) 10Majavah: [C:03+2] P:zookeeper: Allow WMCS to use cloud-private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/1282372 (https://phabricator.wikimedia.org/T422646) (owner: 10Majavah) [14:35:36] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [14:35:49] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [14:36:01] !log akhatun@deploy1003 Started deploy [analytics/refinery@4734c67]: Regular analytics weekly train [analytics/refinery@4734c67c] [14:36:59] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: apply [14:37:06] (03PS1) 10Muehlenhoff: Remove ganeti5004 from eqsin cluster [puppet] - 10https://gerrit.wikimedia.org/r/1284665 (https://phabricator.wikimedia.org/T421863) [14:37:07] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [14:37:58] RESOLVED: [2x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:34] (03Abandoned) 
10Phuedx: mw.testKitchen.getExperiment() -> mw.testKitchen.compat.getExperiment() [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1280210 (https://phabricator.wikimedia.org/T419513) (owner: 10Phuedx) [14:40:07] !log jasmine@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad [14:40:39] !log akhatun@deploy1003 Finished deploy [analytics/refinery@4734c67]: Regular analytics weekly train [analytics/refinery@4734c67c] (duration: 04m 38s) [14:41:31] FIRING: Traffic on tunnel link: Alert for device cr1-drmrs.wikimedia.org - Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [14:42:01] !log akhatun@deploy1003 Started deploy [analytics/refinery@4734c67] (thin): Regular analytics weekly train THIN [analytics/refinery@4734c67c] [14:43:12] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [14:43:29] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [14:44:02] !log akhatun@deploy1003 Finished deploy [analytics/refinery@4734c67] (thin): Regular analytics weekly train THIN [analytics/refinery@4734c67c] (duration: 02m 01s) [14:46:31] RESOLVED: Traffic on tunnel link: Device cr1-drmrs.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [14:50:06] (03CR) 10Trueg: openjdk-25-jre/openjdk-25-jdk (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1283027 (https://phabricator.wikimedia.org/T425636) (owner: 10Trueg) [14:50:50] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:52:13] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [14:52:29] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [14:52:30] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:53:28] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/media-analytics: apply [14:53:31] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [14:54:04] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [14:54:13] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [14:54:23] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [14:54:39] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [14:54:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [14:54:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:55:42] (PS1) Jelto: miscweb: remove VITE_API_BASE_URL from wmf-navigator [deployment-charts] - https://gerrit.wikimedia.org/r/1284669 (https://phabricator.wikimedia.org/T414405) [14:55:52] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11899153 (Dzahn) I assume this is also a request for the LDAP groups `wmde` and `nda` because that is standard for WMDE staff.
[14:58:07] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11899159 (catherine.kelsey.wmde) Yes please, I actually created another task for these groups here: https://phabricator.wikimedia.org/T425566 - a... [14:58:30] (CR) Snwachukwu: [C:+2] Edit Analytics base image bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284179 (https://phabricator.wikimedia.org/T425310) (owner: Snwachukwu) [14:58:52] (CR) Snwachukwu: [V:+2 C:+2] Edit Analytics base image bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284179 (https://phabricator.wikimedia.org/T425310) (owner: Snwachukwu) [14:58:58] !log jasmine@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad [14:59:45] (PS1) Muehlenhoff: profile::mail::smarthost: Remove exim4-smtp firewall service [puppet] - https://gerrit.wikimedia.org/r/1284671 (https://phabricator.wikimedia.org/T149804) [14:59:52] (CR) Jelto: [C:+2] miscweb: remove VITE_API_BASE_URL from wmf-navigator [deployment-charts] - https://gerrit.wikimedia.org/r/1284669 (https://phabricator.wikimedia.org/T414405) (owner: Jelto) [15:00:04] brennen and jeena: #bothumor I � Unicode. All rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1500).
[15:01:22] !log Deployed refinery using scap, then deployed onto hdfs [15:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:33] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 6044.43 ms [15:02:19] RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 0%, RTA = 649.80 ms [15:02:58] FIRING: ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:06] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:03:06] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:03:16] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:03:25] (Merged) jenkins-bot: miscweb: remove VITE_API_BASE_URL from wmf-navigator [deployment-charts] - https://gerrit.wikimedia.org/r/1284669 (https://phabricator.wikimedia.org/T414405) (owner: Jelto) [15:03:33] RECOVERY - mysqld processes on db1205 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:03:39] RECOVERY - MariaDB read only backup1-eqiad on db1205 is OK: Version 10.11.16-MariaDB-log, Uptime 56s, read_only: True, event_scheduler: True, 12.35 QPS, connection latency: 0.030271s, query latency: 0.000592s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:05:00] SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11899187 (MoritzMuehlenhoff) [15:05:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:05:46] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [15:06:16] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [15:06:28] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [15:06:33] RECOVERY - MariaDB Replica SQL: backup1-eqiad on db1205 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:06:37] RECOVERY - MariaDB Replica IO: backup1-eqiad on db1205 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:06:47] (CR) Ayounsi: [C:+1] Remove ganeti5004 from eqsin cluster [puppet] - https://gerrit.wikimedia.org/r/1284665 (https://phabricator.wikimedia.org/T421863) (owner: Muehlenhoff) [15:06:57] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [15:07:30] ACKNOWLEDGEMENT - MariaDB Replica Lag: backup1-eqiad on db1205 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4100.10 seconds Jcrespo recovery after reimage https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:07:33] RECOVERY - MariaDB Replica Lag: backup1-eqiad on db1205 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:07:37] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1284671 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff) [15:07:57] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:25] (PS2) Muehlenhoff:
profile::mail::smarthost: Remove exim4-smtp firewall service [puppet] - https://gerrit.wikimedia.org/r/1284671 (https://phabricator.wikimedia.org/T149804) [15:09:30] (PS2) Majavah: P:openstack: neutron: Remove unused nic_rename_mac setting [puppet] - https://gerrit.wikimedia.org/r/1284647 [15:09:30] (PS2) Majavah: interface: Remove unused rename define [puppet] - https://gerrit.wikimedia.org/r/1284648 [15:09:30] (PS2) Majavah: P:openstack: neutron: Remove unused l3_agent_bridge settings [puppet] - https://gerrit.wikimedia.org/r/1284649 [15:09:31] (PS1) Majavah: P:openstack: nova: Drop network_flat_tagged_base_interface option [puppet] - https://gerrit.wikimedia.org/r/1284674 [15:09:32] (PS1) Majavah: P:openstack: nova: Set MTU on flat VLAN interface in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1284675 (https://phabricator.wikimedia.org/T425172) [15:10:13] (PS2) Majavah: P:openstack: nova: Set MTU on flat VLAN interface in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1284675 (https://phabricator.wikimedia.org/T425674) [15:11:23] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1284675 (https://phabricator.wikimedia.org/T425674) (owner: Majavah) [15:12:00] on-callers (federico3, fabfur): kafka clusters work done!
thanks [15:12:25] thanks [15:12:41] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool ulsfo for service: upload-addrs [reason: no reason specified, no task ID specified] [15:12:43] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo for service: upload-addrs [reason: no reason specified, no task ID specified] [15:12:59] (CR) Ebernhardson: [C:+1] cirrus: use a keywork tokenizer for the plain field for autocomplete [mediawiki-config] - https://gerrit.wikimedia.org/r/1284628 (https://phabricator.wikimedia.org/T420427) (owner: DCausse) [15:15:19] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4050.ulsfo.wmnet [15:16:57] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:17:03] yeah ok [15:17:05] !ack [15:17:06] 7917 (ACKED) [2x] ProbeDown sre (2620:0:863:ed1a::2:b ip6 probes/service ulsfo) [15:17:25] !ack [15:17:25] All incidents are already acked. [15:17:47] @sukhe: expected? [15:18:18] !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4009*} and A:liberica [15:18:26] !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4009*} and A:liberica [15:18:34] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11899290 (Dzahn) @catherine.kelsey.wmde Ah, all good. It's perfectly fine to create another task; if not even better. I had not noticed and this...
[15:18:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=ulsfo - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:19:21] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:47] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. for catherinekelsey - https://phabricator.wikimedia.org/T425566#11899297 (Dzahn) @catherine.kelsey.wmde Hi, could you please send an email to [[https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Francis]] of the legal team that yo... [15:21:51] (PS1) Ladsgroup: Disable FR on wikinews wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1284683 (https://phabricator.wikimedia.org/T423577) [15:22:01] !log sukhe@cumin1003 START - Cookbook sre.hosts.remove-downtime for 39 hosts [15:22:12] (PS1) Snwachukwu: Edit Analytics Production Base Image Bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284684 (https://phabricator.wikimedia.org/T425310) [15:22:25] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 39 hosts [15:23:04] !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.upgrade restart P{lvs4009*} and A:liberica [15:23:19] !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs4009.ulsfo.wmnet} and A:liberica [15:23:30] !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4009.ulsfo.wmnet} and A:liberica [15:23:41] !log sukhe@cumin1003 START - Cookbook sre.loadbalancer.admin pooling P{lvs4009.ulsfo.wmnet} and A:liberica [15:23:58]
Amir1: why did i read your commit message as 'Disable French on Wikinews wikis' [15:24:03] !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs4009.ulsfo.wmnet} and A:liberica [15:24:05] !log sukhe@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restart P{lvs4009*} and A:liberica [15:26:37] PROBLEM - NTP peers and stratum check on dns4004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [15:26:57] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:22] (PS1) Arlolra: Provide page context for LintErrorChecker [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1284687 (https://phabricator.wikimedia.org/T419596) [15:30:24] (CR) Snwachukwu: [C:+2] Edit Analytics Production Base Image Bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284684 (https://phabricator.wikimedia.org/T425310) (owner: Snwachukwu) [15:30:50] (CR) Snwachukwu: [V:+2 C:+2] Edit Analytics Production Base Image Bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284684 (https://phabricator.wikimedia.org/T425310) (owner: Snwachukwu) [15:31:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [core] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1284687 (https://phabricator.wikimedia.org/T419596) (owner: Arlolra) [15:31:29] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns4004 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h).
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:31:39] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [15:31:54] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo for service: upload-addrs [reason: no reason specified, no task ID specified] [15:31:54] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [15:31:56] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo for service: upload-addrs [reason: no reason specified, no task ID specified] [15:32:41] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [15:32:55] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [15:44:20] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox and A:ulsfo and (A:dnsbox) [15:49:37] ops-esams, SRE, Commons, DC-Ops, and 3 others: ESAMS and others serving older revisions of overwritten files - https://phabricator.wikimedia.org/T425216#11899449 (ssingh) ` sukhe@cumin1003:~$ sudo cumin "A:cp and not P{cp2041* or cp2042*}" "curl -s https://upload.wikimedia.org/wikipedia/test/4/45... [15:50:34] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on ms-backup[2003-2004].codfw.wmnet,ms-backup[1003-1004].eqiad.wmnet with reason: restart [15:51:44] (PS1) Jelto: miscweb: bump wmf-navigator images and set wiki base urls [deployment-charts] - https://gerrit.wikimedia.org/r/1284694 (https://phabricator.wikimedia.org/T414405) [15:54:49] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns4004 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed.
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:54:50] (CR) Jelto: [C:+2] miscweb: bump wmf-navigator images and set wiki base urls [deployment-charts] - https://gerrit.wikimedia.org/r/1284694 (https://phabricator.wikimedia.org/T414405) (owner: Jelto) [15:57:26] (Merged) jenkins-bot: miscweb: bump wmf-navigator images and set wiki base urls [deployment-charts] - https://gerrit.wikimedia.org/r/1284694 (https://phabricator.wikimedia.org/T414405) (owner: Jelto) [15:59:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... [15:59:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:00:00] !ack [16:00:00] !ack [16:00:01] All incidents are already acked. [16:00:03] 7918 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [16:00:05] jhathaway and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS.
[16:01:25] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [16:02:14] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox and A:ulsfo and (A:dnsbox) [16:04:49] RECOVERY - NTP peers and stratum check on dns4004 is OK: NTP OK: Offset -4.63e-06 secs, stratum=2 https://wikitech.wikimedia.org/wiki/NTP [16:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:22] (PS1) Cathal Mooney: Nokia: Create ACL prefix-list entry for BGP peers and use in CPM acl [homer/public] - https://gerrit.wikimedia.org/r/1284695 (https://phabricator.wikimedia.org/T425703) [16:08:49] (CR) CI reject: [V:-1] Nokia: Create ACL prefix-list entry for BGP peers and use in CPM acl [homer/public] - https://gerrit.wikimedia.org/r/1284695 (https://phabricator.wikimedia.org/T425703) (owner: Cathal Mooney) [16:09:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:42] (PS1) Ssingh: geo-maps: temporarily move California to codfw [dns] - https://gerrit.wikimedia.org/r/1284699 [16:11:20] (PS2) Cathal Mooney: Nokia: Create ACL prefix-list entry for BGP peers and use in CPM acl [homer/public] - https://gerrit.wikimedia.org/r/1284695 (https://phabricator.wikimedia.org/T425703) [16:12:27] (CR) Fabfur: [C:+1] geo-maps: temporarily move California to codfw [dns] - https://gerrit.wikimedia.org/r/1284699 (owner: Ssingh) [16:12:44] (CR) Ssingh: [C:+2] geo-maps: temporarily move California to codfw [dns] -
https://gerrit.wikimedia.org/r/1284699 (owner: Ssingh) [16:12:44] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1284575 (https://phabricator.wikimedia.org/T424102) (owner: Esanders) [16:12:55] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [16:13:01] !log sukhe@dns1004 START - running authdns-update [16:13:06] !log sukhe@dns1004 START - running authdns-update [16:14:18] (PS1) Jelto: miscweb: remove typo in wmf-navigator HTTP_PROXY setting [deployment-charts] - https://gerrit.wikimedia.org/r/1284701 (https://phabricator.wikimedia.org/T414405) [16:14:40] (CR) Cathal Mooney: [C:-1] "Not to be merged until Areleon support source-prefix" [homer/public] - https://gerrit.wikimedia.org/r/1284695 (https://phabricator.wikimedia.org/T425703) (owner: Cathal Mooney) [16:14:49] !log sukhe@dns1004 END - running authdns-update [16:15:09] (PS1) DLynch: Remove duplicate definition of EditCheckAction#isTagged [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1284702 (https://phabricator.wikimedia.org/T425583) [16:15:32] (PS1) DLynch: Save action filtering info in ContentBranchNodeCheck#onDocumentChange [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1284703 (https://phabricator.wikimedia.org/T425583) [16:15:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1284702 (https://phabricator.wikimedia.org/T425583) (owner: DLynch) [16:15:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC late backport
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - https://gerrit.wikimedia.org/r/1284703 (https://phabricator.wikimedia.org/T425583) (owner: DLynch) [16:17:42] (PS1) Ssingh: Revert "geo-maps: temporarily move California to codfw" [dns] - https://gerrit.wikimedia.org/r/1284705 [16:18:20] (CR) Jelto: [C:+2] miscweb: remove typo in wmf-navigator HTTP_PROXY setting [deployment-charts] - https://gerrit.wikimedia.org/r/1284701 (https://phabricator.wikimedia.org/T414405) (owner: Jelto) [16:21:02] (Merged) jenkins-bot: miscweb: remove typo in wmf-navigator HTTP_PROXY setting [deployment-charts] - https://gerrit.wikimedia.org/r/1284701 (https://phabricator.wikimedia.org/T414405) (owner: Jelto) [16:25:21] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 36 hosts with reason: restart [16:30:23] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2183.codfw.wmnet,db1204.eqiad.wmnet with reason: restart [16:32:34] !log restarting backup1-* database primary hosts [16:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:33:52] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [16:34:15] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [16:34:21] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:58] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [16:38:36] (Abandoned) Kosta Harlan: EventStreamConfig: Register special_user_login event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1284633 (https://phabricator.wikimedia.org/T425631) (owner: Kosta Harlan) [16:39:21] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:39:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:21] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:40:35] !ack [16:40:35] All incidents are already acked.
[16:40:35] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11899685 (Dzahn) [16:47:02] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [16:48:10] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [16:48:46] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [16:50:18] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2199.codfw.wmnet,db1245.eqiad.wmnet with reason: restart [16:55:21] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-ulsfo:xe-0/1/3 (Peering: ... [16:55:21] SFMIX (N/A) {#1061}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=ulsfo+prometheus%2Fops&var-device=cr3-ulsfo:9804&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [16:55:31] !ack [16:55:32] All incidents are already acked. [17:00:05] bd808: Your horoscope predicts another Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1700) [17:06:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr4-ulsfo:xe-0/1/4 (Peering: ...
[17:06:51] Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=ulsfo+prometheus%2Fops&var-device=cr4-ulsfo:9804&var-interface=xe-0%2F1%2F4 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:06:54] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2200.codfw.wmnet,db1216.eqiad.wmnet with reason: restart [17:06:57] uh? [17:07:16] hmm [17:07:19] !ack [17:07:20] 7919 (ACKED) TransitPeeringTransportOutSaturation network sre (cr4-ulsfo:9804 Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749} xe-0/1/4 gnmi ulsfo) [17:11:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-ulsfo:xe-0/1/3 (Peering: SFMIX (N/A) {#1061}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:12:13] !ack [17:12:14] All incidents are already acked. 
[17:15:07] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -4d 3h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [17:18:34] !incidents [17:18:35] 7919 (ACKED) TransitPeeringTransportOutSaturation network sre (cr4-ulsfo:9804 Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749} xe-0/1/4 gnmi ulsfo) [17:18:35] 7918 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [17:18:35] 7917 (RESOLVED) [2x] ProbeDown sre (2620:0:863:ed1a::2:b ip6 probes/service ulsfo) [17:18:36] 7916 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [17:18:36] 7913 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [17:21:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr4-ulsfo:xe-0/1/4 (Peering: ... [17:21:51] Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=ulsfo+prometheus%2Fops&var-device=cr4-ulsfo:9804&var-interface=xe-0%2F1%2F4 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:22:00] !ack [17:22:00] All incidents are already acked. 
[17:22:53] !incidents [17:22:53] 7919 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr4-ulsfo:9804 Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749} xe-0/1/4 gnmi ulsfo) [17:22:53] 7918 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [17:22:53] 7917 (RESOLVED) [2x] ProbeDown sre (2620:0:863:ed1a::2:b ip6 probes/service ulsfo) [17:22:54] 7916 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [17:22:54] 7913 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [17:26:28] (PS1) Snwachukwu: AQS Services Staging Base Image Bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284740 (https://phabricator.wikimedia.org/T425310) [17:28:31] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:29:06] (CR) Snwachukwu: [C:+2] AQS Services Staging Base Image Bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284740 (https://phabricator.wikimedia.org/T425310) (owner: Snwachukwu) [17:29:30] (CR) Snwachukwu: [V:+2 C:+2] AQS Services Staging Base Image Bump to Trixie [deployment-charts] - https://gerrit.wikimedia.org/r/1284740 (https://phabricator.wikimedia.org/T425310) (owner: Snwachukwu) [17:30:21] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-ulsfo:xe-0/1/3 (Peering: SFMIX (N/A) {#1061}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:30:23] (PS2) Krinkle: Profiler: Set explicit "excimer-wall" redis channel instead of concat [mediawiki-config] - https://gerrit.wikimedia.org/r/1237662 [17:30:27]
(CR) Krinkle: Profiler: Set explicit "excimer-wall" redis channel instead of concat (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1237662 (owner: Krinkle) [17:30:37] !ack [17:30:37] !ack [17:30:37] 7920 (ACKED) TransitPeeringTransportOutSaturation network sre (gnmi ulsfo) [17:30:38] All incidents are already acked. [17:32:02] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1237662 (owner: Krinkle) [17:32:45] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: apply [17:32:57] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [17:33:03] (Merged) jenkins-bot: Profiler: Set explicit "excimer-wall" redis channel instead of concat [mediawiki-config] - https://gerrit.wikimedia.org/r/1237662 (owner: Krinkle) [17:33:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:33:34] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1237662|Profiler: Set explicit "excimer-wall" redis channel instead of concat]] [17:44:46] (PS1) Snwachukwu: Place version under main_app key in device-analytics values-staging file.
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1284747 (https://phabricator.wikimedia.org/T425310) [17:45:43] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply [17:45:55] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [17:47:03] (03CR) 10Snwachukwu: [C:03+2] Place version under main_app key in device-analytics values-staging file. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284747 (https://phabricator.wikimedia.org/T425310) (owner: 10Snwachukwu) [17:47:15] (03CR) 10Snwachukwu: [V:03+2 C:03+2] Place version under main_app key in device-analytics values-staging file. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284747 (https://phabricator.wikimedia.org/T425310) (owner: 10Snwachukwu) [17:50:53] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1237662|Profiler: Set explicit "excimer-wall" redis channel instead of concat]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:51:15] !log krinkle@deploy1003 krinkle: Continuing with deployment [17:54:07] 10SRE-Access-Requests, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11900021 (10Dzahn) Any updates on this? The access request tag makes this part of the clinic duty workboard. [17:55:21] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:55:34] !ack [17:55:34] All incidents are already acked. 
[17:56:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11900025 (10Dzahn) [17:58:57] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply [17:59:07] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [18:00:05] brennen and jeena: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T1800). nyaa~ [18:00:21] RESOLVED: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [18:00:28] o/ [18:00:29] !ack [18:00:29] All incidents are already acked. [18:00:42] brennen: you're good to go with the train [18:01:03] cdanis: thanks, i was about to ask. have been in other windows and am not caught up on scrollback. [18:01:12] (03PS1) 10Muehlenhoff: Blacklist some network protocols as defence in depth [puppet] - 10https://gerrit.wikimedia.org/r/1284760 [18:01:37] grattoir du jour [18:02:08] !log 1.47.0-wmf.1 train status (T423910): blockers resolved, rolling to all wikis [18:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:11] T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910 [18:02:37] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11900058 (10Dzahn) Hi @ArthurTaylor You are already a member of the requested group. ` [an-master1003:~] $ id arthurtaylor uid=45664(arthurtaylor) gid=500(wikidev) grou... 
[18:02:49] (03PS2) 10Muehlenhoff: Blacklist some network protocols as defence in depth [puppet] - 10https://gerrit.wikimedia.org/r/1284760 [18:02:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-23-ulsfo:ethernet-1/55 (Core: cr3-ulsfo:et-0/0/2 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-23-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:02:58] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1237662|Profiler: Set explicit "excimer-wall" redis channel instead of concat]] (duration: 29m 24s) [18:03:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1284760 (owner: 10Muehlenhoff) [18:03:35] brennen: all yours [18:03:40] (03CR) 10Fabfur: [C:03+1] Revert "geo-maps: temporarily move California to codfw" [dns] - 10https://gerrit.wikimedia.org/r/1284705 (owner: 10Ssingh) [18:03:44] (03CR) 10CDanis: [C:03+2] Revert "geo-maps: temporarily move California to codfw" [dns] - 10https://gerrit.wikimedia.org/r/1284705 (owner: 10Ssingh) [18:04:29] !log cdanis@dns1005 START - running authdns-update [18:04:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:05:05] (03PS1) 10TrainBranchBot: group2 to 1.47.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284768 (https://phabricator.wikimedia.org/T423910) [18:05:08] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284768 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [18:05:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:05:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:06:09] !log cdanis@dns1005 END - running authdns-update [18:06:18] (03CR) 10CDanis: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1284760 (owner: 10Muehlenhoff) [18:06:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:06:44] (03Merged) 10jenkins-bot: group2 to 1.47.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284768 (https://phabricator.wikimedia.org/T423910) (owner: 10TrainBranchBot) [18:07:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:08:36] (03PS1) 10Snwachukwu: AQS services Production Runtime Base Image Bump to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284770 (https://phabricator.wikimedia.org/T425310) [18:08:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11900078 (10Dzahn) [18:09:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, 
wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:09:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:10:12] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11900085 (10Dzahn) [18:12:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:14:52] (03PS1) 10Mmartorana: Make email confirmation banner a standalone RL module [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284771 (https://phabricator.wikimedia.org/T425677) [18:15:14] (03CR) 10Snwachukwu: [C:03+2] AQS services Production Runtime Base Image Bump to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284770 (https://phabricator.wikimedia.org/T425310) (owner: 10Snwachukwu) [18:15:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:15:25] (03CR) 10Snwachukwu: [V:03+2 C:03+2] AQS services Production Runtime Base Image Bump to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1284770 (https://phabricator.wikimedia.org/T425310) (owner: 10Snwachukwu) 
[18:15:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284771 (https://phabricator.wikimedia.org/T425677) (owner: 10Mmartorana) [18:16:45] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 105779456 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:17:16] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.1 refs T423910 [18:17:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:17:19] T423910: 1.47.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T423910 [18:18:23] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11900137 (10MoritzMuehlenhoff) [18:18:44] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [18:18:45] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3421328 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:19:03] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [18:19:52] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [18:20:09] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [18:20:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:20:20] 06SRE, 
10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11900140 (10Dzahn) >>! In T420500#11886631, @AnnieKim_WMDE wrote: > Hello! Is there anything else I can or n... [18:20:51] (03CR) 10Muehlenhoff: [C:03+2] Blacklist some network protocols as defence in depth [puppet] - 10https://gerrit.wikimedia.org/r/1284760 (owner: 10Muehlenhoff) [18:20:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:21:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:21:36] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [18:21:49] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [18:22:30] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/device-analytics: apply [18:22:44] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [18:24:48] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [18:24:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:25:04] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [18:25:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, 
wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:26:15] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [18:26:29] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [18:26:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:27:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:27:53] (03CR) 10Snwachukwu: "deployed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1281589 (https://phabricator.wikimedia.org/T412830) (owner: 10Eevans) [18:28:16] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11900153 (10Dzahn) >>! In T420500#11832838, @Martyn.ranyard wrote: > I think kerberos is automatically enabl... 
[18:29:50] FIRING: DiskSpace: Disk space build2001:9100:/ 0.6029% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:30:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:30:22] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11900168 (10Dzahn) 05Open→03In progress [18:30:25] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11900169 (10Dzahn) 05Open→03In progress [18:30:25] (03CR) 10RLazarus: "One more step: this adds the values files, but doesn't produce any diffs (i.e. 
"helmfile apply" won't create your new Kubernetes objects) " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1274165 (https://phabricator.wikimedia.org/T423627) (owner: 10Ecarg) [18:30:34] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11900170 (10Dzahn) 05Open→03In progress [18:30:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:30:59] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:31:08] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde & ldap/nda. 
for catherinekelsey - https://phabricator.wikimedia.org/T425566#11900173 (10Dzahn) 05Open→03In progress a:03catherine.kelsey.wmde [18:31:09] (03PS1) 10Dzahn: admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) [18:31:40] (03CR) 10Dzahn: [C:04-2] admin: add SSH key and kerberos for Annie Kim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/1284777 (https://phabricator.wikimedia.org/T420500) (owner: 10Dzahn) [18:31:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:34:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:40:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:41:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:41:26] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1283968/8535/zuul1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1283968 (owner: 10Hashar) [18:43:35] (03CR) 10Dzahn: [C:04-1] "not yet - just adding the -1 to clarify" [puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [18:45:31] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [18:47:20] (03CR) 10Dzahn: [V:03+1 C:03+2] "deployed on zuul1003 - almost no change except the role name" [puppet] - 10https://gerrit.wikimedia.org/r/1283968 (owner: 10Hashar) [18:49:11] !log vriley@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [pc1022~] - vriley@cumin1003" [18:49:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [pc1022~] - vriley@cumin1003" [18:49:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:41] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc1022 [18:51:18] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1022 [18:52:02] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host pc1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:58:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity and wmf LDAP group for GWeld - https://phabricator.wikimedia.org/T425727 (10GWeld) 03NEW [19:02:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11900271 (10MoritzMuehlenhoff) [19:04:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:05:10] (03PS1) 10Bking: dse-k8s: raise vm.max_map_count for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) [19:05:36] (03PS2) 10Bking: dse-k8s: raise vm.max_map_count for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) [19:07:03] oncall handoff: see _security and -private :-/ [19:07:46] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1284792 (https://phabricator.wikimedia.org/T425681) (owner: 10Bking) [19:08:26] vriley@cumin1003 reimage 
(PID 1207317) is awaiting input [19:09:37] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host pc1022.eqiad.wmnet with OS trixie [19:09:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11900292 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host pc1022.eqiad.wmnet with OS trixie [19:10:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:11:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:14:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:15:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:15:55] (03PS1) 10Andrew Bogott: setup_capi.sh: fix path for clusterctl.yaml, increase logging [puppet] - 10https://gerrit.wikimedia.org/r/1284801 [19:15:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:16:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:17:03] (03CR) 10Andrew Bogott: [C:03+2] setup_capi.sh: fix path for clusterctl.yaml, increase logging [puppet] - 10https://gerrit.wikimedia.org/r/1284801 (owner: 10Andrew Bogott) [19:18:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:19:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:26:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:27:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:30:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:30:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:35:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:36:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:40:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [19:41:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:41:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:44:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:45:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:45:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:46:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:46:34] 06SRE, 10Pywikibot, 06Traffic, 10Wikidata, and 3 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11900420 (10Epidosis) [19:49:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, 
wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:49:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:51:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:54:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:55:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:56:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:57:46] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:58:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:59:16] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T2000). [20:00:05] arlolra, kemayo, and manfredi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] o/ [20:00:35] o/ [20:00:37] I have three patches, and don't mind deploying them myself. [20:00:52] I can get started [20:01:49] manfredi asked me to deploy their patch even though they aren't here so I'll do both mine and theirs [20:02:10] Go for it. [20:02:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284687 (https://phabricator.wikimedia.org/T419596) (owner: 10Arlolra) [20:02:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284771 (https://phabricator.wikimedia.org/T425677) (owner: 10Mmartorana) [20:02:57] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1022.eqiad.wmnet with OS trixie [20:03:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11900485 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host pc1022.eqiad.wmnet with OS trixie executed with errors: - pc1022 (**FAIL**... 
[20:03:31] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:04:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:06:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:07:08] (03Merged) 10jenkins-bot: Provide page context for LintErrorChecker [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284687 (https://phabricator.wikimedia.org/T419596) (owner: 10Arlolra) [20:07:13] (03Merged) 10jenkins-bot: Make email confirmation banner a standalone RL module [extensions/WikimediaEvents] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284771 (https://phabricator.wikimedia.org/T425677) (owner: 10Mmartorana) [20:07:34] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1284687|Provide page context for LintErrorChecker (T419596)]], [[gerrit:1284771|Make email confirmation banner a standalone RL module (T425677)]] [20:07:39] T419596: LintHint script output (via MediaWiki API database request) and "Page information" disagree on Linter errors - https://phabricator.wikimedia.org/T419596 
[20:07:39] T425677: SRM for email confirmation banner experiment - https://phabricator.wikimedia.org/T425677 [20:08:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:09:23] !log arlolra@deploy1003 arlolra, mmartorana: Backport for [[gerrit:1284687|Provide page context for LintErrorChecker (T419596)]], [[gerrit:1284771|Make email confirmation banner a standalone RL module (T425677)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:09:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:10:11] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host pc1021.eqiad.wmnet with OS trixie [20:10:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11900517 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie [20:10:38] !log arlolra@deploy1003 arlolra, mmartorana: Continuing with deployment [20:14:52] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284687|Provide page context for LintErrorChecker (T419596)]], [[gerrit:1284771|Make email confirmation banner a standalone RL module (T425677)]] (duration: 07m 18s) [20:14:57] T419596: LintHint script output (via MediaWiki API database request) and "Page information" disagree on Linter errors - https://phabricator.wikimedia.org/T419596 [20:14:57] T425677: SRM for email confirmation banner experiment - https://phabricator.wikimedia.org/T425677 [20:15:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284575 
(https://phabricator.wikimedia.org/T424102) (owner: 10Esanders) [20:15:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284702 (https://phabricator.wikimedia.org/T425583) (owner: 10DLynch) [20:15:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284703 (https://phabricator.wikimedia.org/T425583) (owner: 10DLynch) [20:15:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:16:06] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a2 [vendor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284832 (https://phabricator.wikimedia.org/T319058) [20:16:18] (03Merged) 10jenkins-bot: Revert "Enable mobile editor abandonment survey on enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284575 (https://phabricator.wikimedia.org/T424102) (owner: 10Esanders) [20:16:46] (03PS1) 10C. Scott Ananian: composer.json: Update webonyx/graphql-php to ^15.32.3 [extensions/Wikibase] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284834 [20:17:03] (03PS2) 10C. Scott Ananian: Upgrading webonyx/graphql-php (v15.31.5 => v15.32.3) [vendor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284828 [20:17:18] (03PS1) 10C. 
Scott Ananian: Bump wikimedia/parsoid to 0.24.0-a2 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284837 (https://phabricator.wikimedia.org/T425731) [20:17:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:20:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:21:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:22:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:22:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/Wikibase] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284834 (owner: 10C. Scott Ananian) [20:23:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284837 (https://phabricator.wikimedia.org/T425731) (owner: 10C. Scott Ananian) [20:25:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:26:04] arlolra: how goes the backport? 
[20:26:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:26:23] cscott: O [20:26:34] I'm done with mine, Kemayo is doing theirs [20:26:57] i saw that jenkins was being a bit grumpy about that [20:27:00] CI is running. 🤷🏻 [20:27:10] And yes, CI has been extremely dysfunctional this week. [20:27:13] it's already failed on patch 2/2 with a spurious error [20:27:32] there's a retry button in spiderpig you can click to get it to resubmit and recheck those [20:27:53] Got to wait for it to actually think the job is done first, I think. [20:28:13] Well, I could stop it. But I have never actually done that. [20:28:14] yeah, you're still waiting for patch 1/2 to succeed (hopefully) [20:28:30] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:28:33] no it's a coin flip each time, you might as well take the heads on the first patch before you try to reflip the second [20:28:56] otherwise you're back to flipping two coins together and hoping they're both heads [20:29:12] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:29:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:30:52] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1021.eqiad.wmnet with reason: host reimage [20:30:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [20:31:32] (03Merged) 10jenkins-bot: Remove duplicate definition of EditCheckAction#isTagged [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284702 (https://phabricator.wikimedia.org/T425583) (owner: 10DLynch) [20:31:35] (03CR) 10CI reject: [V:04-1] Save action filtering info in ContentBranchNodeCheck#onDocumentChange [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284703 (https://phabricator.wikimedia.org/T425583) (owner: 10DLynch) [20:31:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284703 (https://phabricator.wikimedia.org/T425583) (owner: 10DLynch) [20:32:12] Okay, attempt two: go. [20:32:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:32:25] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.24.0-a2 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284837 (https://phabricator.wikimedia.org/T425731) (owner: 10C. Scott Ananian) [20:33:16] i've got four coins to flip after you, i'm a bit nervous about that. [20:33:58] (03CR) 10C. Scott Ananian: "recheck" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284837 (https://phabricator.wikimedia.org/T425731) (owner: 10C. 
Scott Ananian) [20:34:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1021.eqiad.wmnet with reason: host reimage [20:34:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:36:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:39:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:40:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:41:51] !log krinkle@deploy1003$ mwscript deleteEqualMessages.php nlwiki [20:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:56] !log krinkle@deploy1003$ mwscript deleteEqualMessages.php commonswiki [20:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:13] (03Merged) 10jenkins-bot: Save action filtering info in ContentBranchNodeCheck#onDocumentChange [extensions/VisualEditor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284703 (https://phabricator.wikimedia.org/T425583) (owner: 10DLynch) [20:42:19] woo [20:42:34] !log 
kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1284575|Revert "Enable mobile editor abandonment survey on enwiki" (T424102)]], [[gerrit:1284702|Remove duplicate definition of EditCheckAction#isTagged (T425583)]], [[gerrit:1284703|Save action filtering info in ContentBranchNodeCheck#onDocumentChange (T425583)]] [20:42:38] T424102: Deploy config change to stop "Exit the editor" survey (v1.0) - https://phabricator.wikimedia.org/T424102 [20:42:39] T425583: Link-based edit checks are not dismissable - https://phabricator.wikimedia.org/T425583 [20:43:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11900665 (10Jclark-ctr) @elukey 1006 and 1008's BMC and BIOS firmwares have been upgraded [20:44:14] !log kemayo@deploy1003 esanders, kemayo: Backport for [[gerrit:1284575|Revert "Enable mobile editor abandonment survey on enwiki" (T424102)]], [[gerrit:1284702|Remove duplicate definition of EditCheckAction#isTagged (T425583)]], [[gerrit:1284703|Save action filtering info in ContentBranchNodeCheck#onDocumentChange (T425583)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:44:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:45:03] !log kemayo@deploy1003 esanders, kemayo: Continuing with deployment [20:45:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:46:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:46:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:48:04] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:49:12] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284575|Revert "Enable mobile editor abandonment survey on enwiki" (T424102)]], [[gerrit:1284702|Remove duplicate definition of EditCheckAction#isTagged (T425583)]], [[gerrit:1284703|Save action filtering info in ContentBranchNodeCheck#onDocumentChange (T425583)]] (duration: 06m 38s) [20:49:16] T424102: Deploy config change to stop "Exit the editor" survey (v1.0) - https://phabricator.wikimedia.org/T424102 [20:49:17] T425583: Link-based edit checks are not dismissable - https://phabricator.wikimedia.org/T425583 [20:49:17] cscott: Okay, my deployment is done. Good luck! [20:49:44] fingers crossed. four heads in a row! four heads in a row! 
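Annotation on the coin-flip reasoning from 20:28 and 20:49: the sketch below works through the arithmetic, assuming each CI run passes independently with the same probability p. The value p = 1/2 is only the chat's figurative "coin flip", not a measured pass rate, and both function names and both retry models are hypothetical illustrations rather than a description of how scap or Zuul actually schedules rechecks.

```python
from fractions import Fraction


def chance_all_pass(p: Fraction, patches: int) -> Fraction:
    """Probability that every patch passes CI on a single joint attempt."""
    return p ** patches


def expected_runs_lock_in(p: Fraction, patches: int) -> Fraction:
    """Expected CI runs when a passed patch stays merged and only the
    failing patch is retried (the approach suggested in the chat)."""
    # Each patch needs 1/p runs on average (geometric distribution).
    return patches * (1 / p)


def expected_runs_all_at_once(p: Fraction, patches: int) -> Fraction:
    """Expected CI runs when any failure restarts every patch together."""
    # A joint attempt succeeds with probability p**patches, so the
    # expected number of joint attempts is 1 / p**patches.
    rounds = 1 / chance_all_pass(p, patches)
    return patches * rounds


coin = Fraction(1, 2)  # assumption: literal coin flip per CI run
print(chance_all_pass(coin, 4))            # chance of "four heads in a row"
print(expected_runs_lock_in(coin, 2))      # keep the first pass, retry the second
print(expected_runs_all_at_once(coin, 2))  # re-flip both coins each time
```

Under these assumptions, keeping the first success and retrying only the second patch costs fewer CI runs on average than resubmitting both, which matches the advice given in the channel.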
[20:50:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284828 (owner: 10C. Scott Ananian) [20:50:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284834 (owner: 10C. Scott Ananian) [20:50:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284832 (https://phabricator.wikimedia.org/T319058) (owner: 10C. Scott Ananian) [20:50:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284837 (https://phabricator.wikimedia.org/T425731) (owner: 10C. Scott Ananian) [20:51:10] vriley@cumin1003 reimage (PID 1214325) is awaiting input [20:53:47] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [20:53:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1021.eqiad.wmnet with OS trixie [20:53:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11900686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host pc1021.eqiad.wmnet with OS trixie completed: - pc1021 (**PASS**) - Remov... 
[20:54:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:55:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:55:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:57:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:58:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260507T2100) [21:00:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:01:48] looks like the wikibase patch is failing. 
[21:02:17] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:02:20] (03Merged) 10jenkins-bot: Upgrading webonyx/graphql-php (v15.31.5 => v15.32.3) [vendor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284828 (owner: 10C. Scott Ananian) [21:02:27] (03CR) 10CI reject: [V:04-1] composer.json: Update webonyx/graphql-php to ^15.32.3 [extensions/Wikibase] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284834 (owner: 10C. Scott Ananian) [21:04:30] (03CR) 10C. Scott Ananian: [C:03+2] composer.json: Update webonyx/graphql-php to ^15.32.3 [extensions/Wikibase] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284834 (owner: 10C. Scott Ananian) [21:05:17] subbu: that's probably the best one to fail, since it won't cascade and cancel all the others [21:05:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:06:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:07:46] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a2 [vendor] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284832 (https://phabricator.wikimedia.org/T319058) (owner: 10C. 
Scott Ananian) [21:08:27] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a2 [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284837 (https://phabricator.wikimedia.org/T425731) (owner: 10C. Scott Ananian) [21:09:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [21:10:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284834 (owner: 10C. Scott Ananian) [21:11:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:12:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:13:42] (03PS1) 10Dduvall: zuul: Fix launcher connection section in zuul.conf [puppet] - 10https://gerrit.wikimedia.org/r/1284866 [21:15:14] (03Merged) 10jenkins-bot: composer.json: Update webonyx/graphql-php to ^15.32.3 [extensions/Wikibase] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1284834 (owner: 10C. 
Scott Ananian) [21:15:22] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -4d 7h 20m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [21:15:34] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1284828|Upgrading webonyx/graphql-php (v15.31.5 => v15.32.3)]], [[gerrit:1284834|composer.json: Update webonyx/graphql-php to ^15.32.3]], [[gerrit:1284832|Bump wikimedia/parsoid to 0.24.0-a2 (T319058 T368724 T373384 T420336 T423241 T423701 T424446 T424773 T425008 T425056 T425107 T425731)]], [[gerrit:1284837|Bump wikimedia/parsoid to 0.24.0-a2 (T425731)]] [21:15:52] ok, let's goooo [21:16:13] T319058: ParserTests defaults to wgAllowExternalImages=true - https://phabricator.wikimedia.org/T319058 [21:16:13] T368724: Rendering diff on broken link with template (visual diff testing) - https://phabricator.wikimedia.org/T368724 [21:16:14] T373384: Parsoid doesn't properly handle double-underscore magic words - https://phabricator.wikimedia.org/T373384 [21:16:14] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [21:16:15] T423241: Thumbtime for TimedMedia no longer working - https://phabricator.wikimedia.org/T423241 [21:16:15] T423701: Serialize ContentHolder (or at least its fragments) in ParserOutput - https://phabricator.wikimedia.org/T423701 [21:16:16] T424446: Space added for no reason - https://phabricator.wikimedia.org/T424446 [21:16:16] T424773: __NOCONTENTCONVERT__ is not honored in Parsoid - https://phabricator.wikimedia.org/T424773 [21:16:16] T425008: WrapSectionState crash: Wikimedia\Assert\InvariantException: Invariant failed: Expected only language variants to be missing about ids. 
- https://phabricator.wikimedia.org/T425008 [21:16:17] T425056: 1.46.0-wmf.26 broke itwiki's Template:Divisa calcio - https://phabricator.wikimedia.org/T425056 [21:16:17] T425107: __TOC__ not showing up in Parsoid on Norwegian Bokmål village pump pages - https://phabricator.wikimedia.org/T425107 [21:16:18] T425731: CTT tasks week of 2026-05-01 - https://phabricator.wikimedia.org/T425731 [21:16:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:17:17] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:28] !log cscott@deploy1003 cscott: Backport for [[gerrit:1284828|Upgrading webonyx/graphql-php (v15.31.5 => v15.32.3)]], [[gerrit:1284834|composer.json: Update webonyx/graphql-php to ^15.32.3]], [[gerrit:1284832|Bump wikimedia/parsoid to 0.24.0-a2 (T319058 T368724 T373384 T420336 T423241 T423701 T424446 T424773 T425008 T425056 T425107 T425731)]], [[gerrit:1284837|Bump wikimedia/parsoid to 0.24.0-a2 (T425731)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:19:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:20:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:22:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:23:23] !log cscott@deploy1003 cscott: Continuing with deployment [21:24:06] (03CR) 10Dzahn: [C:03+2] zuul: Fix launcher connection section in zuul.conf [puppet] - 10https://gerrit.wikimedia.org/r/1284866 (owner: 10Dduvall) [21:24:21] (03PS1) 10Dduvall: zuul: Support http proxy configuration for zuul-web [puppet] - 10https://gerrit.wikimedia.org/r/1284869 [21:25:28] (03CR) 10Dzahn: [C:03+2] "oof, yea, that looks like copy/paste issue. sorry about that" [puppet] - 10https://gerrit.wikimedia.org/r/1284866 (owner: 10Dduvall) [21:26:24] (03CR) 10Dduvall: "I think it was from my patchset actually. :) Sorry!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1284866 (owner: 10Dduvall) [21:27:34] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284828|Upgrading webonyx/graphql-php (v15.31.5 => v15.32.3)]], [[gerrit:1284834|composer.json: Update webonyx/graphql-php to ^15.32.3]], [[gerrit:1284832|Bump wikimedia/parsoid to 0.24.0-a2 (T319058 T368724 T373384 T420336 T423241 T423701 T424446 T424773 T425008 T425056 T425107 T425731)]], [[gerrit:1284837|Bump wikimedia/parsoid to 0.24.0-a2 (T425731)]] (duration: 12m 00s) [21:27:54] T319058: ParserTests defaults to wgAllowExternalImages=true - https://phabricator.wikimedia.org/T319058 [21:27:54] T368724: Rendering diff on broken link with template (visual diff testing) - https://phabricator.wikimedia.org/T368724 [21:27:54] T373384: Parsoid doesn't properly handle double-underscore magic words - https://phabricator.wikimedia.org/T373384 [21:27:55] T420336: mw-parsoid improvements - https://phabricator.wikimedia.org/T420336 [21:27:55] T423241: Thumbtime for TimedMedia no longer working - https://phabricator.wikimedia.org/T423241 [21:27:57] T423701: Serialize ContentHolder (or at least its fragments) in ParserOutput - https://phabricator.wikimedia.org/T423701 [21:27:57] T424446: Space added for no reason - https://phabricator.wikimedia.org/T424446 [21:27:57] T424773: __NOCONTENTCONVERT__ is not honored in Parsoid - https://phabricator.wikimedia.org/T424773 [21:27:58] T425008: WrapSectionState crash: Wikimedia\Assert\InvariantException: Invariant failed: Expected only language variants to be missing about ids. 
- https://phabricator.wikimedia.org/T425008 [21:27:58] T425056: 1.46.0-wmf.26 broke itwiki's Template:Divisa calcio - https://phabricator.wikimedia.org/T425056 [21:27:59] T425107: __TOC__ not showing up in Parsoid on Norwegian Bokmål village pump pages - https://phabricator.wikimedia.org/T425107 [21:27:59] T425731: CTT tasks week of 2026-05-01 - https://phabricator.wikimedia.org/T425731 [21:28:01] ok, done! [21:29:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:30:40] (03CR) 10Dzahn: [C:03+2] zuul: Support http proxy configuration for zuul-web [puppet] - 10https://gerrit.wikimedia.org/r/1284869 (owner: 10Dduvall) [21:30:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:32:48] (03CR) 10JHathaway: "per our discussion, I think this is fine as a temporary way to get these hosts provisioned, but I am a bit wary of merging as is." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [21:40:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:41:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:44:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:45:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:45:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:48:23] (03PS2) 10Cwhite: add unknown_data_type to normalized object [software/ecs] - 10https://gerrit.wikimedia.org/r/1283814 [21:49:12] (03CR) 10Cwhite: [C:03+2] add unknown_data_type to normalized object [software/ecs] - 10https://gerrit.wikimedia.org/r/1283814 (owner: 10Cwhite) [21:49:34] (03Merged) 10jenkins-bot: add unknown_data_type to normalized object [software/ecs] - 10https://gerrit.wikimedia.org/r/1283814 (owner: 10Cwhite) [21:50:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:51:58] 
RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:52:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:59:55] (03CR) 10Cwhite: [C:03+2] add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) (owner: 10Cwhite) [22:00:05] (03CR) 10CI reject: [V:04-1] add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) (owner: 10Cwhite) [22:00:17] (03PS6) 10Cwhite: add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) [22:00:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:00:36] (03CR) 10Cwhite: add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) (owner: 10Cwhite) [22:01:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:04:15] (03CR) 10Cwhite: [C:03+2] add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) (owner: 10Cwhite) [22:04:41] (03Merged) 10jenkins-bot: add query object [software/ecs] - 10https://gerrit.wikimedia.org/r/1280737 (https://phabricator.wikimedia.org/T423986) (owner: 10Cwhite) [22:05:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, 
wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:07:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:09:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:11:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:11:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:12:22] (PS1) Aleksandar Mastilovic: Add x_wmf_ratelimit_class and x_trusted_request to Turnilo [deployment-charts] - https://gerrit.wikimedia.org/r/1284876 (https://phabricator.wikimedia.org/T419736) [22:14:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:14:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:15:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:16:12] SRE: Update roll-restart-reboot-brokers.py to display broker id and FQDN of the broker - https://phabricator.wikimedia.org/T425747 (jasmine_) NEW [22:16:41] SRE: Update roll-restart-reboot-brokers.py to display broker id and
FQDN of the broker - https://phabricator.wikimedia.org/T425747#11901076 (jasmine_) [22:16:47] !log amastilovic@deploy1003 Started deploy [analytics/refinery@b38efb1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@b38efb19] [22:17:30] SRE: Update roll-restart-reboot-brokers.py to display broker id and FQDN of the broker - https://phabricator.wikimedia.org/T425747#11901078 (jasmine_) [22:17:32] SRE, Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores: Upgrade Kafka to version 3.x - https://phabricator.wikimedia.org/T416669#11901080 (jasmine_) [22:18:42] !log amastilovic@deploy1003 Finished deploy [analytics/refinery@b38efb1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@b38efb19] (duration: 01m 55s) [22:19:10] !log amastilovic@deploy1003 Started deploy [analytics/refinery@b38efb1]: Regular analytics weekly train [analytics/refinery@b38efb19] [22:19:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:19:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:20:17] FIRING: ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:21:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:21:52] (PS5) Cwhite: logstash: import ecs 1.11.0-8 template file [puppet] - https://gerrit.wikimedia.org/r/1283683 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli) [22:21:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:23:02] !log amastilovic@deploy1003 Finished deploy [analytics/refinery@b38efb1]: Regular analytics weekly train [analytics/refinery@b38efb19] (duration: 03m 52s) [22:23:08] !log amastilovic@deploy1003 Started deploy [analytics/refinery@b38efb1] (thin): Regular analytics weekly train THIN [analytics/refinery@b38efb19] [22:24:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:25:02] !log amastilovic@deploy1003 Finished deploy [analytics/refinery@b38efb1] (thin): Regular analytics weekly train THIN [analytics/refinery@b38efb19] (duration: 01m 53s) [22:25:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:27:12] (CR) Cwhite: [C:+2] logstash: import ecs 1.11.0-8 template file [puppet] - https://gerrit.wikimedia.org/r/1283683 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli) [22:29:49] FIRING: DiskSpace: Disk space build2001:9100:/ 0.5991% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 -
https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [22:29:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:30:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:31:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:33:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:34:30] (PS3) Cwhite: logstash: filter_on_templates - handle unknown data types [puppet] - https://gerrit.wikimedia.org/r/1283810 [22:35:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:36:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:37:41] (CR) Cwhite: [C:+2] logstash: filter_on_templates - handle unknown data types [puppet] - https://gerrit.wikimedia.org/r/1283810 (owner: Cwhite) [22:44:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet,
wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:46:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:48:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:49:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:53:33] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host pc1022.eqiad.wmnet with OS trixie [22:53:40] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11901126 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host pc1022.eqiad.wmnet with OS trixie [22:54:25] (PS14) Cwhite: logstash: add thanos-query-frontend filter [puppet] - https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli) [22:56:06] (PS9) Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) [22:58:00] (CR) Cwhite: [C:+1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli) [23:00:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:02:18] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:05:27] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1022.eqiad.wmnet with reason: host reimage [23:08:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:09:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1022.eqiad.wmnet with reason: host reimage [23:09:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1021.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:10:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:13:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:15:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:16:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:17:56] (PS1) Codename Noreste: Completely disable MediaWiki page patrolling functions on German
Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1284900 (https://phabricator.wikimedia.org/T316393) [23:19:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:21:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:23:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:24:28] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [23:24:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [23:25:21] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [23:25:22] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1022.eqiad.wmnet with OS trixie [23:25:34] ops-eqiad, SRE, Data-Persistence,
DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11901189 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host pc1022.eqiad.wmnet with OS trixie completed: - pc1022 (**PASS**) - Remov... [23:25:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:26:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:33:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:35:17] RESOLVED: ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:35:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:39:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:40:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1284902 [23:40:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1284902 (owner: TrainBranchBot) [23:41:22]
PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:41:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:41:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:42:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:44:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:45:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:46:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:50:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [23:50:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1284902 (owner: TrainBranchBot) [23:51:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:51:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:55:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:55:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:56:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:57:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal