[00:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1160339 [00:08:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1160339 (owner: 10TrainBranchBot) [00:12:21] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp30[66-81].esams.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [00:12:26] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [00:14:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P78270 and previous config saved to /var/cache/conftool/dbconfig/20250618-001408-ladsgroup.json [00:18:28] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:26:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [00:26:52] (03Merged) 10jenkins-bot: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [00:27:31] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1154140|multiversion: Re-use prod for beta 
setSiteInfoForWiki (T289318)]] [00:27:36] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [00:29:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P78271 and previous config saved to /var/cache/conftool/dbconfig/20250618-002915-ladsgroup.json [00:29:22] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1160339 (owner: 10TrainBranchBot) [00:29:46] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1154140|multiversion: Re-use prod for beta setSiteInfoForWiki (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:30:49] !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on durum7003.magru.wmnet with reason: insetup host; will resolve service errors later [00:33:56] !log krinkle@deploy1003 krinkle: Continuing with sync [00:39:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:40:54] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154140|multiversion: Re-use prod for beta setSiteInfoForWiki (T289318)]] (duration: 13m 23s) [00:40:59] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [00:44:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T382778)', diff saved to https://phabricator.wikimedia.org/P78272 and previous config saved to /var/cache/conftool/dbconfig/20250618-004423-ladsgroup.json [00:44:27] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on db2195.codfw.wmnet with reason: Maintenance [00:44:28] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [00:44:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T382778)', diff saved to https://phabricator.wikimedia.org/P78273 and previous config saved to /var/cache/conftool/dbconfig/20250618-004434-ladsgroup.json [00:47:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T382778)', diff saved to https://phabricator.wikimedia.org/P78274 and previous config saved to /var/cache/conftool/dbconfig/20250618-004745-ladsgroup.json [00:51:06] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [00:52:15] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [01:02:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P78275 and previous config saved to /var/cache/conftool/dbconfig/20250618-010253-ladsgroup.json [01:18:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P78276 and previous config saved to /var/cache/conftool/dbconfig/20250618-011800-ladsgroup.json [01:33:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T382778)', diff saved to https://phabricator.wikimedia.org/P78277 and previous config saved to /var/cache/conftool/dbconfig/20250618-013307-ladsgroup.json [01:33:12] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [01:33:23] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2198.codfw.wmnet with reason: Maintenance [01:39:27] (03PS1) 10Krinkle: varnish: Set X-Analytics `ismobile=1` for mobile requests [puppet] - 
https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) [02:03:50] (PS2) Krinkle: varnish: Set X-Analytics `ismobile=1` for mobile requests [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) [02:03:52] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: Krinkle) [02:30:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [02:35:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [03:31:12] (CR) Krinkle: "Deployed in Beta Cluster." 
[puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: Krinkle) [03:35:32] SRE, Domains, Traffic, Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10926409 (Krinkle) [03:58:35] (PS1) Krinkle: beta: Remove unused beta-specific "w.beta.wmcloud.org" vhost [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) [04:00:15] (PS2) Krinkle: beta: Remove unused beta-specific "w.beta.wmcloud.org" vhost [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) [04:00:17] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) (owner: Krinkle) [04:00:34] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) (owner: Krinkle) [04:03:18] (CR) Tim Starling: [C:+1] varnish: Set X-Analytics `ismobile=1` for mobile requests [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: Krinkle) [04:03:28] FIRING: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:04:08] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts cirrussearch[2055-2060].codfw.wmnet [04:09:19] ryankemper@cumin2002 decommission (PID 2013349) is awaiting input [04:18:28] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:22:09] ryankemper@cumin2002 decommission (PID 
2013349) is awaiting input [04:34:39] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [04:39:00] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[2055-2060].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [04:39:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:40:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[2055-2060].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [04:40:02] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [04:40:03] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[2055-2060].codfw.wmnet [04:44:01] !log [WDQS] Restarted blazegraph on `wdqs2009` just in case it's locked up [04:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:57:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2203.codfw.wmnet with reason: Maintenance [04:57:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1201 with weight 0 T397198', diff saved 
to https://phabricator.wikimedia.org/P78278 and previous config saved to /var/cache/conftool/dbconfig/20250618-045741-marostegui.json [04:57:46] T397198: Switchover s6 master (db1173 -> db1201) - https://phabricator.wikimedia.org/T397198 [04:57:50] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s6 T397198 [04:58:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1201 from API/vslow/dump T397198', diff saved to https://phabricator.wikimedia.org/P78279 and previous config saved to /var/cache/conftool/dbconfig/20250618-045821-marostegui.json [04:58:47] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1201 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1160155 (https://phabricator.wikimedia.org/T397198) (owner: 10Gerrit maintenance bot) [05:04:31] RESOLVED: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:07:38] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:07:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:08:38] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:09:28] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - 
Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:09:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54082 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:09:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:16:02] PROBLEM - mailman3_queue_size on lists1004 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 33 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [05:18:01] !log Starting s6 eqiad failover from db1173 to db1201 - T397198 [05:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:06] T397198: Switchover s6 master (db1173 -> db1201) - https://phabricator.wikimedia.org/T397198 [05:18:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T397198', diff saved to https://phabricator.wikimedia.org/P78281 and previous config saved to /var/cache/conftool/dbconfig/20250618-051812-root.json [05:18:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1201 to s6 primary and set section read-write T397198', diff saved to https://phabricator.wikimedia.org/P78282 and previous config saved to /var/cache/conftool/dbconfig/20250618-051836-root.json [05:18:39] marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[05:18:58] (03PS2) 10Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1160156 (https://phabricator.wikimedia.org/T397198) [05:19:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1173 T397198', diff saved to https://phabricator.wikimedia.org/P78283 and previous config saved to /var/cache/conftool/dbconfig/20250618-051935-root.json [05:19:50] (03CR) 10Marostegui: [C:03+2] wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1160156 (https://phabricator.wikimedia.org/T397198) (owner: 10Gerrit maintenance bot) [05:19:53] !log marostegui@dns1006 START - running authdns-update [05:20:47] !log marostegui@dns1006 END - running authdns-update [05:21:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1173.eqiad.wmnet with reason: Maintenance [05:22:43] (03PS1) 10Marostegui: db1173: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160465 (https://phabricator.wikimedia.org/T395989) [05:23:32] (03CR) 10Marostegui: [C:03+2] db1173: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160465 (https://phabricator.wikimedia.org/T395989) (owner: 10Marostegui) [05:26:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [05:26:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T396130)', diff saved to https://phabricator.wikimedia.org/P78284 and previous config saved to /var/cache/conftool/dbconfig/20250618-052645-marostegui.json [05:26:50] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:30:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78285 and previous config saved to /var/cache/conftool/dbconfig/20250618-053038-root.json 
[05:32:45] (PS1) Marostegui: db1188: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160466 (https://phabricator.wikimedia.org/T396549) [05:32:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1188', diff saved to https://phabricator.wikimedia.org/P78286 and previous config saved to /var/cache/conftool/dbconfig/20250618-053253-root.json [05:33:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1188.eqiad.wmnet with reason: Maintenance [05:33:37] (CR) Marostegui: [C:+2] db1188: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160466 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui) [05:34:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160128 (https://phabricator.wikimedia.org/T395031) (owner: KartikMistry) [05:38:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78287 and previous config saved to /var/cache/conftool/dbconfig/20250618-053858-root.json [05:42:37] (PS1) Marostegui: db2160: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160470 (https://phabricator.wikimedia.org/T397161) [05:43:09] (CR) Arnaudb: [C:+1] gitlab-runner: upgrade default image to bookworm [puppet] - https://gerrit.wikimedia.org/r/1160119 (https://phabricator.wikimedia.org/T384595) (owner: Jelto) [05:43:27] (CR) Arnaudb: [C:+1] gitlab-runner: upgrade default image to bookworm on Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/1160120 (https://phabricator.wikimedia.org/T384595) (owner: Jelto) [05:45:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P78288 and previous config saved to /var/cache/conftool/dbconfig/20250618-054543-root.json [05:47:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2160.codfw.wmnet with reason: Maintenance [05:47:26] (03CR) 10Marostegui: [C:03+2] db2160: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160470 (https://phabricator.wikimedia.org/T397161) (owner: 10Marostegui) [05:50:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T396130)', diff saved to https://phabricator.wikimedia.org/P78289 and previous config saved to /var/cache/conftool/dbconfig/20250618-055023-marostegui.json [05:50:28] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:51:54] 07sre-alert-triage, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10926519 (10Stevemunene) a:03Stevemunene [05:54:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P78290 and previous config saved to /var/cache/conftool/dbconfig/20250618-055404-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0600) [06:00:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78291 and previous config saved to /var/cache/conftool/dbconfig/20250618-060049-root.json [06:05:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P78292 and previous config saved to /var/cache/conftool/dbconfig/20250618-060531-marostegui.json [06:06:42] RESOLVED: 
JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P78293 and previous config saved to /var/cache/conftool/dbconfig/20250618-060910-root.json [06:10:14] (03PS1) 10Phuedx: ext.wikimediaEvents: Repurpose PageVisit instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160475 (https://phabricator.wikimedia.org/T397138) [06:10:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160475 (https://phabricator.wikimedia.org/T397138) (owner: 10Phuedx) [06:12:22] (03PS1) 10Giuseppe Lavagetto: requestctl: switch CLI from native client to API client [puppet] - 10https://gerrit.wikimedia.org/r/1160476 [06:14:36] (03PS1) 10Giuseppe Lavagetto: Add stub api tokens for hiddenparma [labs/private] - 10https://gerrit.wikimedia.org/r/1160477 [06:14:37] (03CR) 10CI reject: [V:04-1] requestctl: switch CLI from native client to API client [puppet] - 10https://gerrit.wikimedia.org/r/1160476 (owner: 10Giuseppe Lavagetto) [06:15:09] I will be a few minutes late for this morning's backport window but I will be there :) [06:15:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78294 and previous config saved to /var/cache/conftool/dbconfig/20250618-061555-root.json [06:20:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to 
https://phabricator.wikimedia.org/P78295 and previous config saved to /var/cache/conftool/dbconfig/20250618-062038-marostegui.json [06:24:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78296 and previous config saved to /var/cache/conftool/dbconfig/20250618-062416-root.json [06:35:34] (03CR) 10Jelto: "looks mostly good, as mentioned before you should bump the chart version." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [06:35:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T396130)', diff saved to https://phabricator.wikimedia.org/P78297 and previous config saved to /var/cache/conftool/dbconfig/20250618-063546-marostegui.json [06:35:51] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:36:01] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [06:36:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T396130)', diff saved to https://phabricator.wikimedia.org/P78298 and previous config saved to /var/cache/conftool/dbconfig/20250618-063608-marostegui.json [06:39:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78299 and previous config saved to /var/cache/conftool/dbconfig/20250618-063921-root.json [06:46:34] (03PS5) 10Stevemunene: zookeeper: remove an-conf100[1-3] from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) [06:53:13] (03CR) 10Muehlenhoff: [C:03+2] Add cumin1003 as mysql root client [puppet] - 10https://gerrit.wikimedia.org/r/1145085 (https://phabricator.wikimedia.org/T393990) (owner: 10Muehlenhoff) 
[06:56:29] (CR) Brouberol: [C:+2] airflow: derive AIRFLOW_APPOWNER from the user principal in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160115 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol) [06:56:32] (CR) Brouberol: [C:+2] airflow-dev: increase the memory limits of the webserver in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160132 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol) [06:59:09] (Merged) jenkins-bot: airflow: derive AIRFLOW_APPOWNER from the user principal in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160115 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol) [06:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [06:59:10] (Merged) jenkins-bot: airflow-dev: increase the memory limits of the webserver in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160132 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol) [06:59:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T396130)', diff saved to https://phabricator.wikimedia.org/P78300 and previous config saved to /var/cache/conftool/dbconfig/20250618-065936-marostegui.json [06:59:41] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0700). [07:00:05] georgekyz, kart_, and phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:41] here [07:01:10] georgekyz: deploying yourself? [07:01:28] (PS1) Marostegui: db1156: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160628 (https://phabricator.wikimedia.org/T396549) [07:01:43] Yeap, I will start it in the following minutes [07:02:06] cool. Let me know when done. [07:02:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1156', diff saved to https://phabricator.wikimedia.org/P78301 and previous config saved to /var/cache/conftool/dbconfig/20250618-070239-root.json [07:02:48] urbanecm: around? Can you review my patch meanwhile? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1160128/ [07:03:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[1155-1156].eqiad.wmnet with reason: Maintenance [07:03:20] (CR) Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for the third batch of wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis) [07:03:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 10 hosts with reason: Maintenance [07:03:57] (CR) Marostegui: [C:+2] db1156: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160628 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui) [07:04:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [07:04:31] Starting deployment [07:04:34] (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [07:05:23] (03Merged) 10jenkins-bot: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [07:06:01] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1155652|ores-extension: enable extension with revertrisk filter for the third batch of wikis (T395824)]] [07:06:05] T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824 [07:08:24] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1155652|ores-extension: enable extension with revertrisk filter for the third batch of wikis (T395824)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:12:12] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-cluster check 44 services in codfw: maintenance [07:12:13] !log jayme@cumin1002 START - Cookbook sre.discovery.service-route check 48 services: maintenance [07:12:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check 48 services: maintenance [07:12:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) check 44 services in codfw: maintenance [07:12:15] Hello o/ [07:12:17] I'm back [07:12:18] (03CR) 10Jelto: "wow this is nice 🎉 I tried it locally but helmfile fails when installing istio-proxy-settings with:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [07:13:17] kart_: You deploying yourself after georgekyz? 
[07:13:49] yes [07:14:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P78302 and previous config saved to /var/cache/conftool/dbconfig/20250618-071443-marostegui.json [07:16:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P78303 and previous config saved to /var/cache/conftool/dbconfig/20250618-071634-root.json [07:17:39] (03PS1) 10Muehlenhoff: Add missing email in user record [puppet] - 10https://gerrit.wikimedia.org/r/1160635 (https://phabricator.wikimedia.org/T397004) [07:18:35] georgekyz: Testing? [07:18:58] (03CR) 10Muehlenhoff: [C:03+2] Add missing email in user record [puppet] - 10https://gerrit.wikimedia.org/r/1160635 (https://phabricator.wikimedia.org/T397004) (owner: 10Muehlenhoff) [07:19:13] Yeap we are testing the patch is deploying ores extension for 9 wikis [07:19:19] testing is taking some time apologies [07:21:58] No worries, just checking! [07:22:22] we finished testing we are going to proceed and sync [07:22:33] !log gkyziridis@deploy1003 gkyziridis: Continuing with sync [07:22:41] cool [07:23:02] (03CR) 10JMeybohm: "Right...good catch! We're configuring helmfile with `--kubeconfig` and ofc. the credentials file exists on my machine. 
It does not contain" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [07:23:57] (03PS1) 10Marostegui: db2231: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160656 (https://phabricator.wikimedia.org/T397279) [07:24:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2231', diff saved to https://phabricator.wikimedia.org/P78304 and previous config saved to /var/cache/conftool/dbconfig/20250618-072404-root.json [07:24:32] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2231.codfw.wmnet with reason: Maintenance [07:25:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2231.codfw.wmnet with reason: Maintenance [07:25:15] (03CR) 10Marostegui: [C:03+2] db2231: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160656 (https://phabricator.wikimedia.org/T397279) (owner: 10Marostegui) [07:29:36] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155652|ores-extension: enable extension with revertrisk filter for the third batch of wikis (T395824)]] (duration: 23m 35s) [07:29:41] T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824 [07:29:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P78305 and previous config saved to /var/cache/conftool/dbconfig/20250618-072951-marostegui.json [07:30:04] I'll start my patch now.. 
[07:30:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [07:30:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160128 (https://phabricator.wikimedia.org/T395031) (owner: 10KartikMistry) [07:31:04] Deployment finished successfully [07:31:15] (03Merged) 10jenkins-bot: Enable the Contribute menu on new Wikipedias automatically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160128 (https://phabricator.wikimedia.org/T395031) (owner: 10KartikMistry) [07:31:37] georgekyz: \0/ [07:31:37] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1160128|Enable the Contribute menu on new Wikipedias automatically (T395031 T381371)]] [07:31:39] thnx for your patience [07:31:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78306 and previous config saved to /var/cache/conftool/dbconfig/20250618-073140-root.json [07:31:42] thnx a lot [07:31:43] T395031: Enable the Contribute menu in 7th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T395031 [07:31:44] T381371: Enable the Contribute menu on new Wikipedias automatically - https://phabricator.wikimedia.org/T381371 [07:32:17] (03PS1) 10Filippo Giunchedi: thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) [07:33:33] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:33:53] !log kartik@deploy1003 kartik: Backport for [[gerrit:1160128|Enable the Contribute menu on new Wikipedias automatically (T395031 T381371)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:33:58] (03CR) 10Volans: "[my 2 cents] I left some general suggestions in the python file, didn't do a full review in detail, leaving the specific logic to the requ" [puppet] - 10https://gerrit.wikimedia.org/r/1160476 (owner: 10Giuseppe Lavagetto) [07:34:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:35:07] (03PS1) 10Vgutierrez: hiera: Switch lvs5005 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1160672 (https://phabricator.wikimedia.org/T396561) [07:35:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [07:35:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78307 and previous config saved to /var/cache/conftool/dbconfig/20250618-073517-root.json [07:36:28] !log kartik@deploy1003 kartik: Continuing with sync [07:36:33] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [07:36:45] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10926738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by brouberol@cumin2002 for host kafka-jumbo1018... [07:39:01] jouncebot: nowandnext [07:39:01] For the next 0 hour(s) and 20 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0700) [07:39:01] In 2 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1000) [07:40:07] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160672 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [07:41:31] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) [07:42:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:42:35] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add stub api tokens for hiddenparma [labs/private] - 10https://gerrit.wikimedia.org/r/1160477 (owner: 10Giuseppe Lavagetto) [07:43:22] !log T386098 Killed the `wdqs-main` reload, it can be started up again on the new cumin later [07:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:26] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [07:43:34] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160128|Enable the Contribute menu on new Wikipedias automatically (T395031 T381371)]] (duration: 11m 56s) [07:43:40] T395031: Enable the Contribute menu in 7th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T395031 [07:43:40] T381371: Enable the Contribute menu on new 
Wikipedias automatically - https://phabricator.wikimedia.org/T381371 [07:44:06] phuedx: I'm done. [07:44:13] kart_: ACK [07:44:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T396130)', diff saved to https://phabricator.wikimedia.org/P78308 and previous config saved to /var/cache/conftool/dbconfig/20250618-074459-marostegui.json [07:45:04] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:45:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance [07:45:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T396130)', diff saved to https://phabricator.wikimedia.org/P78309 and previous config saved to /var/cache/conftool/dbconfig/20250618-074521-marostegui.json [07:46:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160475 (https://phabricator.wikimedia.org/T397138) (owner: 10Phuedx) [07:46:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78310 and previous config saved to /var/cache/conftool/dbconfig/20250618-074646-root.json [07:47:45] (03Merged) 10jenkins-bot: ext.wikimediaEvents: Repurpose PageVisit instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160475 (https://phabricator.wikimedia.org/T397138) (owner: 10Phuedx) [07:48:12] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1160475|ext.wikimediaEvents: Repurpose PageVisit instrument (T397138)]] [07:48:16] T397138: Run a second synthetic A/A test - https://phabricator.wikimedia.org/T397138 [07:48:28] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78311 and previous config saved to /var/cache/conftool/dbconfig/20250618-075022-root.json [07:50:40] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1160475|ext.wikimediaEvents: Repurpose PageVisit instrument (T397138)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:52:12] (03PS5) 10Kosta Harlan: Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [07:52:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.082s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:52:32] (03CR) 10Fabfur: [C:03+1] "godspeed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1160672 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [07:54:20] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:57:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.082s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:57:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:57:59] Took me a little while but I've confirmed the change looks good on testwiki [07:58:04] Continuing [07:58:07] !log phuedx@deploy1003 phuedx: Continuing with sync [08:00:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [08:00:25] FIRING: SystemdUnitFailed: prometheus-puppet-agent-stats.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:01:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P78312 and previous config saved to /var/cache/conftool/dbconfig/20250618-080152-root.json [08:02:23] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2212* slowly with 10 steps - Pooling in [08:02:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:02:53] (03PS1) 10Vgutierrez: sre.loadbalancer.upgrade: Avoid depooling several LBs at the same time [cookbooks] - 10https://gerrit.wikimedia.org/r/1160680 [08:03:11] jouncebot: now [08:03:11] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [08:03:16] jouncebot: refresh [08:03:16] I refreshed my knowledge about deployments. [08:03:18] lies? [08:04:05] isn't it supposed to be the train window right now? 
[08:05:07] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160475|ext.wikimediaEvents: Repurpose PageVisit instrument (T397138)]] (duration: 16m 55s) [08:05:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [08:05:12] T397138: Run a second synthetic A/A test - https://phabricator.wikimedia.org/T397138 [08:05:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78314 and previous config saved to /var/cache/conftool/dbconfig/20250618-080528-root.json [08:05:31] hashar: The window started 1 hour ago :) [08:06:20] (03CR) 10Volans: [C:03+1] "LGTM, a dry-run should be able to confirm you the correct actions are performed (either after merging or with test-cookbook)." [cookbooks] - 10https://gerrit.wikimedia.org/r/1160680 (owner: 10Vgutierrez) [08:07:15] phuedx: that is the backport & config one which started an hour ago isn't it? [08:07:42] hashar: You're right. Sorry. I misread your message :) [08:07:46] (03CR) 10Vgutierrez: "a DRY-RUN with `--query P{lvs[7001-7002].magru.wmnet}` confirms that admin cookbook is called with just one instance at a time:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1160680 (owner: 10Vgutierrez) [08:08:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T396130)', diff saved to https://phabricator.wikimedia.org/P78316 and previous config saved to /var/cache/conftool/dbconfig/20250618-080833-marostegui.json [08:08:34] wt:Deployments says it's the UTC-7 version this week? 
https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1800 [08:08:38] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:08:56] !log UTC morning backport window finished [08:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:51] I guess it got confused somehow [08:11:30] (03CR) 10Giuseppe Lavagetto: "As stated in the comments to the puppet class, I wasn't requesting a review of this code here, given it is a copy from another repository." [puppet] - 10https://gerrit.wikimedia.org/r/1160476 (owner: 10Giuseppe Lavagetto) [08:13:27] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [08:16:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78317 and previous config saved to /var/cache/conftool/dbconfig/20250618-081657-root.json [08:20:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.077s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:20:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78319 and previous config saved to /var/cache/conftool/dbconfig/20250618-082035-root.json [08:21:11] !log rearm keyholder on cumin2002 [08:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:15] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10926885 (10Fabfur) 
@Jhancock.wm hi, when do you think we could start reimaging these? Is there something we can do in the meantime to help you with this? [08:21:33] jouncebot: refresh [08:21:34] I refreshed my knowledge about deployments. [08:21:37] jouncebot: now [08:21:37] No deployments scheduled for the next 1 hour(s) and 38 minute(s) [08:21:40] ... [08:21:53] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10926886 (10elukey) @MatthewVernon @MoritzMuehlenhoff I am planning to do the following: * log on thanos-fe1004 * sudo su;... [08:22:18] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202827s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [08:22:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2001.codfw.wmnet [08:22:58] (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer.upgrade: Avoid depooling several LBs at the same time [cookbooks] - 10https://gerrit.wikimedia.org/r/1160680 (owner: 10Vgutierrez) [08:23:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P78320 and previous config saved to /var/cache/conftool/dbconfig/20250618-082340-marostegui.json [08:24:45] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet [08:24:46] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5005.eqsin.wmnet} and A:liberica (T396561) [08:24:52] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [08:25:11] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5005.eqsin.wmnet} and A:liberica (T396561) [08:25:14] (03PS1) 10Elukey: 
role::maps::master: fix Tegola container name [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) [08:25:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.077s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:25:39] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs5005.eqsin.wmnet with reason: switching to katran [08:25:40] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs5005 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1160672 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [08:26:42] (03CR) 10Elukey: [V:03+1 C:03+2] profile::pyrra: improve the istio SLOs template [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [08:26:54] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6009/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) (owner: 10Elukey) [08:27:01] (03CR) 10Elukey: [C:03+2] profile::pyrra: fix Citoid's SLO targets [puppet] - 10https://gerrit.wikimedia.org/r/1160076 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [08:27:55] RESOLVED: SystemdUnitFailed: prometheus-puppet-agent-stats.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:28:06] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host sretest2001.codfw.wmnet [08:28:28] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:28:45] (03PS1) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) [08:29:10] (03CR) 10CI reject: [V:04-1] bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [08:29:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:29:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:29:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:29:22] jouncebot: refresh [08:29:23] I refreshed my knowledge about deployments. 
[08:29:25] jouncebot: now [08:29:25] For the next 1 hour(s) and 30 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0800) [08:29:39] ah [08:30:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:31:29] (03CR) 10Stevemunene: "Just did a restart of the service and there was no issue encountered" [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [08:31:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:31:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:32:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:33:26] (03PS2) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) [08:33:35] FIRING: MailmanBounceQueueHigh: Mailman bounce queue on lists1004:9100 has more than 50 messages - https://wikitech.wikimedia.org/wiki/Mailman/Runbooks#MailmanBounceQueueHigh - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=2 - https://alerts.wikimedia.org/?q=alertname%3DMailmanBounceQueueHigh [08:33:50] (03PS10) 10Btullis: Airflow: Add local 
settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) [08:33:51] (03CR) 10CI reject: [V:04-1] bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [08:34:19] (03CR) 10Elukey: [V:03+1] "The only big change (see PCC) is related to the send_tile_invalidations systemd timer, that in codfw is currently wrongly configured :D" [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) (owner: 10Elukey) [08:34:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:34:58] (03PS11) 10Brouberol: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [08:35:14] (03PS3) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) [08:35:39] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10926934 (10MatthewVernon) FWIW, I use `sudo bash ; . /etc/swift/accountfile.env`, but yes. Those commands will take so... 
[08:37:50] I am running the train NOW [08:38:36] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host doh7004.wikimedia.org [08:38:38] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [08:38:41] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs5005.eqsin.wmnet [08:38:42] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs5005.eqsin.wmnet [08:38:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P78322 and previous config saved to /var/cache/conftool/dbconfig/20250618-083847-marostegui.json [08:39:40] (03PS4) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) [08:39:59] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-coord1003.eqiad.wmnet with reason: Upgrading SSD firmware [08:40:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10926952 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5ea14561-c44c-4cc5-b656-024e47b3bc03) set by btullis@cumin1003 for 1... 
[08:40:21] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet [08:40:28] (03PS1) 10Vgutierrez: hiera: Repool lvs5005 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) [08:40:33] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160694 (https://phabricator.wikimedia.org/T392176) [08:40:34] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160694 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [08:40:51] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [08:40:53] (03CR) 10CI reject: [V:04-1] hiera: Repool lvs5005 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [08:40:58] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [08:41:05] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet [08:41:24] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160694 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [08:41:55] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet [08:41:56] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7004.wikimedia.org - jmm@cumin1003" [08:42:00] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add records for VM doh7004.wikimedia.org - jmm@cumin1003" [08:42:01] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:42:01] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache doh7004.wikimedia.org on all recursors [08:42:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7004.wikimedia.org on all recursors [08:42:16] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading clouddbs T394372 [08:42:17] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet [08:42:20] T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372 [08:42:26] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet [08:42:33] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7004.wikimedia.org - jmm@cumin1003" [08:42:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7004.wikimedia.org - jmm@cumin1003" [08:43:42] (03CR) 10Brouberol: [C:03+1] Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [08:45:09] (03PS2) 10Vgutierrez: hiera: Repool lvs5005 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) [08:45:38] jmm@cumin1003 makevm (PID 2085372) is awaiting input [08:46:07] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host doh7004.wikimedia.org with OS bookworm [08:46:54] (03CR) 10Vgutierrez: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [08:47:15] (03CR) 10FNegri: [C:03+2] clouddb1016: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154808 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [08:49:50] (03PS3) 10Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) [08:50:56] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.6 refs T392176 [08:51:00] T392176: 1.45.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T392176 [08:51:14] (03PS5) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) [08:51:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10927000 (10BTullis) Hi @RobH the cookbook failed for an-coord1003 with the following error: ` btullis@cumin1003:~$ sudo cookbook sre.hardware.up... 
[08:51:55] (03CR) 10Cathal Mooney: [C:03+2] Mark HTTP(S) traffic from dumps with low-priority QoS mark [puppet] - 10https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [08:53:29] (03PS6) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) [08:53:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T396130)', diff saved to https://phabricator.wikimedia.org/P78324 and previous config saved to /var/cache/conftool/dbconfig/20250618-085354-marostegui.json [08:54:00] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:54:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1195.eqiad.wmnet with reason: Maintenance [08:54:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T396130)', diff saved to https://phabricator.wikimedia.org/P78325 and previous config saved to /var/cache/conftool/dbconfig/20250618-085417-marostegui.json [08:54:48] (03CR) 10Giuseppe Lavagetto: "I've made a patch integrating some suggestions in the HIDDEPARMA repository, and merged it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1160476 (owner: 10Giuseppe Lavagetto) [08:55:33] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [08:57:48] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs5005 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [08:58:06] (03CR) 10Effie Mouzeli: [C:03+1] shellbox: migrate to bookworm-based httpd image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [08:59:16] (03PS1) 10Cathal Mooney: Mark outbound rsync traffic from clouddumps as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) [09:00:50] !log btullis@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet [09:01:02] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet [09:01:22] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10927019 (10cmooney) >>! In T397153#10925689, @xcollazo wrote: > Should we also mark rsync traffic as low-priority then? Hmm yeah it might not be a... [09:01:51] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10927020 (10cmooney) FWIW the change to mark the HTTP traffic is in place and working ` cmooney@clouddumps1002:~$ sudo iptables -v -n -t mangle -L P... 
[09:02:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:02:55] !log btullis@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet [09:03:05] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet [09:03:16] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet [09:03:32] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet [09:04:35] !log repool lvs5005 (upload) using katran - T396561 [09:04:39] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5005.eqsin.wmnet} and A:liberica (T396561) [09:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:40] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [09:04:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5005.eqsin.wmnet} and A:liberica (T396561) [09:04:59] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [09:05:46] vgutierrez: \o/ [09:05:57] elukey: <3 [09:07:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:08:28] (03PS2) 10Giuseppe Lavagetto: requestctl: switch CLI from native client to API client [puppet] - 10https://gerrit.wikimedia.org/r/1160476 [09:09:12] (03CR) 10FNegri: [C:03+2] clouddb1020: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154809 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [09:10:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.458s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:11:21] !log jmm@cumin1003 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox [09:11:24] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [09:11:27] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [09:11:28] (03PS1) 10Hnowlan: changeprop: bump concurrency for pcs_rerender_native_on_null [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160704 (https://phabricator.wikimedia.org/T397072) [09:11:56] (03PS4) 10Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) [09:12:28] (03CR) 10Elukey: [C:03+2] "Added it, makes total sense!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [09:12:56] (03CR) 10Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [09:13:58] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on doh7004.wikimedia.org with reason: host reimage [09:15:12] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [09:15:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.053s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:15:40] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [09:15:43] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [09:17:27] (03PS3) 10Giuseppe Lavagetto: requestctl: switch CLI from native client to API client [puppet] - 10https://gerrit.wikimedia.org/r/1160476 [09:17:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T396130)', diff saved to https://phabricator.wikimedia.org/P78327 and previous config saved to /var/cache/conftool/dbconfig/20250618-091738-marostegui.json [09:17:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:17:49] (03CR) 10FNegri: [C:03+1] "SGTM, but I have a limited understanding on the actual usage of these hosts. 
It would be interesting to monitor if the bulk of the traffic" [puppet] - 10https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [09:18:07] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7004.wikimedia.org with reason: host reimage [09:18:28] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:32] (03PS1) 10Giuseppe Lavagetto: Use an actual user for the fake api tokens [labs/private] - 10https://gerrit.wikimedia.org/r/1160706 [09:18:42] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Use an actual user for the fake api tokens [labs/private] - 10https://gerrit.wikimedia.org/r/1160706 (owner: 10Giuseppe Lavagetto) [09:18:53] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading clouddbs T394372 [09:18:58] T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372 [09:19:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:19:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:19:17] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:19:48] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS 
(CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6011/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160476 (owner: 10Giuseppe Lavagetto) [09:20:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:21:11] 06SRE, 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10927063 (10elukey) >>! In T391852#10922168, @Mvolz wrote: >>>! In T391852#10919212, @elukey wrote: >> I am reopening this t... [09:21:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:21:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:22:45] (03CR) 10Jcrespo: "Alex, can I ask you for a review? 
The director code will need a cleanup afterwards, but I want to first do the migration and remove backup" [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [09:22:55] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:24:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.209s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:24:20] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet [09:24:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:26:30] (03CR) 10Jgiannelos: [C:03+1] changeprop: bump concurrency for pcs_rerender_native_on_null [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160704 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [09:28:14] !log jmm@cumin1003 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox [09:29:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.206s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:29:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10927084 (10MoritzMuehlenhoff) [09:29:33] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [09:29:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10927085 (10MoritzMuehlenhoff) [09:30:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.07%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [09:32:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P78329 and previous config saved to /var/cache/conftool/dbconfig/20250618-093245-marostegui.json [09:34:26] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7004.wikimedia.org with OS bookworm [09:34:26] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7004.wikimedia.org [09:35:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.07%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [09:38:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160171 (owner: 10Muehlenhoff) [09:39:39] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [09:40:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes 
(http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:40:30] (03CR) 10Muehlenhoff: [C:03+1] "Looks good (based on https://phabricator.wikimedia.org/T396584#10922024)" [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) (owner: 10Elukey) [09:40:48] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netboxdb2003.codfw.wmnet [09:44:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2003.codfw.wmnet [09:46:08] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [09:47:33] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927145 (10BTullis) Hello, just to let you know, I'm now trying the same operation on an-coord1003 T394499#10927000 and getting the same error as @RobH ab... 
[09:47:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P78331 and previous config saved to /var/cache/conftool/dbconfig/20250618-094752-marostegui.json [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:53:41] (03PS5) 10JMeybohm: kind.sh can bootstrap a wikikube like cluster with kind [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) [09:54:08] (03CR) 10JMeybohm: "Unfortunately it's not, due to https://github.com/helmfile/helmfile/issues/2084" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [09:55:13] (03CR) 10Kosta Harlan: [C:03+1] Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [09:55:20] jouncebot: nowandnext [09:55:21] For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0800) [09:55:21] In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1000) [09:55:40] hnowlan: infra window will include codfw depool of wikikube btw [09:56:17] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [09:56:24] (03CR) 10Elukey: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar) [09:56:33] 
RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:41] (03CR) 10Hnowlan: [C:03+2] changeprop: bump concurrency for pcs_rerender_native_on_null [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160704 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [09:57:44] (03CR) 10Elukey: "I don't have a strong preference, but what is the advantage of having it in the new version (to help me understanding the change better) ?" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156733 (owner: 10Hashar) [09:58:27] (03Merged) 10jenkins-bot: changeprop: bump concurrency for pcs_rerender_native_on_null [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160704 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1000) [10:00:05] jayme, Raine, and claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[10:02:42] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:02:51] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:02:55] (03PS1) 10Effie Mouzeli: mediawiki-common: remove disk-type affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160719 [10:03:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T396130)', diff saved to https://phabricator.wikimedia.org/P78333 and previous config saved to /var/cache/conftool/dbconfig/20250618-100300-marostegui.json [10:03:05] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:03:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance [10:03:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:03:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T396130)', diff saved to https://phabricator.wikimedia.org/P78335 and previous config saved to /var/cache/conftool/dbconfig/20250618-100329-marostegui.json [10:04:14] topranks: _joe_: We're going to depool wikikube codfw for around an hour as a precautionary test for the upcoming kubernetes upgrade [10:04:17] (03PS1) 10Marostegui: db2191: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160722 (https://phabricator.wikimedia.org/T397279) [10:04:35] jayme: wow we are upgrading? [10:04:49] (03PS2) 10Effie Mouzeli: mediawiki-common: remove disk-type affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160719 [10:04:54] jayme: ok thanks, is that different from the depool c.laime mentioned in -sre ? 
[10:05:07] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:05:12] topranks: lol, no - sorry [10:05:21] x) [10:05:27] ha no worries it sounded the same I was just double-checking [10:05:35] thanks for letting us know :) [10:05:36] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:06:11] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:06:37] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:07:45] I'm done with changeprop, go ahead [10:07:48] (03PS1) 10Vgutierrez: hiera: Switch lvs5004 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1160723 (https://phabricator.wikimedia.org/T396561) [10:08:15] hnowlan: ack, thanks [10:08:51] (03CR) 10Clément Goubert: [C:03+1] mediawiki-common: remove disk-type affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160719 (owner: 10Effie Mouzeli) [10:09:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160723 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [10:10:16] (03PS1) 10Slyngshede: Update requirements for netbox 4.0.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160724 [10:10:33] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-cluster depool 44 services in codfw/codfw: pre-upgrade-test [10:10:33] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-cluster (exit_code=99) depool 44 services in codfw/codfw: pre-upgrade-test [10:12:38] (03PS2) 10Slyngshede: Update requirements for netbox 4.0.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160724 (https://phabricator.wikimedia.org/T397300) [10:13:32] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159470 (owner: 10PipelineBot) [10:14:49] !log starting backup director migration backup1001 -> backup1014 T387892 [10:14:53] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:53] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [10:14:57] (03CR) 10Jcrespo: [C:03+2] bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [10:16:06] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159470 (owner: 10PipelineBot) [10:16:28] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netboxdb1003.eqiad.wmnet [10:18:27] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2212* slowly with 10 steps - Pooling in [10:18:36] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10927316 (10MatthewVernon) Silly question while I'm here - do you need 2 buckets, each of which ends up replicated cros... 
[10:20:22] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1003.eqiad.wmnet [10:20:27] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup[1001,1014].eqiad.wmnet with reason: Backup director migration [10:21:25] FIRING: SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:24] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [10:26:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T396130)', diff saved to https://phabricator.wikimedia.org/P78337 and previous config saved to /var/cache/conftool/dbconfig/20250618-102655-marostegui.json [10:27:01] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:28:42] FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good (once the NDA is completed)" [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) (owner: 10Herron) [10:31:01] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2024.codfw.wmnet with reason: remove for decom [10:32:54] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927411 (10Volans) @BTullis the SSD upgrade is 
a type of its own, not STORAGE, so the files must be in `/srv/firmware/poweredge-r440/SSD`. If you use that... [10:33:51] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1160723 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [10:35:28] jouncebot: nowandnext [10:35:28] For the next 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1000) [10:35:28] In 0 hour(s) and 24 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1100) [10:35:42] (03PS5) 10Reedy: composer: Various updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160144 [10:35:50] (03CR) 10Reedy: [C:03+2] composer: Various updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160144 (owner: 10Reedy) [10:35:55] (03PS4) 10Reedy: Setup json linting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160151 (https://phabricator.wikimedia.org/T397191) [10:36:01] (03CR) 10Reedy: [C:03+2] Setup json linting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160151 (https://phabricator.wikimedia.org/T397191) (owner: 10Reedy) [10:36:40] (03Merged) 10jenkins-bot: composer: Various updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160144 (owner: 10Reedy) [10:36:51] (03Merged) 10jenkins-bot: Setup json linting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160151 (https://phabricator.wikimedia.org/T397191) (owner: 10Reedy) [10:37:10] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Sync firmwares directory between the cumin hosts - https://phabricator.wikimedia.org/T397306 (10Volans) 03NEW p:05Triage→03Medium [10:37:16] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927437 (10Volans) Created T397306 [10:37:29] (03PS1) 10Hnowlan: changeprop: bump replicas 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1160727 (https://phabricator.wikimedia.org/T397072) [10:40:12] (03PS12) 10Reedy: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [10:40:31] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-coord1003.eqiad.wmnet with reason: Upgrading SSD firmware [10:40:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2191', diff saved to https://phabricator.wikimedia.org/P78338 and previous config saved to /var/cache/conftool/dbconfig/20250618-104033-root.json [10:40:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10927450 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=49f11e46-f52a-4db8-a2cc-7688a3599023) set by btullis@cumin1003 for 1... 
[10:40:45] (03CR) 10Reedy: [C:03+2] Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [10:40:50] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet [10:40:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2191.codfw.wmnet with reason: Maintenance [10:41:01] RECOVERY - mailman3_queue_size on lists1004 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [10:41:49] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Sync firmwares directory between the cumin hosts - https://phabricator.wikimedia.org/T397306#10927452 (10MoritzMuehlenhoff) We could also have one seedhost on a single designated Cumin host where dc ops can write to. And then set up an rsync which syncs th... 
[10:41:49] (03Merged) 10jenkins-bot: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [10:41:50] (03CR) 10Marostegui: [C:03+2] db2191: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160722 (https://phabricator.wikimedia.org/T397279) (owner: 10Marostegui) [10:42:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P78339 and previous config saved to /var/cache/conftool/dbconfig/20250618-104203-marostegui.json [10:43:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-coord1003.eqiad.wmnet [10:43:35] RESOLVED: MailmanBounceQueueHigh: Mailman bounce queue on lists1004:9100 has more than 50 messages - https://wikitech.wikimedia.org/wiki/Mailman/Runbooks#MailmanBounceQueueHigh - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=2 - https://alerts.wikimedia.org/?q=alertname%3DMailmanBounceQueueHigh [10:45:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10927481 (10MoritzMuehlenhoff) [10:46:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78340 and previous config saved to /var/cache/conftool/dbconfig/20250618-104609-root.json [10:47:44] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [10:48:41] !log root@cumin1002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for backup1009.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [10:48:52] (03CR) 10Clément Goubert: [C:03+1] changeprop: bump replicas [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1160727 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:48:59] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1160144|composer: Various updates]], [[gerrit:1160151|Setup json linting (T397191)]], [[gerrit:1130201|Improve function and property documentation for php code (T171115)]] [10:49:04] T397191: Add JSON syntax check to mediawiki-config CI - https://phabricator.wikimedia.org/T397191 [10:49:05] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [10:49:23] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Sync firmwares directory between the cumin hosts - https://phabricator.wikimedia.org/T397306#10927495 (10Volans) That's an interesting idea that would work right now because the auto-download from the Dell website is broken, but if we fix that then any cum... [10:49:49] (03CR) 10Clément Goubert: [C:03+1] mediawiki_experimental: $kubernetes_release dir fix [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [10:49:53] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [10:50:53] (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160724 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede) [10:51:14] !log reedy@deploy1003 umherirrender, reedy: Backport for [[gerrit:1160144|composer: Various updates]], [[gerrit:1160151|Setup json linting (T397191)]], [[gerrit:1130201|Improve function and property documentation for php code (T171115)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:51:25] RESOLVED: SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:52:03] (03PS1) 10Jcrespo: bacula: Update wrong role for backup2009 [puppet] - 10https://gerrit.wikimedia.org/r/1160730 (https://phabricator.wikimedia.org/T387892) [10:52:37] !log reedy@deploy1003 umherirrender, reedy: Continuing with sync [10:52:49] !log root@cumin1002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for backup1009.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [10:53:26] (03CR) 10Jcrespo: [C:03+2] bacula: Update wrong role for backup2009 [puppet] - 10https://gerrit.wikimedia.org/r/1160730 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [10:54:10] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1003.eqiad.wmnet [10:54:11] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts an-coord1003.eqiad.wmnet [10:54:33] (03CR) 10Hnowlan: [C:03+2] "Thanks claime!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1160727 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:56:49] (03Merged) 10jenkins-bot: changeprop: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160727 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:57:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P78341 and previous config saved to /var/cache/conftool/dbconfig/20250618-105710-marostegui.json [10:58:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [10:58:41] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet [10:59:20] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160144|composer: Various updates]], [[gerrit:1160151|Setup json linting (T397191)]], [[gerrit:1130201|Improve function and property documentation for php code (T171115)]] (duration: 10m 20s) [10:59:25] T397191: Add JSON syntax check to mediawiki-config CI - https://phabricator.wikimedia.org/T397191 [10:59:26] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [10:59:59] (03PS1) 10Btullis: Revert "Failover hive and presto to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1160733 [11:00:04] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1100). 
[11:00:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:01:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78342 and previous config saved to /var/cache/conftool/dbconfig/20250618-110114-root.json [11:01:39] (03PS2) 10Btullis: Revert "Failover hive and presto to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1160733 [11:01:42] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: remove disk-type affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160719 (owner: 10Effie Mouzeli) [11:02:16] !log root@cumin1002 START - Cookbook sre.puppet.migrate-host for host backup1009.eqiad.wmnet [11:02:27] !log root@cumin1002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host backup1009.eqiad.wmnet [11:02:54] (03CR) 10Btullis: [C:03+2] Revert "Failover hive and presto to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1160733 (owner: 10Btullis) [11:03:16] !log btullis@dns1004 START - running authdns-update [11:04:01] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki_experimental: $kubernetes_release dir fix [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [11:04:10] (03Merged) 10jenkins-bot: mediawiki-common: remove disk-type affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160719 (owner: 10Effie Mouzeli) [11:04:13] !log btullis@dns1004 END - running authdns-update [11:04:33] (03PS1) 10Jcrespo: bacula: Force puppet7 on backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/1160735 (https://phabricator.wikimedia.org/T387892) [11:05:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory 
usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:06:06] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:06:10] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:07:03] !log root@cumin1002 START - Cookbook sre.puppet.migrate-host for host backup1009.eqiad.wmnet [11:07:04] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti2023/ganeti2024 as Ganeti servers [puppet] - 10https://gerrit.wikimedia.org/r/1159937 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff) [11:07:14] (03CR) 10Jcrespo: [C:03+2] bacula: Force puppet7 on backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/1160735 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [11:07:52] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [11:09:56] !log root@cumin1002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host backup1009.eqiad.wmnet [11:12:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T396130)', diff saved to https://phabricator.wikimedia.org/P78343 and previous config saved to /var/cache/conftool/dbconfig/20250618-111217-marostegui.json [11:12:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:12:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1206.eqiad.wmnet with reason: Maintenance [11:12:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78344 and previous config saved to /var/cache/conftool/dbconfig/20250618-111239-marostegui.json [11:13:32] !log jmm@cumin1003 START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet [11:16:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78345 and previous config saved to /var/cache/conftool/dbconfig/20250618-111620-root.json [11:18:10] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [11:18:10] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927588 (10BTullis) >>! In T394543#10927411, @Volans wrote: > @BTullis the SSD upgrade is a type of its own, not STORAGE, so the files must be in `/srv/fi... [11:19:23] (03PS1) 10Effie Mouzeli: mw-experimental: scale down resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160738 [11:19:42] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:20:16] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:21:20] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:21:48] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:22:31] (03CR) 10Clément Goubert: [C:03+1] mw-experimental: scale down resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160738 (owner: 10Effie Mouzeli) [11:22:45] 10ops-eqiad, 06DBA, 06DC-Ops: Failed power supply on es1045 - https://phabricator.wikimedia.org/T397310 (10FCeratto-WMF) 03NEW [11:22:49] (03CR) 10Effie Mouzeli: [C:03+2] mw-experimental: scale down resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160738 (owner: 10Effie Mouzeli) [11:24:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10927618 (10MoritzMuehlenhoff) [11:24:31] (03Merged) 10jenkins-bot: mw-experimental: scale down 
resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160738 (owner: 10Effie Mouzeli) [11:24:33] 10ops-eqiad, 06DBA, 06DC-Ops: Failed power supply on es1045 - https://phabricator.wikimedia.org/T397310#10927627 (10Marostegui) p:05Triage→03Medium [11:24:45] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:25:12] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:26:21] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10927634 (10taavi) [11:26:47] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [11:27:34] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [11:28:25] FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:30:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:30:41] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10927696 (10MoritzMuehlenhoff) [11:31:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78347 and previous config saved to /var/cache/conftool/dbconfig/20250618-113103-marostegui.json [11:31:08] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 
[11:31:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78348 and previous config saved to /var/cache/conftool/dbconfig/20250618-113125-root.json [11:35:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:43:37] (03PS1) 10Jcrespo: bacula: Migrate and create general properties for the new/renamed roles [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) [11:44:47] (03CR) 10Alexandros Kosiaris: [C:03+1] otel: add tolerations for mw-experimental hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160173 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [11:45:11] (03PS1) 10KartikMistry: Enable the Contribute menu in 8th group of Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160740 (https://phabricator.wikimedia.org/T395084) [11:46:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P78349 and previous config saved to /var/cache/conftool/dbconfig/20250618-114610-marostegui.json [11:46:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160740 (https://phabricator.wikimedia.org/T395084) (owner: 10KartikMistry) [11:47:18] (03PS2) 10Jcrespo: bacula: Migrate and create general properties for the new/renamed roles [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) [11:47:46] (03CR) 10Jcrespo: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [11:50:31] (03PS1) 10Hnowlan: mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160742 [11:51:34] (03CR) 10Effie Mouzeli: [C:03+1] mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160742 (owner: 10Hnowlan) [11:52:02] (03CR) 10Jelto: [C:03+1] "this looks good to me now and spins up a working environment in kind! Also the dependencies between the admin are correct, no second `helm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [11:52:22] (03CR) 10Hnowlan: [C:03+2] mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160742 (owner: 10Hnowlan) [11:53:58] (03Merged) 10jenkins-bot: mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160742 (owner: 10Hnowlan) [11:54:46] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:54:47] (03PS1) 10Volans: sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 [11:55:04] (03PS3) 10Jcrespo: bacula: Migrate and create general properties for the new/renamed roles [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) [11:55:07] (03CR) 10Slyngshede: [C:03+2] Update requirements for netbox 4.0.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160724 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede) [11:55:11] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [11:55:11] (03CR) 10Slyngshede: [V:03+2 C:03+2] Update requirements for netbox 4.0.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160724 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede) [11:55:17] !log hnowlan@deploy1003 helmfile [eqiad] 
DONE helmfile.d/services/mobileapps: apply [11:55:20] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [11:55:21] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [11:55:26] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [11:55:39] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927811 (10Volans) The cookbook exited with that code because it had a failure, unfortunately was missing a useful logging message at the right point. I'm... [11:56:34] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [11:58:25] RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:58:58] (03CR) 10Jcrespo: [C:03+2] bacula: Migrate and create general properties for the new/renamed roles [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [12:00:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [12:01:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P78350 and previous config saved to /var/cache/conftool/dbconfig/20250618-120117-marostegui.json [12:02:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1160739 
(https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [12:04:04] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10927851 (10brouberol) 05In progress→03Resolved This ^ message ^ was posted by a rogue reimage cookbook that had bee... [12:05:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [12:05:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:08:31] (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Handle dnsutils/bind9-dnsutils correctly across all OSes [puppet] - 10https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [12:08:33] (03CR) 10Jelto: [C:03+2] gitlab-runner: upgrade default image to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1160119 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [12:09:07] PROBLEM - Host ms-fe1016 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:31] FIRING: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:11:10] (03PS1) 10Hnowlan: mobileapps: increase CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160747 [12:11:35] RECOVERY - Host ms-fe1016 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [12:12:15] (03CR) 10Clément Goubert: [C:03+1] mobileapps: increase CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160747 (owner: 10Hnowlan) [12:12:27] (03PS2) 10Filippo Giunchedi: thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) [12:12:27] (03PS1) 10Filippo Giunchedi: thanos: force query-frontend query stats [puppet] - 10https://gerrit.wikimedia.org/r/1160748 (https://phabricator.wikimedia.org/T394318) [12:12:28] (03PS1) 10Filippo Giunchedi: thanos: enable snappy compression for grpc in query [puppet] - 10https://gerrit.wikimedia.org/r/1160749 (https://phabricator.wikimedia.org/T394318) [12:12:30] (03PS1) 10Filippo Giunchedi: thanos limits in common [puppet] - 10https://gerrit.wikimedia.org/r/1160750 [12:13:06] (03CR) 10Hnowlan: [C:03+2] mobileapps: increase CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160747 (owner: 10Hnowlan) [12:13:29] RESOLVED: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:20] working to fix that ^ [12:14:54] (03Merged) 10jenkins-bot: mobileapps: increase CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160747 (owner: 10Hnowlan) [12:14:58] (03CR) 10CI reject: [V:04-1] thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 
(https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [12:15:00] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:15:32] (03PS2) 10Filippo Giunchedi: hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) [12:15:36] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Failed power supply on es1045 - https://phabricator.wikimedia.org/T397310#10927872 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable idrac shows healthy [12:15:46] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:16:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78351 and previous config saved to /var/cache/conftool/dbconfig/20250618-121624-marostegui.json [12:16:29] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:16:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [12:16:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T396130)', diff saved to https://phabricator.wikimedia.org/P78352 and previous config saved to /var/cache/conftool/dbconfig/20250618-121646-marostegui.json [12:17:00] (03PS3) 10Filippo Giunchedi: thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) [12:17:01] (03PS3) 10Filippo Giunchedi: hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) [12:20:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:22:01] (03CR) 10Filippo Giunchedi: [C:04-1] "Looks like the hostgroup definition is missing" [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [12:23:42] RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:24:15] (03PS1) 10Jgiannelos: changeprop: Add header with event timestamp for PCS requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160753 [12:24:30] (03PS2) 10Jgiannelos: changeprop: Add header with event timestamp for PCS requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160753 (https://phabricator.wikimedia.org/T397072) [12:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:27:59] (03CR) 10Tiziano Fogli: [C:03+1] team-sre: check PoPs for PrometheusDown [alerts] - 10https://gerrit.wikimedia.org/r/1160177 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [12:29:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [12:29:19] (03CR) 10Tiziano Fogli: [C:03+1] thanos: force query-frontend query stats [puppet] - 
10https://gerrit.wikimedia.org/r/1160748 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [12:29:32] FIRING: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:58] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Degraded RAID on analytics1073 - https://phabricator.wikimedia.org/T397231#10927991 (10Jclark-ctr) Hey @btullis will we be swapping this drive or is this server due to be decom? 7 years old i dont believe i have any 120gb drives. but si... [12:30:18] jouncebot: nowandnext [12:30:18] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [12:30:18] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1300) [12:31:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:31:59] @Amir1: I was going to deploy a new version of scap, that ok from your side? 
[12:32:11] yeah, I actually changed my mind [12:32:13] (03CR) 10Elukey: [V:03+1 C:03+2] role::maps::master: fix Tegola container name [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) (owner: 10Elukey) [12:32:17] ack [12:32:44] !log jnuche@deploy1003 Installing scap version "4.178.0" for 183 host(s) [12:32:56] (03PS1) 10Hnowlan: mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755 [12:33:29] RESOLVED: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:33] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1193.eqiad.wmnet with reason: Maintenance [12:34:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [12:34:40] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2161.codfw.wmnet with reason: Maintenance [12:35:04] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156325 (owner: 10PipelineBot) [12:35:08] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160232 (owner: 10PipelineBot) [12:36:16] (03CR) 10Filippo Giunchedi: [C:03+2] team-sre: check PoPs for PrometheusDown [alerts] - 10https://gerrit.wikimedia.org/r/1160177 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [12:36:59] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1218 (T396130)', diff saved to https://phabricator.wikimedia.org/P78353 and previous config saved to /var/cache/conftool/dbconfig/20250618-123658-marostegui.json [12:37:03] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:37:30] (03CR) 10Tiziano Fogli: [C:03+1] thanos: enable snappy compression for grpc in query [puppet] - 10https://gerrit.wikimedia.org/r/1160749 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [12:37:46] (03PS1) 10Jcrespo: bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) [12:38:33] (03CR) 10Volans: "Additional context in https://phabricator.wikimedia.org/T394543#10927588" [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans) [12:38:34] (03PS2) 10Jcrespo: bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) [12:38:38] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [12:40:42] (03CR) 10CI reject: [V:04-1] bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [12:41:05] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-06-09-163022 to 2025-06-17-205547 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160759 (https://phabricator.wikimedia.org/T394401) [12:41:08] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-10-144243 to 2025-06-17-204731 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160760 (https://phabricator.wikimedia.org/T390550) [12:41:19] (03PS1) 10Jforrester: wikifunctions: Enable memcached-based batching for ZObjects 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550) [12:41:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:41:51] (03PS3) 10Jgiannelos: changeprop: Add header with event timestamp for PCS requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160753 (https://phabricator.wikimedia.org/T397072) [12:43:27] !log drop old Thanos Swift's Tegola tile cache containers - T396584 [12:43:27] (03PS3) 10Jcrespo: bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) [12:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:31] T396584: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584 [12:43:32] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [12:43:44] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1048.eqiad.wmnet [12:45:34] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet [12:45:38] (03PS4) 10Jcrespo: bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) [12:46:00] (03CR) 10Jcrespo: [C:03+2] bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [12:48:10] (03CR) 10Tiziano Fogli: [C:03+1] thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 
(https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [12:48:54] (03CR) 10Tiziano Fogli: [C:03+1] hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [12:49:12] (03PS2) 10Hnowlan: mobileapps: increase replicas, drop CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755 [12:49:48] (03CR) 10Jgiannelos: [C:03+1] mobileapps: increase replicas, drop CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755 (owner: 10Hnowlan) [12:51:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:51:54] (03CR) 10Hnowlan: [C:03+2] mobileapps: increase replicas, drop CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755 (owner: 10Hnowlan) [12:52:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P78354 and previous config saved to /var/cache/conftool/dbconfig/20250618-125206-marostegui.json [12:53:42] (03Merged) 10jenkins-bot: mobileapps: increase replicas, drop CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755 (owner: 10Hnowlan) [12:54:11] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:55:12] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:56:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:57:44] (03CR) 10Xcollazo: [C:03+1] Mark outbound rsync traffic from clouddumps as low-prio for qos [puppet] - 
10https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [12:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.05%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1300). [13:00:04] kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] o/ [13:00:40] alright here [13:00:40] !log bacula director migration finalized, backup1014 is the new bacula director. backup1001 should no longer be used. T387892 [13:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:44] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [13:00:46] Lucas_WMDE: I can deploy [13:00:59] hi there, please stand by for backports for a bit [13:01:01] (03CR) 10Elukey: [C:03+1] sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans) [13:01:02] I need to deploy a scap fix [13:01:17] jnuche: sure. let me know. [13:01:48] kart_, jnuche: go ahead, I’m a bit busy right now anyway :) [13:02:09] :) [13:03:10] (03CR) 10Cathal Mooney: "Thanks! 
FWIW we can see the distribution in our netflow data:" [puppet] - 10https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [13:03:12] (03CR) 10Cathal Mooney: [C:03+2] Mark outbound rsync traffic from clouddumps as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [13:03:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:04:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.05%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [13:05:44] (03PS1) 10Hnowlan: mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 [13:06:25] (03PS1) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303) [13:07:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P78355 and previous config saved to /var/cache/conftool/dbconfig/20250618-130713-marostegui.json [13:07:31] !log jnuche@deploy1003 Installing scap version "4.178.1" for 4 host(s) [13:07:33] jouncebot: now and next [13:07:33] For the next 0 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1300) [13:08:20] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:esams 
or A:drmrs and A:cp - 9.2.10 upgrade (T390912) [13:08:23] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10928125 (10elukey) >>! In T396584#10927316, @MatthewVernon wrote: > Silly question while I'm here - do you need 2 buck... [13:08:24] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [13:08:35] (03PS1) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160765 (https://phabricator.wikimedia.org/T397303) [13:09:16] (03CR) 10CI reject: [V:04-1] prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160765 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [13:09:40] (03Abandoned) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160765 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [13:09:44] (03CR) 10Jgiannelos: [C:03+1] mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 (owner: 10Hnowlan) [13:10:22] !log jnuche@deploy1003 Installation of scap version "4.178.1" completed for 4 hosts [13:10:55] scap updated, need a minute to verify [13:10:57] (03PS2) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303) [13:11:08] (03CR) 10Muehlenhoff: [C:03+2] memcached: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156269 (owner: 10Muehlenhoff) [13:11:52] (03PS2) 10Hnowlan: mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 [13:12:18] kart_: all good, you can go ahead, thanks for your patience! 
[13:12:19] (03CR) 10Jgiannelos: [C:03+1] mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 (owner: 10Hnowlan) [13:12:37] jnuche: Sure. Thanks! [13:12:41] (03PS1) 10Ssingh: Release 9.2.11-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1160766 (https://phabricator.wikimedia.org/T397308) [13:13:05] I am going to apply a change for mobileapps - backport window can continue alongside it but it's critical my change is applied [13:13:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160740 (https://phabricator.wikimedia.org/T395084) (owner: 10KartikMistry) [13:13:41] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5004.eqsin.wmnet} and A:liberica (T396561) [13:13:46] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [13:13:47] (03CR) 10Hnowlan: [C:03+2] mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 (owner: 10Hnowlan) [13:13:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:13:58] (03Merged) 10jenkins-bot: Enable the Contribute menu in 8th group of Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160740 (https://phabricator.wikimedia.org/T395084) (owner: 10KartikMistry) [13:14:06] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5004.eqsin.wmnet} and A:liberica (T396561) [13:14:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:14:23] !log 
kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1160740|Enable the Contribute menu in 8th group of Wikipedias (T395084)]] [13:14:28] T395084: Enable the Contribute menu in 8th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T395084 [13:14:33] (03CR) 10Btullis: [C:03+1] sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans) [13:14:41] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs5004.eqsin.wmnet with reason: switching to katran [13:14:42] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs5004 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1160723 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:15:32] (03Merged) 10jenkins-bot: mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 (owner: 10Hnowlan) [13:15:44] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:15:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:16:41] !log kartik@deploy1003 kartik: Backport for [[gerrit:1160740|Enable the Contribute menu in 8th group of Wikipedias (T395084)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:18:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:18:29] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:18:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:19:21] !log kartik@deploy1003 kartik: Continuing with sync [13:19:36] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:19:40] (03CR) 10Volans: [C:03+2] sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans) [13:19:44] (03PS1) 10Hnowlan: admin_ng: increase limits for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160767 [13:19:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [13:19:44] Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[13:19:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:21:35] jhancock@cumin2002 provision (PID 63771) is awaiting input [13:22:15] (03PS2) 10Jforrester: wikifunctions: Update evaluators from 2025-06-09-163022 to 2025-06-17-205547 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160759 (https://phabricator.wikimedia.org/T394401) [13:22:15] (03PS2) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-10-144243 to 2025-06-18-130945 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160760 (https://phabricator.wikimedia.org/T390550) [13:22:15] (03PS2) 10Jforrester: wikifunctions: Enable memcached-based batching for ZObjects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550) [13:22:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T396130)', diff saved to https://phabricator.wikimedia.org/P78356 and previous config saved to /var/cache/conftool/dbconfig/20250618-132220-marostegui.json [13:22:25] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:22:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [13:22:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78357 and previous config saved to /var/cache/conftool/dbconfig/20250618-132242-marostegui.json [13:23:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:24:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[13:24:44] Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [13:24:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:26:20] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160740|Enable the Contribute menu in 8th group of Wikipedias (T395084)]] (duration: 11m 57s) [13:26:25] T395084: Enable the Contribute menu in 8th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T395084 [13:26:55] Done. [13:27:02] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans) [13:27:32] (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: increase limits for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160767 (owner: 10Hnowlan) [13:29:09] (03PS1) 10Vgutierrez: hiera: Repool lvs5004 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160770 (https://phabricator.wikimedia.org/T396561) [13:29:28] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160770 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:29:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:29:35] (03CR) 10Herron: [C:03+1] profile::pyrra: add SLO ratio for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [13:29:51] (03CR) 10Ssingh: [C:03+1] hiera: Repool lvs5004 with katran 
[puppet] - 10https://gerrit.wikimedia.org/r/1160770 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:29:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:30:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.09%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [13:30:45] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs5004.eqsin.wmnet [13:30:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs5004.eqsin.wmnet [13:31:17] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs5004 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160770 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:31:47] 06SRE, 06Infrastructure-Foundations, 10netops: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10928195 (10cmooney) 05Open→03Resolved a:03cmooney [13:31:50] (03CR) 10Herron: [C:03+1] thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [13:31:58] (03CR) 10Muehlenhoff: [C:03+2] memcached/gutter: Switch to firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156659 (owner: 10Muehlenhoff) [13:32:13] (03CR) 10Herron: [C:03+1] hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [13:33:12] (03PS1) 10Elukey: profile::docker::reporter: add another not supported image [puppet] - 10https://gerrit.wikimedia.org/r/1160771 [13:35:10] RESOLVED: 
GanetiMemoryPressure: Ganeti: High memory usage (95.14%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [13:36:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.636s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:36:25] !log jnuche@deploy1003 Installing scap version "4.178.2" for 4 host(s) [13:37:09] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5004.eqsin.wmnet} and A:liberica (T396561) [13:37:10] !log repool lvs5004 (text) using katran - T396561 [13:37:14] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [13:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:28] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5004.eqsin.wmnet} and A:liberica (T396561) [13:39:15] !log jnuche@deploy1003 Installation of scap version "4.178.2" completed for 4 hosts [13:40:17] (03PS1) 10Hnowlan: mobileapps: bump replicas significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160773 [13:40:21] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10928238 (10MatthewVernon) [13:40:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:04] (03CR) 10Cathal Mooney: "LGTM overall, one nit/potential typo in line." [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:41:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.007s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:42:30] !log installing net-tools regression updates on Bullseye [13:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78358 and previous config saved to /var/cache/conftool/dbconfig/20250618-134307-marostegui.json [13:43:12] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:47:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:47:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:48:12] FIRING: SLOMetricAbsent: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:48:53] (03PS1) 10Bking: cirrus-streaming-updater: raise jobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160780 (https://phabricator.wikimedia.org/T397335) [13:49:46] (03CR) 10Jgiannelos: [C:03+1] 
mobileapps: bump replicas significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160773 (owner: 10Hnowlan) [13:49:53] (03CR) 10Gmodena: [C:03+1] cirrus-streaming-updater: raise jobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160780 (https://phabricator.wikimedia.org/T397335) (owner: 10Bking) [13:50:09] (03PS2) 10Bking: cirrus-streaming-updater: raise jobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160780 (https://phabricator.wikimedia.org/T397335) [13:53:12] RESOLVED: SLOMetricAbsent: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:54:09] (03CR) 10Hashar: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar) [13:54:17] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:54:31] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:56:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.263s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:56:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:57:07] (03CR) 10CDanis: [C:03+1] "+1 deploy at will <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160173 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [13:58:15] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P78359 and previous config saved to /var/cache/conftool/dbconfig/20250618-135814-marostegui.json [13:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [13:59:26] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2005'] [13:59:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2005'] [13:59:59] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:00:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2005.codfw.wmnet with OS bullseye [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1400) [14:00:13] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928384 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2005.codfw.wmnet with OS bullseye [14:00:53] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-06-09-163022 to 2025-06-17-205547 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160759 (https://phabricator.wikimedia.org/T394401) (owner: 10Jforrester) [14:01:08] (03CR) 10Elukey: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar) [14:02:33] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 
2025-06-09-163022 to 2025-06-17-205547 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160759 (https://phabricator.wikimedia.org/T394401) (owner: 10Jforrester) [14:03:21] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:03:54] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:04:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:04:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [14:05:32] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2006'] [14:05:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2006'] [14:06:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2006.codfw.wmnet with OS bullseye [14:06:24] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928429 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2006.codfw.wmnet with OS bullseye [14:08:16] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:08:56] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:09:00] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:09:41] !log jforrester@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/wikifunctions: apply [14:10:29] !log jnuche@deploy1003 Installing scap version "4.178.3" for 4 host(s) [14:10:46] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-06-10-144243 to 2025-06-18-130945 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160760 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester) [14:11:29] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2025/2026-Q1): Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#10928440 (10lmata) [14:12:21] (03PS3) 10Jgreen: nsca_frack.cfg.erb break out trino hostgroup, add trino API check [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) [14:12:54] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-06-10-144243 to 2025-06-18-130945 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160760 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester) [14:13:21] !log jnuche@deploy1003 Installation of scap version "4.178.3" completed for 4 hosts [14:13:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P78360 and previous config saved to /var/cache/conftool/dbconfig/20250618-141322-marostegui.json [14:13:46] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:15:24] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:15:48] (03CR) 10Bking: [C:03+2] cirrus-streaming-updater: raise jobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160780 (https://phabricator.wikimedia.org/T397335) (owner: 10Bking) [14:16:08] (03PS1) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160797 
(https://phabricator.wikimedia.org/T395823) [14:17:19] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:17:29] jhancock@cumin2002 provision (PID 76583) is awaiting input [14:17:51] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:18:04] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:18:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:18:51] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:19:19] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2007'] [14:19:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2007'] [14:19:36] (03CR) 10Jforrester: [C:04-1] wikifunctions: Enable memcached-based batching for ZObjects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester) [14:20:17] (03CR) 10Jforrester: [C:04-1] "Not deploying right now as there's an issue making it hard for us to inspect staging." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester) [14:20:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007.codfw.wmnet with OS bullseye [14:20:34] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928484 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2007.codfw.wmnet with OS bullseye [14:21:26] (03PS1) 10Dbrant: Add 'wikipedia:' to list of recognized protocols. [core] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160802 (https://phabricator.wikimedia.org/T386004) [14:21:44] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:21:59] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:24:14] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:24:22] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:24:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:25:18] jhancock@cumin2002 reimage (PID 78485) is awaiting input [14:26:03] jmm@cumin1003 drain-node (PID 2111453) is awaiting input [14:26:54] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:28:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78363 and previous config saved to /var/cache/conftool/dbconfig/20250618-142829-marostegui.json [14:28:35] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:28:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [14:28:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T396130)', diff saved to https://phabricator.wikimedia.org/P78364 and previous config saved to /var/cache/conftool/dbconfig/20250618-142852-marostegui.json [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1430) [14:30:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.07%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [14:35:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.07%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [14:40:22] (03PS1) 10Hnowlan: mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 [14:40:26] !log Running `mwscript-k8s --php_version=8.1 -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --cache --verbose --zType 
Z8` for T396449 [14:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:30] T396449: WikifunctionsPFragmentHandler::fetchFunctionFromCache cache miss while fetching Z20744 for empty argument Z20744K1 - https://phabricator.wikimedia.org/T396449 [14:41:14] jhancock@cumin2002 reimage (PID 79274) is awaiting input [14:41:33] !log bking@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:41:41] !log bking@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:41:42] (03CR) 10Clément Goubert: [C:03+1] profile::docker::reporter: add another not supported image [puppet] - 10https://gerrit.wikimedia.org/r/1160771 (owner: 10Elukey) [14:42:03] (03PS1) 10JMeybohm: k8s.pool-depool-cluster: Black format [cookbooks] - 10https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148) [14:42:05] (03PS1) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 [14:45:19] !log bking@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:45:24] !log bking@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:45:41] (03CR) 10JHathaway: [C:03+1] No longer use mirrors.debian.org on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1160171 (owner: 10Muehlenhoff) [14:47:05] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1048.eqiad.wmnet [14:48:13] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet [14:49:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T396130)', diff saved to https://phabricator.wikimedia.org/P78365 and previous config saved to /var/cache/conftool/dbconfig/20250618-144903-marostegui.json [14:49:08] T396130: Add afl_ip_hex column and 
afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:49:09] (03CR) 10CI reject: [V:04-1] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (owner: 10JMeybohm) [14:50:19] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add another not supported image [puppet] - 10https://gerrit.wikimedia.org/r/1160771 (owner: 10Elukey) [14:50:25] !log reprepro included conftool 5.3.0 in apt.wikimedia.org - T395696 [14:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:30] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696 [14:50:47] moritzm: ok to merge? [14:51:33] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir7002.magru.wmnet [14:51:37] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir7002.magru.wmnet [14:51:57] elukey: give me 30 seconds [14:52:23] elukey: first needed to disable puppet, can be merged now [14:53:15] running it [14:54:12] (03CR) 10Filippo Giunchedi: [C:03+2] nsca_frack.cfg.erb break out trino hostgroup, add trino API check [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [14:56:29] (03PS1) 10MVernon: thanos: remove drained thanos-be100[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160824 (https://phabricator.wikimedia.org/T391352) [14:57:37] (03PS2) 10MVernon: thanos: remove drained thanos-be100[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160824 (https://phabricator.wikimedia.org/T391352) [14:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [15:00:09] jouncebot nowandnext [15:00:09] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [15:00:09] In 1 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1700) [15:01:13] !log dancy@deploy1003 Started scap sync-world: Testing T396166 [15:01:19] T396166: Are `php_fpm`/`php_version` inside `scap.cfg` used anymore? - https://phabricator.wikimedia.org/T396166 [15:03:08] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet [15:04:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [15:04:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P78366 and previous config saved to /var/cache/conftool/dbconfig/20250618-150410-marostegui.json [15:06:24] btullis@cumin1003 upgrade-firmware (PID 2125299) is awaiting input [15:06:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:50] !log dancy@deploy1003 Finished scap sync-world: Testing T396166 (duration: 08m 37s) [15:09:56] 
T396166: Are `php_fpm`/`php_version` inside `scap.cfg` used anymore? - https://phabricator.wikimedia.org/T396166 [15:12:24] (03PS13) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [15:13:19] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM! Nice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [15:15:56] (03PS14) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sde) failed in ms-be1071 - https://phabricator.wikimedia.org/T397343 (10MatthewVernon) 03NEW [15:17:36] (03PS2) 10Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [15:17:48] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sde) failed in ms-be1071 - https://phabricator.wikimedia.org/T397343#10928670 (10MatthewVernon) p:05Triage→03High [this is blocking ongoing load/drain operations for the eqiad ms cluster] [15:18:42] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sde) failed in ms-be1071 - https://phabricator.wikimedia.org/T397343#10928673 (10MatthewVernon) [15:19:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P78367 and previous config saved to 
/var/cache/conftool/dbconfig/20250618-151918-marostegui.json [15:19:27] (03CR) 10Effie Mouzeli: [C:03+1] mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 (owner: 10Hnowlan) [15:19:33] (03CR) 10AOkoth: miscweb: add os-reports update mechanism (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [15:23:38] (03PS1) 10Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160842 (https://phabricator.wikimedia.org/T216601) [15:23:44] (03CR) 10Hnowlan: [C:03+2] mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 (owner: 10Hnowlan) [15:24:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.889s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:24:19] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10928689 (10MoritzMuehlenhoff) >>! In T396660#10923171, @MoritzMuehlenhoff wrote: >> While reviewing `/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` I noticed that we have a mixtu... 
[15:24:57] (03CR) 10Alexandros Kosiaris: [C:03+1] mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 (owner: 10Hnowlan) [15:25:20] (03Merged) 10jenkins-bot: mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 (owner: 10Hnowlan) [15:26:04] (03CR) 10Jakob: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160842 (https://phabricator.wikimedia.org/T216601) (owner: 10Lucas Werkmeister (WMDE)) [15:26:20] jouncebot: now [15:26:20] No deployments scheduled for the next 1 hour(s) and 33 minute(s) [15:26:25] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "deploying" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160842 (https://phabricator.wikimedia.org/T216601) (owner: 10Lucas Werkmeister (WMDE)) [15:27:24] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:28:15] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160842 (https://phabricator.wikimedia.org/T216601) (owner: 10Lucas Werkmeister (WMDE)) [15:29:01] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:29:28] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:29:57] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [15:30:24] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [15:30:36] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [15:30:57] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [15:31:23] (03PS2) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 
10https://gerrit.wikimedia.org/r/1160817 [15:31:23] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [15:31:40] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [15:34:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.914s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:34:24] (03PS2) 10Tchanders: Configure event stream for IP auto-reveal instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600) [15:34:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T396130)', diff saved to https://phabricator.wikimedia.org/P78368 and previous config saved to /var/cache/conftool/dbconfig/20250618-153425-marostegui.json [15:34:32] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:34:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance [15:34:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T396130)', diff saved to https://phabricator.wikimedia.org/P78369 and previous config saved to /var/cache/conftool/dbconfig/20250618-153448-marostegui.json [15:35:03] (03CR) 10Marostegui: [C:03+1] thanos: remove drained thanos-be100[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160824 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [15:35:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for 
host cloudcephosd2006.codfw.wmnet with OS bullseye [15:35:20] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928756 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2006.codfw.wmnet with OS bullseye ex... [15:35:20] (03CR) 10BCornwall: [C:03+1] Release 9.2.11-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1160766 (https://phabricator.wikimedia.org/T397308) (owner: 10Ssingh) [15:35:58] (03CR) 10MVernon: [C:03+2] thanos: remove drained thanos-be100[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160824 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [15:37:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007.codfw.wmnet with OS bullseye [15:37:10] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928762 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2007.codfw.wmnet with OS bullseye ex... 
[15:38:18] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007.codfw.wmnet with OS bullseye [15:38:27] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2007.codfw.wmnet with OS bullseye [15:39:33] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:41:02] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-codfw and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [15:41:07] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [15:44:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2005.codfw.wmnet with OS bullseye [15:44:41] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928800 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2005.codfw.wmnet with OS bullseye ex... [15:45:27] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10928801 (10Jhancock.wm) @Fabfur we will unfortunatly have to use UEFI on these machines. Could you update partman to make those changes. Then i can proceed. I'm working... 
[15:45:42] !log Depooling cp7001 for firmware upgrades re: thermal support ticket - T386959 [15:45:45] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.* [15:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:47] T386959: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959 [15:46:06] (03PS1) 10MVernon: thanos: add new backends, remove old ones gone from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160855 (https://phabricator.wikimedia.org/T391352) [15:46:10] (03PS1) 10MVernon: thanos: add new nodes to ring, drain old ones [puppet] - 10https://gerrit.wikimedia.org/r/1160856 (https://phabricator.wikimedia.org/T392908) [15:46:44] (03PS1) 10Kimberly Sarabia: Revert "Enable new mobile search experience everywhere (not including empty search recommendations)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 [15:47:46] (03PS2) 10Kimberly Sarabia: Revert "Enable new mobile search experience everywhere (not including empty search recommendations)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 [15:48:22] (03PS1) 10Hnowlan: admin_ng: remove quotas and ranges for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160860 [15:48:53] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928813 (10Jhancock.wm) [15:52:44] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:52:54] (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: remove quotas and ranges for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160860 (owner: 10Hnowlan) [15:52:58] (03CR) 10Ssingh: [C:03+2] Release 9.2.11-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1160766 (https://phabricator.wikimedia.org/T397308) (owner: 10Ssingh) [15:54:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 (owner: 10Kimberly Sarabia) [15:54:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T396130)', diff saved to https://phabricator.wikimedia.org/P78370 and previous config saved to /var/cache/conftool/dbconfig/20250618-155455-marostegui.json [15:55:01] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:56:11] (03CR) 10Hnowlan: [C:03+2] admin_ng: remove quotas and ranges for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160860 (owner: 10Hnowlan) [15:57:58] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:58:12] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet [15:58:59] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:59:31] !log deployed conftool 5.3.0 to all bullseye and bookworm hosts - T395696 [15:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:36] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696 [16:00:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.05%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [16:01:06] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10928851 (10BTullis) >>! In T394543#10927811, @Volans wrote: > If you try to re-run it it does tell you there is nothing to upgrade right? I can confirm t... [16:01:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10928854 (10BTullis) 05Open→03Resolved [16:02:49] !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-mariadb1002.eqiad.wmnet with reason: Upgrading SSD firmware [16:02:53] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10928859 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fe78ceb4-644b-4a5a-a80d-c1b0a1c98616) set by btullis@cumin1003 for 1:00:00... 
[16:03:07] (03Merged) 10jenkins-bot: admin_ng: remove quotas and ranges for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160860 (owner: 10Hnowlan) [16:03:11] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:03:32] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-mariadb1002.eqiad.wmnet [16:05:03] (03CR) 10LorenMora: [C:03+1] Revert "Enable new mobile search experience everywhere (not including empty search recommendations)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 (owner: 10Kimberly Sarabia) [16:05:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.08%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [16:07:15] btullis@cumin1003 upgrade-firmware (PID 2131701) is awaiting input [16:07:46] (03CR) 10Hashar: "I had the issue with docker-pkg for quite a while and I came to fix it as I went to address I6f1a443473ae92f24651fd9879b8c156d5adb2c5" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156733 (owner: 10Hashar) [16:08:23] jhancock@cumin1003 provision (PID 2131666) is awaiting input [16:09:24] (03PS1) 10Cwhite: logstash: drop mobileapps detail field [puppet] - 10https://gerrit.wikimedia.org/r/1160875 (https://phabricator.wikimedia.org/T390215) [16:10:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P78371 and previous config saved to /var/cache/conftool/dbconfig/20250618-161003-marostegui.json [16:10:17] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp7001.magru.wmnet with reason: BIOS upgrades 
[16:10:28] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:10:34] (03PS2) 10Cwhite: logstash: drop mobileapps detail field [puppet] - 10https://gerrit.wikimedia.org/r/1160875 (https://phabricator.wikimedia.org/T390215) [16:10:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet [16:10:45] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:11:43] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:11:51] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10928876 (10BTullis) [16:12:18] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10928879 (10BTullis) [16:13:09] (03CR) 10Cwhite: [C:03+2] logstash: drop mobileapps detail field [puppet] - 10https://gerrit.wikimedia.org/r/1160875 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [16:13:43] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10928884 (10Volans) yes if you pick the same version (option 0 above) it would just tell you that there is nothing to do because already at the same versio... 
[16:15:47] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:16:48] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:18:04] (03PS3) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 [16:18:13] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10928900 (10Jhancock.wm) also looks like i'm gonna need to drag @elukey into this. I manually set the ip and the user password for these servers but i still can't get a... [16:19:39] !log hnowlan@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:19:56] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005.codfw.wmnet with OS bullseye [16:20:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet [16:20:08] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts an-mariadb1002.eqiad.wmnet [16:20:11] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2005.codfw.wmnet with OS bullseye [16:20:47] !log hnowlan@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[16:20:49] (03CR) 10Hashar: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar) [16:21:37] !log hnowlan@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:21:52] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:esams or A:drmrs and A:cp - 9.2.10 upgrade (T390912) [16:21:57] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [16:23:47] !log hnowlan@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:25:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P78372 and previous config saved to /var/cache/conftool/dbconfig/20250618-162511-marostegui.json [16:26:55] (03PS4) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 [16:27:07] (03Abandoned) 10Aqu: Bump MW Page content change app version [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu) [16:28:01] (03PS1) 10Btullis: Prepare for renaming kafka-stretc200[1-2] to dse-k8s-worker200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789) [16:28:25] (03PS2) 10Btullis: Prepare for renaming kafka-stretch200[1-2] to dse-k8s-worker200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789) [16:29:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [16:29:53] (03CR) 10Dzahn: [C:03+2] phabricator::migration: ensure /srv/phab is 
the correct symlink [puppet] - https://gerrit.wikimedia.org/r/1160310 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [16:30:33] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [16:31:58] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:33:29] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:33:46] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:34:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [16:34:10] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:34:30] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:34:43] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:34:43] jouncebot: nowandnext [16:34:43] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [16:34:43] In 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1700) [16:35:30] (CR) Btullis: [C:+2] Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: Btullis) [16:37:42] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10929005 (Jhancock.wm) @Andrew i got these to the point where the image is on them, but for some reason it's not syncing with the puppetdb. Could you chec... 
[16:37:55] (Merged) jenkins-bot: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: Btullis) [16:38:26] (PS1) Jgiannelos: RB sunset: Abandon event processing for PCS events older than cache TTL [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 [16:39:21] (PS2) Jgiannelos: RB sunset: Abandon event processing for PCS events older than cache TTL [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) [16:40:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T396130)', diff saved to https://phabricator.wikimedia.org/P78373 and previous config saved to /var/cache/conftool/dbconfig/20250618-164019-marostegui.json [16:40:24] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [16:40:30] (PS3) Jgiannelos: RB sunset: Abandon event processing for PCS events older than cache TTL [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) [16:40:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1235.eqiad.wmnet with reason: Maintenance [16:40:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T396130)', diff saved to https://phabricator.wikimedia.org/P78374 and previous config saved to /var/cache/conftool/dbconfig/20250618-164041-marostegui.json [16:41:00] jmm@cumin1003 drain-node (PID 2122878) is awaiting input [16:41:28] (CR) Alexandros Kosiaris: [C:+1] "Couple of inline comments. 
Overall, this should work (probably does, I see it is merged already, however I am replying to the review reque" [puppet] - https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: Jcrespo) [16:41:53] (CR) CI reject: [V:-1] RB sunset: Abandon event processing for PCS events older than cache TTL [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: Jgiannelos) [16:41:54] (PS4) Jgiannelos: RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) [16:43:14] (CR) CI reject: [V:-1] RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: Jgiannelos) [16:44:19] (CR) Hnowlan: [C:-1] RB sunset: Configure claim TTL for PCS related endpoints (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: Jgiannelos) [16:45:38] (PS5) Jgiannelos: RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) [16:47:01] !log cdobbins@cumin2002:~$ sudo -i cookbook sre.cdn.roll-upgrade-ats --query 'A:cp-eqsin' --task-id T390912 --reason '9.2.10 upgrade' [16:47:02] (CR) CI reject: [V:-1] RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: Jgiannelos) [16:47:03] (PS5) JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [16:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:06] 
T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [16:47:26] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-eqsin and A:cp - 9.2.10 upgrade (T390912) [16:48:25] jouncebot nowandnext [16:48:26] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [16:48:26] In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1700) [16:49:04] (CR) TrainBranchBot: [C:+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136044 (https://phabricator.wikimedia.org/T364694) (owner: Aklapper) [16:50:01] (Merged) jenkins-bot: Update entries on https://www.mediawiki.org/keys/keys.html [mediawiki-config] - https://gerrit.wikimedia.org/r/1136044 (https://phabricator.wikimedia.org/T364694) (owner: Aklapper) [16:50:04] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2005.codfw.wmnet with OS bullseye [16:50:11] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10929057 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2005.codfw.wmnet with OS bullseye ex... [16:50:31] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1136044|Update entries on https://www.mediawiki.org/keys/keys.html (T364694)]] [16:50:36] T364694: https://www.mediawiki.org/keys/ needs update - https://phabricator.wikimedia.org/T364694 [16:52:46] !log dancy@deploy1003 dancy, aklapper: Backport for [[gerrit:1136044|Update entries on https://www.mediawiki.org/keys/keys.html (T364694)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:53:41] !log dancy@deploy1003 dancy, aklapper: Continuing with sync [16:55:50] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [16:56:51] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007.codfw.wmnet with OS bullseye [16:57:00] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10929065 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2007.codfw.wmnet with OS bullseye ex... [16:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [16:59:20] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:59:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1700) [17:00:41] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136044|Update entries on https://www.mediawiki.org/keys/keys.html (T364694)]] (duration: 10m 09s) [17:00:43] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb2011 to codfw - jhancock@cumin1003" [17:00:46] T364694: https://www.mediawiki.org/keys/ needs update - https://phabricator.wikimedia.org/T364694 [17:01:00] !log jhancock@cumin1003 END (PASS) 
- Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb2011 to codfw - jhancock@cumin1003" [17:01:00] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:01:04] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker2006 [17:01:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T396130)', diff saved to https://phabricator.wikimedia.org/P78375 and previous config saved to /var/cache/conftool/dbconfig/20250618-170109-marostegui.json [17:01:15] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:01:16] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker2006 [17:01:19] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker2007 [17:01:28] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker2007 [17:01:31] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker2008 [17:01:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker2008 [17:01:43] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker2009 [17:01:54] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker2009 [17:01:56] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host build2003 [17:02:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host build2003 [17:02:11] !log jhancock@cumin1003 START - Cookbook 
sre.network.configure-switch-interfaces for host rdb2011 [17:02:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb2011 [17:02:26] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host rdb2012 [17:02:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb2012 [17:02:43] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003 [17:02:52] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003 [17:03:14] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2004 [17:03:24] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2004 [17:03:27] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2006 [17:03:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2006 [17:03:45] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2009 [17:03:54] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2009 [17:04:07] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [17:04:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [17:04:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:06:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:12] (PS1) Hnowlan: Revert "mobileapps: Remove resource limits for canaries" [deployment-charts] - https://gerrit.wikimedia.org/r/1160914 [17:09:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:09:39] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:09:40] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:11:17] (CR) Hnowlan: [C:+2] Revert "mobileapps: Remove resource limits for canaries" [deployment-charts] - https://gerrit.wikimedia.org/r/1160914 (owner: Hnowlan) [17:11:38] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:12:35] jhancock@cumin1003 provision (PID 2138389) is awaiting input [17:13:01] jhancock@cumin1003 provision (PID 2138412) is awaiting input [17:13:08] (Merged) jenkins-bot: Revert "mobileapps: Remove resource limits for canaries" [deployment-charts] - https://gerrit.wikimedia.org/r/1160914 (owner: Hnowlan) [17:13:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.334s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:15:51] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:16:09] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:16:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P78376 and previous config saved to /var/cache/conftool/dbconfig/20250618-171617-marostegui.json [17:17:07] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:17:39] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:18:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.111s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:22:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:24:46] (CR) Scott French: "Thanks for the review!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) (owner: Scott French) [17:24:51] (CR) Scott French: [C:+2] shellbox: migrate to bookworm-based httpd image [deployment-charts] - https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) (owner: Scott French) [17:27:00] (Merged) jenkins-bot: shellbox: migrate to bookworm-based httpd image [deployment-charts] - https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) (owner: Scott French) [17:27:53] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [17:28:19] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:28:20] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:28:30] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-eqiad and A:cp - 9.2.10 upgrade (T390912) [17:28:34] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:28:34] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [17:28:35] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:28:45] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:28:46] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:28:54] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:28:56] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [17:29:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [17:29:14] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [17:29:15] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:29:37] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:31:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P78377 and previous config saved to /var/cache/conftool/dbconfig/20250618-173124-marostegui.json [17:31:43] (PS1) Majavah: hieradata: Fix Cloud VPS radosgw image CSP [puppet] - https://gerrit.wikimedia.org/r/1160934 (https://phabricator.wikimedia.org/T397351) [17:31:56] (PS1) Hashar: cloudlb: remove erroneous CSP policy [puppet] - https://gerrit.wikimedia.org/r/1160935 (https://phabricator.wikimedia.org/T397351) [17:32:41] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:33:27] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:33:47] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10929205 (Andrew) @Jhancock.wm I will have a look. I've also just noticed that the names for these servers is wrong, everything should be -dev. I'll updat... 
[17:33:58] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:34:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [17:34:17] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:34:49] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:35:05] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:35:26] (PS1) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [17:35:34] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.* [17:35:36] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:35:39] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:36:10] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:36:34] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:36:55] (CR) CI reject: [V:-1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis) [17:37:05] (PS1) Hashar: cloudlb: allow inline data in Object Storage content page [puppet] - https://gerrit.wikimedia.org/r/1160941 (https://phabricator.wikimedia.org/T397351) [17:37:05] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:37:31] !log 
swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:37:41] (Abandoned) Majavah: hieradata: Fix Cloud VPS radosgw image CSP [puppet] - https://gerrit.wikimedia.org/r/1160934 (https://phabricator.wikimedia.org/T397351) (owner: Majavah) [17:38:25] (CR) Majavah: [C:+2] cloudlb: remove erroneous CSP policy [puppet] - https://gerrit.wikimedia.org/r/1160935 (https://phabricator.wikimedia.org/T397351) (owner: Hashar) [17:38:36] (CR) Majavah: [C:+2] cloudlb: allow inline data in Object Storage content page [puppet] - https://gerrit.wikimedia.org/r/1160941 (https://phabricator.wikimedia.org/T397351) (owner: Hashar) [17:39:31] !log migrated all shellbox instances to bookworm-based httpd images in codfw - T378128 [17:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:35] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:39:39] (CR) Eevans: [C:+1] thanos: add new backends, remove old ones gone from rings [puppet] - https://gerrit.wikimedia.org/r/1160855 (https://phabricator.wikimedia.org/T391352) (owner: MVernon) [17:39:55] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:40:09] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10929229 (Andrew) [17:40:26] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:40:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:42:21] (PS2) Ladsgroup: conftool: Allow es6 and es7 being set to read only via dbctl [puppet] - https://gerrit.wikimedia.org/r/1153723 (https://phabricator.wikimedia.org/T395696) [17:42:31] (CR) Ladsgroup: [V:+2 C:+2] conftool: Allow es6 and es7 being set to read only via dbctl [puppet] - https://gerrit.wikimedia.org/r/1153723 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup) [17:43:35] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:43:49] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:44:00] (PS2) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [17:44:14] (CR) Eevans: [C:+1] thanos: add new nodes to ring, drain old ones [puppet] - https://gerrit.wikimedia.org/r/1160856 (https://phabricator.wikimedia.org/T392908) (owner: MVernon) [17:44:47] (PS3) Btullis: Prepare for renaming kafka-stretch200[1-2] to dse-k8s-worker200[1-2] [puppet] - https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789) [17:46:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T396130)', diff saved to https://phabricator.wikimedia.org/P78378 and previous config saved to /var/cache/conftool/dbconfig/20250618-174632-marostegui.json [17:46:37] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:46:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [17:48:58] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:49:16] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:49:16] SRE, Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10929269 (BCornwall) @siebrand was able to disable dnssec - once that's propagated we should hopefully be golden. [17:49:38] (PS3) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [17:49:52] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:50:06] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:50:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:50:28] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:50:32] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10929271 (Jhancock.wm) a:akosiaris [17:51:00] (CR) CI reject: [V:-1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis) [17:51:18] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10929274 (Jhancock.wm) @akosiaris can you add these two servers to site.pp for me please? i saw they're already covered in preseed. 
Should be able to hand these over to you pretty quic... [17:51:43] (PS4) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [17:52:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db2155 for queries (T385167)', diff saved to https://phabricator.wikimedia.org/P78379 and previous config saved to /var/cache/conftool/dbconfig/20250618-175206-ladsgroup.json [17:52:12] T385167: Run data migration script for file migration - https://phabricator.wikimedia.org/T385167 [17:53:19] (CR) CI reject: [V:-1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis) [17:53:27] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:54:11] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Running queries (T385167) [17:54:12] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:54:44] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:54:59] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:55:30] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:55:45] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [17:56:16] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:56:18] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:56:50] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [17:57:13] !log swfrench@deploy1003 helmfile [eqiad] 
DONE helmfile.d/services/shellbox-timeline: apply [17:57:44] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:57:45] (PS5) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [17:58:15] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:58:53] !log migrated all shellbox instances to bookworm-based httpd images in eqiad - T378128 [17:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:58] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:59:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.04%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [17:59:12] (CR) CI reject: [V:-1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis) [17:59:40] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1048.eqiad.wmnet [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1800) [18:00:22] o/ [18:00:25] (PS6) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [18:00:27] nothing for this window, afaik. 
[18:03:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:03:37] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:03:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:04:09] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:04:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.16%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [18:05:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [18:07:25] jhancock@cumin1003 provision (PID 2144116) is awaiting input [18:07:35] (03PS5) 10Ebernhardson: cirrus: Add services for read operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) [18:07:35] (03PS6) 10Ebernhardson: Use discovery dns for elasticsearch read traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) [18:07:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [18:08:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [18:13:00] !log ladsgroup@deploy1003 Started scap sync-world: Deploy arclamp [18:13:22] !log ladsgroup@deploy1003 sync-world aborted: Deploy arclamp (duration: 00m 33s) [18:14:10] !log ladsgroup@deploy1003 Started deploy [performance/arc-lamp@76afb89]: Deploy arclamp [18:14:19] !log ladsgroup@deploy1003 Finished deploy [performance/arc-lamp@76afb89]: Deploy arclamp (duration: 00m 08s) [18:15:24] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:16:10] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:16:55] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:17:06] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:17:22] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:17:50] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:17:52] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:18:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:19:47] (03CR) 10Bking: [C:03+1] Prepare for renaming kafka-stretch200[1-2] to 
dse-k8s-worker200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789) (owner: 10Btullis) [18:20:34] jhancock@cumin1003 provision (PID 2144596) is awaiting input [18:22:19] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:23:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1251.eqiad.wmnet with reason: Maintenance [18:23:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T396130)', diff saved to https://phabricator.wikimedia.org/P78381 and previous config saved to /var/cache/conftool/dbconfig/20250618-182313-marostegui.json [18:23:15] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:23:19] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [18:24:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:24:15] jouncebot: nowandnext [18:24:16] For the next 1 hour(s) and 35 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1800) [18:24:16] In 1 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T2000) [18:24:53] okay, deploying something then [18:25:58] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:26:14] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:26:21] !log jhancock@cumin1003 END (PASS) - 
Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:26:22] yeah, all yours Amir1. [18:26:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:26:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:26:42] Thanks! [18:26:58] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:27:05] (03CR) 10Ladsgroup: [C:03+2] etcd: Remove ES clusters from "write clusters" if section is RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup) [18:27:29] \o/ [18:27:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:27:55] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:28:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup) [18:28:19] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:28:37] (03Merged) 10jenkins-bot: etcd: Remove ES clusters from "write clusters" if section is RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup) [18:28:59] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1152853|etcd: Remove ES 
clusters from "write clusters" if section is RO (T395696)]] [18:29:03] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696 [18:31:14] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1152853|etcd: Remove ES clusters from "write clusters" if section is RO (T395696)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:38:38] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-eqsin and A:cp - 9.2.10 upgrade (T390912) [18:38:42] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [18:38:52] (03CR) 10BCornwall: [V:04-1 C:04-1] "Presently this fails varnish tests. Comments inline!" [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins) [18:43:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Testing T395696', diff saved to https://phabricator.wikimedia.org/P78382 and previous config saved to /var/cache/conftool/dbconfig/20250618-184325-ladsgroup.json [18:43:31] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696 [18:45:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T396130)', diff saved to https://phabricator.wikimedia.org/P78383 and previous config saved to /var/cache/conftool/dbconfig/20250618-184538-marostegui.json [18:45:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [18:47:59] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-eqiad and A:cp - 9.2.10 upgrade (T390912) [18:48:04] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [18:49:03] (03PS1) 10Ladsgroup: etcd: Check for array key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160990 
(https://phabricator.wikimedia.org/T395696) [18:49:05] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [18:51:35] (03CR) 10Scott French: [C:03+1] "Good catch!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160990 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup) [18:51:39] (03CR) 10Ladsgroup: [C:03+2] etcd: Check for array key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160990 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup) [18:52:33] (03Merged) 10jenkins-bot: etcd: Check for array key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160990 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup) [18:55:54] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152853|etcd: Remove ES clusters from "write clusters" if section is RO (T395696)]] (duration: 26m 55s) [18:55:59] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696 [18:56:18] !log ryankemper@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on 6 hosts with reason: T395772 hosts not serving production traffic [18:56:22] T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772 [18:57:20] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1160990|etcd: Check for array key (T395696)]] [18:59:42] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1160990|etcd: Check for array key (T395696)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:00:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P78384 and previous config saved to /var/cache/conftool/dbconfig/20250618-190045-marostegui.json [19:03:01] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [19:05:16] !log cdobbins@cumin2002:~$ sudo -i cookbook sre.cdn.roll-upgrade-ats --query 'A:cp-codfw' --task-id T390912 --reason '9.2.10 upgrade' [19:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:22] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [19:05:23] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-codfw and A:cp - 9.2.10 upgrade (T390912) [19:09:59] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160990|etcd: Check for array key (T395696)]] (duration: 12m 39s) [19:10:04] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696 [19:14:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Testing T395696', diff saved to https://phabricator.wikimedia.org/P78385 and previous config saved to /var/cache/conftool/dbconfig/20250618-191440-ladsgroup.json [19:15:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P78386 and previous config saved to /var/cache/conftool/dbconfig/20250618-191553-marostegui.json [19:17:54] (03PS3) 10NMW03: Set category collation to "uca-az" for Azerbaijani projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) [19:19:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) 
(owner: 10NMW03) [19:25:21] !ping [19:25:21] pong [19:27:32] (03CR) 10Ryan Kemper: [C:03+2] wdqs: fork SLOs for wdqs-main and wdqs-scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [19:30:43] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2006.codfw.wmnet with OS bookworm [19:30:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host aux-k8s-worker2006.codfw.wmnet with OS bookworm [19:31:01] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2007.codfw.wmnet with OS bookworm [19:31:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T396130)', diff saved to https://phabricator.wikimedia.org/P78387 and previous config saved to /var/cache/conftool/dbconfig/20250618-193101-marostegui.json [19:31:07] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host aux-k8s-worker2007.codfw.wmnet with OS bookworm [19:31:08] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [19:31:15] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2008.codfw.wmnet with OS bookworm [19:31:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [19:31:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929510 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host aux-k8s-worker2008.codfw.wmnet with OS bookworm [19:31:31] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2009.codfw.wmnet with OS bookworm [19:31:37] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929511 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host aux-k8s-worker2009.codfw.wmnet with OS bookworm [19:32:45] !log T393966 Ran puppet on `titan1001` following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155335. Puppet looks happy and I see the new recording rules getting created [19:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:49] T393966: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966 [19:40:24] FIRING: SLOMetricAbsent: wdqs-scholarly-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:43:27] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage [19:43:33] (03PS1) 10Ryan Kemper: wdqs: absent old availability metric [puppet] - 10https://gerrit.wikimedia.org/r/1161024 (https://phabricator.wikimedia.org/T393966) [19:43:34] (03PS1) 10Ryan Kemper: wdqs: remove previously-absented slo [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) [19:43:41] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage [19:43:59] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage [19:44:08] !log jhancock@cumin1003 START 
- Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage [19:45:24] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:47:16] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage [19:47:33] (03CR) 10Herron: [C:03+1] wdqs: remove previously-absented slo [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [19:50:50] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage [19:51:22] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove previously-absented slo [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [19:51:34] (03CR) 10Ryan Kemper: [C:03+1] "oops, meant to +2 other patch first" [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [19:51:47] (03CR) 10Ryan Kemper: [C:03+2] wdqs: absent old availability metric [puppet] - 10https://gerrit.wikimedia.org/r/1161024 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [19:53:33] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10929587 (10RKemper) [19:54:41] !log dancy@deploy1003 Installing scap version "4.179.0" for 2 host(s) [19:54:52] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage [19:55:13] PROBLEM - MD RAID on logstash2035 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:55:14] ACKNOWLEDGEMENT - MD RAID on logstash2035 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T397366 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:55:24] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage [19:55:26] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on logstash2035 - https://phabricator.wikimedia.org/T397366 (10ops-monitoring-bot) 03NEW [19:55:49] PROBLEM - OpenSearch health check for shards on 9200 on logstash2035 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:56:30] !log dancy@deploy1003 Installation of scap version "4.179.0" completed for 2 hosts [19:57:38] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10929603 (10RKemper) New SLOs/SLIs are in place and old ones have been fully absented. Agreed with @elukey that we should get the SLOs officially approved (&... [19:58:41] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#10929605 (10RKemper) [19:59:45] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T2000). [20:00:06] kimberly_sarabia, ebernhardson, and Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] here [20:00:15] o/ [20:00:17] hey [20:00:48] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#10929623 (10RKemper) [20:01:18] i suppose i can do the deploy [20:02:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 (owner: 10Kimberly Sarabia) [20:02:19] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:02:52] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2005-dev [20:03:02] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2005-dev [20:03:06] ty [20:03:07] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2006-dev [20:03:09] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:03:18] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2006-dev [20:03:25] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2007-dev [20:03:33] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove previously-absented slo [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [20:03:34] !log 
jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2007-dev [20:03:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:03:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2006.codfw.wmnet with OS bookworm [20:03:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host aux-k8s-worker2006.codfw.wmnet with OS bookworm comple... [20:03:58] (03Merged) 10jenkins-bot: Revert "Enable new mobile search experience everywhere (not including empty search recommendations)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 (owner: 10Kimberly Sarabia) [20:04:22] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1160858|Revert "Enable new mobile search experience everywhere (not including empty search recommendations)"]] [20:06:24] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:06:35] !log ebernhardson@deploy1003 ebernhardson, ksarabia: Backport for [[gerrit:1160858|Revert "Enable new mobile search experience everywhere (not including empty search recommendations)"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:07:14] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:07:15] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2009.codfw.wmnet with OS bookworm [20:07:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host aux-k8s-worker2009.codfw.wmnet with OS bookworm comple... [20:07:25] kimberly_sarabia: already it's up on test servers, can you verify? [20:08:03] ebernhardson: I see the revert. thank you LGTM [20:08:14] !log ebernhardson@deploy1003 ebernhardson, ksarabia: Continuing with sync [20:08:17] alright, continuing [20:10:42] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:11:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:11:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2007.codfw.wmnet with OS bookworm [20:11:09] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:11:12] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929649 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host aux-k8s-worker2007.codfw.wmnet with OS bookworm comple... 
[20:11:46] ebernhardson: Can I squeeze in a scap update before the next deployment? It will take about 2 minutes. [20:11:55] dancy: yea should be ok [20:11:58] thx [20:12:14] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:13:11] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:13:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2008.codfw.wmnet with OS bookworm [20:13:17] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host aux-k8s-worker2008.codfw.wmnet with OS bookworm comple... [20:13:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929662 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [20:13:46] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929666 (10Jhancock.wm) @akosiaris this one is complete [20:13:54] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [20:14:33] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [20:15:07] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160858|Revert "Enable new mobile search experience everywhere (not including empty search recommendations)"]] (duration: 10m 45s) [20:16:00] dancy: alright you're up [20:18:20] thx [20:18:30] !log dancy@deploy1003 Installing scap version "4.179.1" for 2 host(s) [20:20:19] !log dancy@deploy1003 Installation of scap version "4.179.1" completed for 2 hosts [20:20:42] ebernhardson: Done! [20:21:02] awesome [20:21:31] dancy: hmm, it's acting a little odd [20:21:51] oh never mind, im looking at wrong thing :P [20:22:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:22:40] dancy: hmm, no it's being weird. I tell it patch 838270, and it tells me about patch 838182 [20:22:50] Taking a look [20:23:31] It's mentioning 838182 due to the Depends-On: Ie6dfb586f6b22867a13b8b29d920da8409e94015 in 838270 [20:23:56] it doesn't like cross-repo depends-on? The patch is merged [20:24:08] i suppose i can remove that from the commit message [20:24:20] you can just answer 'y' to the question if you want to proceed. [20:24:26] It's just a warning. [20:24:40] ahh, i was worried it would do something awkward since it's talking about not finding 'production' in wikiversions [20:24:42] If everything is correct (e.g., the depended-on patch is merged and working), it's ok [20:24:54] The message definitely needs improvement. 
[20:25:09] (e.g, it should mention that it's talking about a dependency of one of the changes you supplied) [20:25:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:25:33] ok sounds good, thanks for the clarification [20:26:27] (03Merged) 10jenkins-bot: cirrus: Add services for read operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:26:48] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]] [20:26:53] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [20:27:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.109s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:29:07] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:31:11] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-codfw and A:cp - 9.2.10 upgrade (T390912) [20:31:15] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [20:31:17] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [20:31:41] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10929741 (10Jhancock.wm) actually, i think that would have been it. i usually only get that error when it's not in the site.pp file. my bad [20:36:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.61s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:37:21] !log gerrit: deleted a bunch of obsolete references under `refs/users/*` across all repositories.
See T397317 (private) [20:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:59] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]] (duration: 11m 11s) [20:38:04] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [20:38:16] Nemoralis: you're up next [20:38:23] i am here [20:38:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) (owner: 10NMW03) [20:39:20] by the way, you need to run a maintenance script for my patch [20:39:24] (03Merged) 10jenkins-bot: Set category collation to "uca-az" for Azerbaijani projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) (owner: 10NMW03) [20:39:32] https://www.mediawiki.org/wiki/Manual:UpdateCollation.php [20:39:48] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1153722|Set category collation to "uca-az" for Azerbaijani projects (T395896)]] [20:39:54] T395896: Set category collation for Azerbaijani projects - https://phabricator.wikimedia.org/T395896 [20:41:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.61s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:42:06] !log ebernhardson@deploy1003 nmw03, ebernhardson: Backport for [[gerrit:1153722|Set category collation to "uca-az" for Azerbaijani projects (T395896)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:42:24] Nemoralis: can you verify? [20:42:55] sure, but I believe this will work after maintenance script [20:43:46] !log volans@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Release v0.10.1 - volans@cumin2002 [20:43:53] ok, continue with sync i suppose? [20:44:02] yep [20:44:05] !log ebernhardson@deploy1003 nmw03, ebernhardson: Continuing with sync [20:44:25] you will need to run updateCollation for 4 wikis [20:44:35] !log volans@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Release v0.10.1 - volans@cumin2002 [20:44:53] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.* [20:45:33] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:46:00] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:46:33] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:48:32] (03PS1) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) [20:49:41] (03PS2) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) [20:50:55] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153722|Set category collation to "uca-az" for Azerbaijani projects (T395896)]] (duration: 11m 06s) [20:51:00] T395896: Set category 
collation for Azerbaijani projects - https://phabricator.wikimedia.org/T395896 [20:51:53] (03CR) 10CI reject: [V:04-1] phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [20:52:08] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1161042/6014/phab1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [20:52:16] !log running updateCollation.php for azwikibooks, azwikiquote, azwikisource, and azwiktionary [20:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:53:53] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudcephosd200[567]-dev service implementation - https://phabricator.wikimedia.org/T397237#10929828 (10Andrew) [20:54:06] jhancock@cumin1003 provision (PID 2167210) is awaiting input [20:54:16] (03Merged) 10jenkins-bot: Use discovery dns for elasticsearch read traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:54:36] jhancock@cumin1003 provision (PID 2167233) is awaiting input [20:54:38] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]] [20:54:45] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [20:56:52] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:57:17] jhancock@cumin1003 provision (PID 2167269) is awaiting input [20:57:56] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [20:58:44] (03PS3) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) [20:59:25] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:59:26] Nemoralis: maint script is complete on the 4 wikis [20:59:27] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:59:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:59:59] !log updateCollation.php for azwikibooks, azwikiquote, azwikisource, and azwiktionary completed [21:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] (03PS4) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T2100) [21:00:15] ebernhardson: just tested, works fine [21:00:17] thanks! [21:00:22] awesome! 
[21:01:07] If all deployments are done I'm going to update scap one more time [21:01:16] it's still shipping one more [21:01:19] ok [21:01:19] but almost done [21:02:48] (03PS5) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) [21:03:10] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [21:03:57] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1161042/6016/phab1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:04:32] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:52] (03CR) 10Dzahn: [V:03+1 C:03+2] "this role is applied only on the "next phab" machine. used for DB upgrade test and PHP8 test." [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:04:52] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]] (duration: 10m 14s) [21:04:57] T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553 [21:05:12] dancy: all yours now [21:05:16] Thanks!
[21:05:25] !log dancy@deploy1003 Installing scap version "4.180.0" for 2 host(s) [21:05:40] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:06:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:06:50] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:14] !log dancy@deploy1003 Installation of scap version "4.180.0" completed for 2 hosts [21:07:29] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:12:21] (03CR) 10Bking: [C:03+1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [21:13:03] (03CR) 10Ahmon Dancy: "This should wait until https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/827 is done and the latest scap is deployed to beta" [puppet] - 10https://gerrit.wikimedia.org/r/1155318 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [21:17:08] jhancock@cumin1003 provision (PID 2169849) is awaiting input [21:17:37] jhancock@cumin1003 provision (PID 2169899) is awaiting input [21:18:16] jhancock@cumin1003 provision (PID 2169942) is awaiting input [21:19:02] (03PS1) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) [21:19:58] (03PS2) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) [21:22:24] !log jhancock@cumin1003 END (PASS) 
- Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:22:25] (03CR) 10CI reject: [V:04-1] Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [21:22:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:22:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:23:55] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005.codfw.wmnet with OS bullseye [21:24:09] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10929935 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2005.codfw.wmnet with OS bul... 
[21:27:28] (03PS1) 10Dzahn: phabricator::migration: puppetize password for testdb in script-vars [puppet] - 10https://gerrit.wikimedia.org/r/1161048 (https://phabricator.wikimedia.org/T390034) [21:28:55] (03PS2) 10Dzahn: phabricator::migration: puppetize password for testdb in script-vars [puppet] - 10https://gerrit.wikimedia.org/r/1161048 (https://phabricator.wikimedia.org/T390034) [21:29:13] (03PS16) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [21:30:49] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:31:50] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:32:56] (03PS1) 10Clare Ming: xLab: Deploy v0.7.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161050 (https://phabricator.wikimedia.org/T372952) [21:34:19] (03PS1) 10Dzahn: add fake password for phab test db admin user [labs/private] - 10https://gerrit.wikimedia.org/r/1161051 (https://phabricator.wikimedia.org/T377889) [21:34:43] (03CR) 10Dzahn: [V:03+2 C:03+2] add fake password for phab test db admin user [labs/private] - 10https://gerrit.wikimedia.org/r/1161051 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:34:55] (03PS1) 10Clare Ming: xLab: Deploy v0.7.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161052 (https://phabricator.wikimedia.org/T372952) [21:35:07] (03PS2) 10Dzahn: add fake password for phab test db admin user [labs/private] - 10https://gerrit.wikimedia.org/r/1161051 (https://phabricator.wikimedia.org/T377889) [21:35:17] (03CR) 10Dzahn: [V:03+2] add fake password for phab test db admin user [labs/private] - 10https://gerrit.wikimedia.org/r/1161051 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:36:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit 
failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:36:17] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/labs/private/+/1161051" [puppet] - 10https://gerrit.wikimedia.org/r/1161048 (https://phabricator.wikimedia.org/T390034) (owner: 10Dzahn) [21:36:34] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.* [21:36:43] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1161048/6018/phab1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1161048 (https://phabricator.wikimedia.org/T390034) (owner: 10Dzahn) [21:37:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.291s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:39:52] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3072.* [21:39:54] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3072.* [21:40:07] !log Depooling cp3072 to upgrade bios [21:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:44] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp3072.esams.wmnet with reason: BIOS upgrades [21:42:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.291s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:45:11] (03PS1) 10Dzahn: phabricator::migration: fix variable name used for testdb storage pass [puppet] - 10https://gerrit.wikimedia.org/r/1161053 (https://phabricator.wikimedia.org/T377889) [21:45:41] (03CR) 10Dzahn: [C:03+2] phabricator::migration: fix variable name used for testdb storage pass [puppet] - 10https://gerrit.wikimedia.org/r/1161053 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:46:03] (03CR) 10Dzahn: [V:03+2 C:03+2] phabricator::migration: fix variable name used for testdb storage pass [puppet] - 10https://gerrit.wikimedia.org/r/1161053 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:46:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:46:53] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10930043 (10Jclark-ctr) @Marostegui Looks like the Seed server was delivered Jun 12th to the data center {F62381874} @VRiley-WMF this would be the dell that you placed in the new cage. the P.O o... [21:49:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10930054 (10Jclark-ctr) a:03Jclark-ctr [21:53:39] (03CR) 10Santiago Faci: [C:03+2] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1161050 (https://phabricator.wikimedia.org/T372952) (owner: 10Clare Ming) [21:55:14] (03PS1) 10Dzahn: phabricator::migration: fix /srv/phab symlink, /srv/repos dir [puppet] - 10https://gerrit.wikimedia.org/r/1161059 (https://phabricator.wikimedia.org/T377889) [21:55:28] (03CR) 10Santiago Faci: [C:03+2] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161052 (https://phabricator.wikimedia.org/T372952) (owner: 10Clare Ming) [21:55:51] (03CR) 10Dzahn: [C:03+2] phabricator::migration: fix /srv/phab symlink, /srv/repos dir [puppet] - 10https://gerrit.wikimedia.org/r/1161059 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:56:39] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161050 (https://phabricator.wikimedia.org/T372952) (owner: 10Clare Ming) [21:57:18] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161052 (https://phabricator.wikimedia.org/T372952) (owner: 10Clare Ming) [21:59:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:59:35] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T2200) [22:01:06] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:04:56] !log brett@cumin2002 START - Cookbook 
sre.hosts.remove-downtime for cp3072.esams.wmnet [22:04:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3072.esams.wmnet [22:05:03] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3072.* [22:08:33] PROBLEM - Hadoop NodeManager on an-worker1196 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:09:51] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-codfw and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [22:09:56] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [22:12:09] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:12:30] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: no-op deploy to phab1005 [22:12:37] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: no-op deploy to phab1005 (duration: 00m 07s) [22:14:20] !log brennen@deploy1003 Started deploy [phabricator/deployment@6af4bb7]: merge-phorge-2024.35 deploy to phab1005 (T390034) [22:14:24] T390034: Prepare a database test for m3 - https://phabricator.wikimedia.org/T390034 [22:14:46] !log brennen@deploy1003 Finished deploy [phabricator/deployment@6af4bb7]: merge-phorge-2024.35 deploy to phab1005 (T390034) (duration: 00m 26s) [22:19:26] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [22:19:51] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [22:23:29] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:25:27] Hi team [22:25:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:25:39] The rename account task seems to be stuck [22:26:33] RECOVERY - Hadoop NodeManager on an-worker1196 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:27:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:27:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/0 (Peering: DE-CIX (DXDB:NAS:173434 MAC filter) {#D0067}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:44:06] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:47:32] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for aux-k8s-worker100[6-9] - jclark@cumin1002" [22:47:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for aux-k8s-worker100[6-9] - jclark@cumin1002" [22:47:36] !log jclark@cumin1002 END (PASS) - Cookbook
sre.dns.netbox (exit_code=0) [22:51:28] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on ms-fe1016:9290 - https://phabricator.wikimedia.org/T397261#10930169 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable [22:52:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:52:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/0 (Peering: DE-CIX (DXDB:NAS:173434 MAC filter) {#D0067}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:52:54] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker1006 [22:54:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker1006 [22:54:05] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker1007 [22:55:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker1007 [22:55:18] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker1008 [22:56:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker1008 [22:56:42] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker1009 [22:57:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10930176 (10Jclark-ctr) [22:57:59] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker1009 [22:58:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (2001:7f8:36::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:02:54] RESOLVED: TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (2001:7f8:36::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDo [23:18:29] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:23:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:24:03] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:24:03] PROBLEM - PyBal backends health check on lvs1020 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:25:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:26:15] !incidents [23:26:16] 6365 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [23:26:30] !ack 6365 [23:26:30] 6365 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [23:27:03] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:27:03] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:27:14] that bodes well [23:28:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161074 [23:38:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161074 (owner: 10TrainBranchBot) [23:45:24] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:50:27] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:50:43] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:51:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161074 (owner: 10TrainBranchBot) [23:57:17] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:57:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.380 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring