[00:03:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:29] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1160339
[00:08:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1160339 (owner: TrainBranchBot)
[00:12:21] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp30[66-81].esams.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581)
[00:12:26] <stashbot>	 T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581
[00:14:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P78270 and previous config saved to /var/cache/conftool/dbconfig/20250618-001408-ladsgroup.json
[00:18:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:26:52] <wikibugs>	 (Merged) jenkins-bot: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:27:31] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1154140|multiversion: Re-use prod for beta setSiteInfoForWiki (T289318)]]
[00:27:36] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:29:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P78271 and previous config saved to /var/cache/conftool/dbconfig/20250618-002915-ladsgroup.json
[00:29:22] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1160339 (owner: TrainBranchBot)
[00:29:46] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1154140|multiversion: Re-use prod for beta setSiteInfoForWiki (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:30:49] <logmsgbot>	 !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on durum7003.magru.wmnet with reason: insetup host; will resolve service errors later
[00:33:56] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[00:39:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:40:54] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154140|multiversion: Re-use prod for beta setSiteInfoForWiki (T289318)]] (duration: 13m 23s)
[00:40:59] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:44:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T382778)', diff saved to https://phabricator.wikimedia.org/P78272 and previous config saved to /var/cache/conftool/dbconfig/20250618-004423-ladsgroup.json
[00:44:27] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2195.codfw.wmnet with reason: Maintenance
[00:44:28] <stashbot>	 T382778: Optimize text table - https://phabricator.wikimedia.org/T382778
[00:44:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T382778)', diff saved to https://phabricator.wikimedia.org/P78273 and previous config saved to /var/cache/conftool/dbconfig/20250618-004434-ladsgroup.json
[00:47:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T382778)', diff saved to https://phabricator.wikimedia.org/P78274 and previous config saved to /var/cache/conftool/dbconfig/20250618-004745-ladsgroup.json
[00:51:06] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:52:15] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:02:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P78275 and previous config saved to /var/cache/conftool/dbconfig/20250618-010253-ladsgroup.json
[01:18:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P78276 and previous config saved to /var/cache/conftool/dbconfig/20250618-011800-ladsgroup.json
[01:33:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T382778)', diff saved to https://phabricator.wikimedia.org/P78277 and previous config saved to /var/cache/conftool/dbconfig/20250618-013307-ladsgroup.json
[01:33:12] <stashbot>	 T382778: Optimize text table - https://phabricator.wikimedia.org/T382778
[01:33:23] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2198.codfw.wmnet with reason: Maintenance
[01:39:27] <wikibugs>	 (PS1) Krinkle: varnish: Set X-Analytics `ismobile=1` for mobile requests [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924)
[02:03:50] <wikibugs>	 (PS2) Krinkle: varnish: Set X-Analytics `ismobile=1` for mobile requests [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924)
[02:03:52] <wikibugs>	 (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: Krinkle)
[02:30:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[02:35:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[03:31:12] <wikibugs>	 (CR) Krinkle: "Deployed in Beta Cluster." [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: Krinkle)
[03:35:32] <wikibugs>	 SRE, Domains, Traffic, Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10926409 (Krinkle)
[03:58:35] <wikibugs>	 (PS1) Krinkle: beta: Remove unused beta-specific "w.beta.wmcloud.org" vhost [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012)
[04:00:15] <wikibugs>	 (PS2) Krinkle: beta: Remove unused beta-specific "w.beta.wmcloud.org" vhost [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012)
[04:00:17] <wikibugs>	 (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) (owner: Krinkle)
[04:00:34] <wikibugs>	 (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) (owner: Krinkle)
[04:03:18] <wikibugs>	 (CR) Tim Starling: [C:+1] varnish: Set X-Analytics `ismobile=1` for mobile requests [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: Krinkle)
[04:03:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:04:08] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts cirrussearch[2055-2060].codfw.wmnet
[04:09:19] <logmsgbot>	 ryankemper@cumin2002 decommission (PID 2013349) is awaiting input
[04:18:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:22:09] <logmsgbot>	 ryankemper@cumin2002 decommission (PID 2013349) is awaiting input
[04:34:39] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[04:39:00] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[2055-2060].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002"
[04:39:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:40:01] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[2055-2060].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002"
[04:40:02] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:40:03] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[2055-2060].codfw.wmnet
[04:44:01] <ryankemper>	 !log [WDQS] Restarted blazegraph on `wdqs2009` just in case it's locked up
[04:44:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:57:21] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2203.codfw.wmnet with reason: Maintenance
[04:57:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1201 with weight 0 T397198', diff saved to https://phabricator.wikimedia.org/P78278 and previous config saved to /var/cache/conftool/dbconfig/20250618-045741-marostegui.json
[04:57:46] <stashbot>	 T397198: Switchover s6 master (db1173 -> db1201) - https://phabricator.wikimedia.org/T397198
[04:57:50] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s6 T397198
[04:58:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1201 from API/vslow/dump T397198', diff saved to https://phabricator.wikimedia.org/P78279 and previous config saved to /var/cache/conftool/dbconfig/20250618-045821-marostegui.json
[04:58:47] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1201 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1160155 (https://phabricator.wikimedia.org/T397198) (owner: Gerrit maintenance bot)
[05:04:31] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:38] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:07:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:08:38] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:09:28] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:09:28] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54082 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:09:44] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:16:02] <icinga-wm>	 PROBLEM - mailman3_queue_size on lists1004 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 33 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:18:01] <marostegui>	 !log Starting s6 eqiad failover from db1173 to db1201 - T397198
[05:18:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:06] <stashbot>	 T397198: Switchover s6 master (db1173 -> db1201) - https://phabricator.wikimedia.org/T397198
[05:18:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T397198', diff saved to https://phabricator.wikimedia.org/P78281 and previous config saved to /var/cache/conftool/dbconfig/20250618-051812-root.json
[05:18:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1201 to s6 primary and set section read-write T397198', diff saved to https://phabricator.wikimedia.org/P78282 and previous config saved to /var/cache/conftool/dbconfig/20250618-051836-root.json
[05:18:39] <stashbot>	 marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[05:18:58] <wikibugs>	 (PS2) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1160156 (https://phabricator.wikimedia.org/T397198)
[05:19:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1173 T397198', diff saved to https://phabricator.wikimedia.org/P78283 and previous config saved to /var/cache/conftool/dbconfig/20250618-051935-root.json
[05:19:50] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1160156 (https://phabricator.wikimedia.org/T397198) (owner: Gerrit maintenance bot)
[05:19:53] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[05:20:47] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[05:21:32] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[05:22:43] <wikibugs>	 (PS1) Marostegui: db1173: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160465 (https://phabricator.wikimedia.org/T395989)
[05:23:32] <wikibugs>	 (CR) Marostegui: [C:+2] db1173: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160465 (https://phabricator.wikimedia.org/T395989) (owner: Marostegui)
[05:26:38] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[05:26:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T396130)', diff saved to https://phabricator.wikimedia.org/P78284 and previous config saved to /var/cache/conftool/dbconfig/20250618-052645-marostegui.json
[05:26:50] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:30:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78285 and previous config saved to /var/cache/conftool/dbconfig/20250618-053038-root.json
[05:32:45] <wikibugs>	 (PS1) Marostegui: db1188: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160466 (https://phabricator.wikimedia.org/T396549)
[05:32:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1188', diff saved to https://phabricator.wikimedia.org/P78286 and previous config saved to /var/cache/conftool/dbconfig/20250618-053253-root.json
[05:33:14] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[05:33:37] <wikibugs>	 (CR) Marostegui: [C:+2] db1188: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160466 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[05:34:07] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160128 (https://phabricator.wikimedia.org/T395031) (owner: KartikMistry)
[05:38:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78287 and previous config saved to /var/cache/conftool/dbconfig/20250618-053858-root.json
[05:42:37] <wikibugs>	 (PS1) Marostegui: db2160: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160470 (https://phabricator.wikimedia.org/T397161)
[05:43:09] <wikibugs>	 (CR) Arnaudb: [C:+1] gitlab-runner: upgrade default image to bookworm [puppet] - https://gerrit.wikimedia.org/r/1160119 (https://phabricator.wikimedia.org/T384595) (owner: Jelto)
[05:43:27] <wikibugs>	 (CR) Arnaudb: [C:+1] gitlab-runner: upgrade default image to bookworm on Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/1160120 (https://phabricator.wikimedia.org/T384595) (owner: Jelto)
[05:45:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78288 and previous config saved to /var/cache/conftool/dbconfig/20250618-054543-root.json
[05:47:06] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2160.codfw.wmnet with reason: Maintenance
[05:47:26] <wikibugs>	 (CR) Marostegui: [C:+2] db2160: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160470 (https://phabricator.wikimedia.org/T397161) (owner: Marostegui)
[05:50:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T396130)', diff saved to https://phabricator.wikimedia.org/P78289 and previous config saved to /var/cache/conftool/dbconfig/20250618-055023-marostegui.json
[05:50:28] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:51:54] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.06.13 - 2025.07.04): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10926519 (Stevemunene) a:Stevemunene
[05:54:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P78290 and previous config saved to /var/cache/conftool/dbconfig/20250618-055404-root.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0600)
[06:00:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78291 and previous config saved to /var/cache/conftool/dbconfig/20250618-060049-root.json
[06:05:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P78292 and previous config saved to /var/cache/conftool/dbconfig/20250618-060531-marostegui.json
[06:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:09:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P78293 and previous config saved to /var/cache/conftool/dbconfig/20250618-060910-root.json
[06:10:14] <wikibugs>	 (PS1) Phuedx: ext.wikimediaEvents: Repurpose PageVisit instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1160475 (https://phabricator.wikimedia.org/T397138)
[06:10:56] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1160475 (https://phabricator.wikimedia.org/T397138) (owner: Phuedx)
[06:12:22] <wikibugs>	 (PS1) Giuseppe Lavagetto: requestctl: switch CLI from native client to API client [puppet] - https://gerrit.wikimedia.org/r/1160476
[06:14:36] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add stub api tokens for hiddenparma [labs/private] - https://gerrit.wikimedia.org/r/1160477
[06:14:37] <wikibugs>	 (CR) CI reject: [V:-1] requestctl: switch CLI from native client to API client [puppet] - https://gerrit.wikimedia.org/r/1160476 (owner: Giuseppe Lavagetto)
[06:15:09] <phuedx>	 I will be a few minutes late for this morning's backport window but I will be there :)
[06:15:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78294 and previous config saved to /var/cache/conftool/dbconfig/20250618-061555-root.json
[06:20:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P78295 and previous config saved to /var/cache/conftool/dbconfig/20250618-062038-marostegui.json
[06:24:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78296 and previous config saved to /var/cache/conftool/dbconfig/20250618-062416-root.json
[06:35:34] <wikibugs>	 (CR) Jelto: "looks mostly good, as mentioned before you should bump the chart version." [deployment-charts] - https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[06:35:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T396130)', diff saved to https://phabricator.wikimedia.org/P78297 and previous config saved to /var/cache/conftool/dbconfig/20250618-063546-marostegui.json
[06:35:51] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:36:01] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[06:36:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T396130)', diff saved to https://phabricator.wikimedia.org/P78298 and previous config saved to /var/cache/conftool/dbconfig/20250618-063608-marostegui.json
[06:39:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78299 and previous config saved to /var/cache/conftool/dbconfig/20250618-063921-root.json
[06:46:34] <wikibugs>	 (PS5) Stevemunene: zookeeper: remove an-conf100[1-3] from the cluster [puppet] - https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922)
[06:53:13] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add cumin1003 as mysql root client [puppet] - https://gerrit.wikimedia.org/r/1145085 (https://phabricator.wikimedia.org/T393990) (owner: Muehlenhoff)
[06:56:29] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: derive AIRFLOW_APPOWNER from the user principal in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160115 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol)
[06:56:32] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-dev: increase the memory limits of the webserver in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160132 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol)
[06:59:09] <wikibugs>	 (Merged) jenkins-bot: airflow: derive AIRFLOW_APPOWNER from the user principal in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160115 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol)
[06:59:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[06:59:10] <wikibugs>	 (Merged) jenkins-bot: airflow-dev: increase the memory limits of the webserver in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160132 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol)
[06:59:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T396130)', diff saved to https://phabricator.wikimedia.org/P78300 and previous config saved to /var/cache/conftool/dbconfig/20250618-065936-marostegui.json
[06:59:41] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0700).
[07:00:05] <jouncebot>	 georgekyz, kart_, and phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:41] <kart_>	 here
[07:01:10] <kart_>	 georgekyz: deploying yourself?
[07:01:28] <wikibugs>	 (PS1) Marostegui: db1156: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160628 (https://phabricator.wikimedia.org/T396549)
[07:01:43] <georgekyz>	 Yeap, I will start it in the following minutes
[07:02:06] <kart_>	 cool. Let me know when done.
[07:02:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1156', diff saved to https://phabricator.wikimedia.org/P78301 and previous config saved to /var/cache/conftool/dbconfig/20250618-070239-root.json
[07:02:48] <kart_>	 urbanecm: around? Can you review my patch meanwhile? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1160128/
[07:03:07] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[1155-1156].eqiad.wmnet with reason: Maintenance
[07:03:20] <wikibugs>	 (CR) Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for the third batch of wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[07:03:27] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 10 hosts with reason: Maintenance
[07:03:57] <wikibugs>	 (CR) Marostegui: [C:+2] db1156: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160628 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[07:04:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[07:04:31] <georgekyz>	 Starting deployment
[07:04:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[07:05:23] <wikibugs>	 (Merged) jenkins-bot: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[07:06:01] <logmsgbot>	 !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1155652|ores-extension: enable extension with revertrisk filter for the third batch of wikis (T395824)]]
[07:06:05] <stashbot>	 T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis  - https://phabricator.wikimedia.org/T395824
[07:08:24] <logmsgbot>	 !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1155652|ores-extension: enable extension with revertrisk filter for the third batch of wikis (T395824)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:12:12] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-cluster check 44 services in codfw: maintenance
[07:12:13] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.discovery.service-route check 48 services: maintenance
[07:12:14] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check 48 services: maintenance
[07:12:14] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) check 44 services in codfw: maintenance
[07:12:15] <phuedx>	 Hello o/
[07:12:17] <phuedx>	 I'm back
[07:12:18] <wikibugs>	 (03CR) 10Jelto: "wow this is nice 🎉 I tried it locally but helmfile fails when installing istio-proxy-settings with:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm)
[07:13:17] <phuedx>	 kart_: You deploying yourself after georgekyz?
[07:13:49] <kart_>	 yes
[07:14:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P78302 and previous config saved to /var/cache/conftool/dbconfig/20250618-071443-marostegui.json
[07:16:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P78303 and previous config saved to /var/cache/conftool/dbconfig/20250618-071634-root.json
[07:17:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Add missing email in user record [puppet] - 10https://gerrit.wikimedia.org/r/1160635 (https://phabricator.wikimedia.org/T397004)
[07:18:35] <kart_>	 georgekyz: Testing?
[07:18:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add missing email in user record [puppet] - 10https://gerrit.wikimedia.org/r/1160635 (https://phabricator.wikimedia.org/T397004) (owner: 10Muehlenhoff)
[07:19:13] <georgekyz>	 Yeap, we are testing; the patch is deploying the ores extension for 9 wikis
[07:19:19] <georgekyz>	 testing is taking some time, apologies
[07:21:58] <kart_>	 No worries, just checking!
[07:22:22] <georgekyz>	 we finished testing, we are going to proceed and sync
[07:22:33] <logmsgbot>	 !log gkyziridis@deploy1003 gkyziridis: Continuing with sync
[07:22:41] <kart_>	 cool
[07:23:02] <wikibugs>	 (03CR) 10JMeybohm: "Right...good catch! We're configuring helmfile with `--kubeconfig` and ofc. the credentials file exists on my machine. It does not contain" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm)
[07:23:57] <wikibugs>	 (03PS1) 10Marostegui: db2231: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160656 (https://phabricator.wikimedia.org/T397279)
[07:24:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2231', diff saved to https://phabricator.wikimedia.org/P78304 and previous config saved to /var/cache/conftool/dbconfig/20250618-072404-root.json
[07:24:32] <logmsgbot>	 !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2231.codfw.wmnet with reason: Maintenance
[07:25:04] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2231.codfw.wmnet with reason: Maintenance
[07:25:15] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2231: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160656 (https://phabricator.wikimedia.org/T397279) (owner: 10Marostegui)
[07:29:36] <logmsgbot>	 !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155652|ores-extension: enable extension with revertrisk filter for the third batch of wikis (T395824)]] (duration: 23m 35s)
[07:29:41] <stashbot>	 T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis  - https://phabricator.wikimedia.org/T395824
[07:29:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P78305 and previous config saved to /var/cache/conftool/dbconfig/20250618-072951-marostegui.json
[07:30:04] <kart_>	 I'll start my patch now..
[07:30:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[07:30:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160128 (https://phabricator.wikimedia.org/T395031) (owner: 10KartikMistry)
[07:31:04] <georgekyz>	 Deployment finished successfully 
[07:31:15] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the Contribute menu on new Wikipedias automatically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160128 (https://phabricator.wikimedia.org/T395031) (owner: 10KartikMistry)
[07:31:37] <kart_>	 georgekyz: \0/
[07:31:37] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1160128|Enable the Contribute menu on new Wikipedias automatically (T395031 T381371)]]
[07:31:39] <georgekyz>	 thnx for your patience 
[07:31:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78306 and previous config saved to /var/cache/conftool/dbconfig/20250618-073140-root.json
[07:31:42] <georgekyz>	 thnx a lot 
[07:31:43] <stashbot>	 T395031: Enable the Contribute menu in 7th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T395031
[07:31:44] <stashbot>	 T381371: Enable the Contribute menu on new Wikipedias automatically - https://phabricator.wikimedia.org/T381371
[07:32:17] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318)
[07:33:33] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:33:53] <logmsgbot>	 !log kartik@deploy1003 kartik: Backport for [[gerrit:1160128|Enable the Contribute menu on new Wikipedias automatically (T395031 T381371)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:33:58] <wikibugs>	 (03CR) 10Volans: "[my 2 cents] I left some general suggestions in the python file, didn't do a full review in detail, leaving the specific logic to the requ" [puppet] - 10https://gerrit.wikimedia.org/r/1160476 (owner: 10Giuseppe Lavagetto)
[07:34:20] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:35:07] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Switch lvs5005 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1160672 (https://phabricator.wikimedia.org/T396561)
[07:35:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[07:35:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78307 and previous config saved to /var/cache/conftool/dbconfig/20250618-073517-root.json
[07:36:28] <logmsgbot>	 !log kartik@deploy1003 kartik: Continuing with sync
[07:36:33] <logmsgbot>	 !log brouberol@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[07:36:45] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10926738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1018...
[07:39:01] <phuedx>	 jouncebot: nowandnext
[07:39:01] <jouncebot>	 For the next 0 hour(s) and 20 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0700)
[07:39:01] <jouncebot>	 In 2 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1000)
[07:40:07] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160672 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[07:41:31] <logmsgbot>	 !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
[07:42:06] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[07:42:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add stub api tokens for hiddenparma [labs/private] - 10https://gerrit.wikimedia.org/r/1160477 (owner: 10Giuseppe Lavagetto)
[07:43:22] <ryankemper>	 !log T386098 Killed the `wdqs-main` reload, it can be started up again on the new cumin later
[07:43:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:26] <stashbot>	 T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[07:43:34] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160128|Enable the Contribute menu on new Wikipedias automatically (T395031 T381371)]] (duration: 11m 56s)
[07:43:40] <stashbot>	 T395031: Enable the Contribute menu in 7th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T395031
[07:43:40] <stashbot>	 T381371: Enable the Contribute menu on new Wikipedias automatically - https://phabricator.wikimedia.org/T381371
[07:44:06] <kart_>	 phuedx: I'm done.
[07:44:13] <phuedx>	 kart_: ACK
[07:44:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T396130)', diff saved to https://phabricator.wikimedia.org/P78308 and previous config saved to /var/cache/conftool/dbconfig/20250618-074459-marostegui.json
[07:45:04] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[07:45:14] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[07:45:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T396130)', diff saved to https://phabricator.wikimedia.org/P78309 and previous config saved to /var/cache/conftool/dbconfig/20250618-074521-marostegui.json
[07:46:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160475 (https://phabricator.wikimedia.org/T397138) (owner: 10Phuedx)
[07:46:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78310 and previous config saved to /var/cache/conftool/dbconfig/20250618-074646-root.json
[07:47:45] <wikibugs>	 (03Merged) 10jenkins-bot: ext.wikimediaEvents: Repurpose PageVisit instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160475 (https://phabricator.wikimedia.org/T397138) (owner: 10Phuedx)
[07:48:12] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1160475|ext.wikimediaEvents: Repurpose PageVisit instrument (T397138)]]
[07:48:16] <stashbot>	 T397138: Run a second synthetic A/A test - https://phabricator.wikimedia.org/T397138
[07:48:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78311 and previous config saved to /var/cache/conftool/dbconfig/20250618-075022-root.json
[07:50:40] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1160475|ext.wikimediaEvents: Repurpose PageVisit instrument (T397138)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:52:12] <wikibugs>	 (03PS5) 10Kosta Harlan: Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil)
[07:52:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.082s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:52:32] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "godspeed!" [puppet] - 10https://gerrit.wikimedia.org/r/1160672 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[07:54:20] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:57:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.082s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:57:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:57:59] <phuedx>	 Took me a little while but I've confirmed the change looks good on testwiki
[07:58:04] <phuedx>	 Continuing
[07:58:07] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Continuing with sync
[08:00:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[08:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-puppet-agent-stats.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78312 and previous config saved to /var/cache/conftool/dbconfig/20250618-080152-root.json
[08:02:23] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2212* slowly with 10 steps - Pooling in
[08:02:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:02:53] <wikibugs>	 (03PS1) 10Vgutierrez: sre.loadbalancer.upgrade: Avoid depooling several LBs at the same time [cookbooks] - 10https://gerrit.wikimedia.org/r/1160680
[08:03:11] <hashar>	 jouncebot: now
[08:03:11] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 56 minute(s)
[08:03:16] <hashar>	 jouncebot: refresh
[08:03:16] <jouncebot>	 I refreshed my knowledge about deployments.
[08:03:18] <hashar>	 lies?
[08:04:05] <hashar>	 isn't it supposed to be the train window right now?
[08:05:07] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160475|ext.wikimediaEvents: Repurpose PageVisit instrument (T397138)]] (duration: 16m 55s)
[08:05:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.03%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[08:05:12] <stashbot>	 T397138: Run a second synthetic A/A test - https://phabricator.wikimedia.org/T397138
[08:05:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78314 and previous config saved to /var/cache/conftool/dbconfig/20250618-080528-root.json
[08:05:31] <phuedx>	 hashar: The window started 1 hour ago :)
[08:06:20] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, a dry-run should be able to confirm you the correct actions are performed (either after merging or with test-cookbook)." [cookbooks] - 10https://gerrit.wikimedia.org/r/1160680 (owner: 10Vgutierrez)
[08:07:15] <hashar>	 phuedx: that is the backport & config one which started an hour ago, isn't it?
[08:07:42] <phuedx>	 hashar: You're right. Sorry. I misread your message :)
[08:07:46] <wikibugs>	 (03CR) 10Vgutierrez: "a DRY-RUN with `--query P{lvs[7001-7002].magru.wmnet}` confirms that admin cookbook is called with just one instance at a time:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1160680 (owner: 10Vgutierrez)
[08:08:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T396130)', diff saved to https://phabricator.wikimedia.org/P78316 and previous config saved to /var/cache/conftool/dbconfig/20250618-080833-marostegui.json
[08:08:34] <phuedx>	 wt:Deployments says it's the UTC-7 version this week? https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1800
[08:08:38] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[08:08:56] <phuedx>	 !log UTC morning backport window finished
[08:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:51] <hashar>	 I guess it got confused somehow
[08:11:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "As stated in the comments to the puppet class, I wasn't requesting a review of this code here, given it is a copy from another repository." [puppet] - 10https://gerrit.wikimedia.org/r/1160476 (owner: 10Giuseppe Lavagetto)
[08:13:27] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[08:16:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78317 and previous config saved to /var/cache/conftool/dbconfig/20250618-081657-root.json
[08:20:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.077s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:20:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78319 and previous config saved to /var/cache/conftool/dbconfig/20250618-082035-root.json
[08:21:11] <moritzm>	 !log rearm keyholder on cumin2002
[08:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:15] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10926885 (10Fabfur) @Jhancock.wm hi, when do you think we could start reimaging these? Is there something we can do in the meantime to help you with this?
[08:21:33] <hashar>	 jouncebot: refresh
[08:21:34] <jouncebot>	 I refreshed my knowledge about deployments.
[08:21:37] <hashar>	 jouncebot: now
[08:21:37] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 38 minute(s)
[08:21:40] <hashar>	 ...
[08:21:53] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10926886 (10elukey) @MatthewVernon @MoritzMuehlenhoff I am planning to do the following:  * log on thanos-fe1004 * sudo su;...
[08:22:18] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202827s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:22:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2001.codfw.wmnet
[08:22:58] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer.upgrade: Avoid depooling several LBs at the same time [cookbooks] - 10https://gerrit.wikimedia.org/r/1160680 (owner: 10Vgutierrez)
[08:23:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P78320 and previous config saved to /var/cache/conftool/dbconfig/20250618-082340-marostegui.json
[08:24:45] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[08:24:46] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5005.eqsin.wmnet} and A:liberica (T396561)
[08:24:52] <stashbot>	 T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561
[08:25:11] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5005.eqsin.wmnet} and A:liberica (T396561)
[08:25:14] <wikibugs>	 (03PS1) 10Elukey: role::maps::master: fix Tegola container name [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584)
[08:25:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.077s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:25:39] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs5005.eqsin.wmnet with reason: switching to katran
[08:25:40] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs5005 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1160672 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[08:26:42] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] profile::pyrra: improve the istio SLOs template [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey)
[08:26:54] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6009/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) (owner: 10Elukey)
[08:27:01] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::pyrra: fix Citoid's SLO targets [puppet] - 10https://gerrit.wikimedia.org/r/1160076 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey)
[08:27:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus-puppet-agent-stats.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:28:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2001.codfw.wmnet
[08:28:28] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:28:45] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892)
[08:29:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[08:29:14] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:29:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:29:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:29:22] <hashar>	 jouncebot: refresh
[08:29:23] <jouncebot>	 I refreshed my knowledge about deployments.
[08:29:25] <hashar>	 jouncebot: now
[08:29:25] <jouncebot>	 For the next 1 hour(s) and 30 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0800)
[08:29:39] <hashar>	 ah
[08:30:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:31:29] <wikibugs>	 (03CR) 10Stevemunene: "Just did a restart of the service and there was no issue encountered" [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene)
[08:31:30] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:31:30] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:32:56] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:33:26] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892)
[08:33:35] <jinxer-wm>	 FIRING: MailmanBounceQueueHigh: Mailman bounce queue on lists1004:9100 has more than 50 messages - https://wikitech.wikimedia.org/wiki/Mailman/Runbooks#MailmanBounceQueueHigh - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=2 - https://alerts.wikimedia.org/?q=alertname%3DMailmanBounceQueueHigh
[08:33:50] <wikibugs>	 (03PS10) 10Btullis: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378)
[08:33:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[08:34:19] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "The only big change (see PCC) is related to the send_tile_invalidations systemd timer, that in codfw is currently wrongly configured :D" [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) (owner: 10Elukey)
[08:34:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:34:58] <wikibugs>	 (03PS11) 10Brouberol: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[08:35:14] <wikibugs>	 (03PS3) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892)
[08:35:39] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10926934 (10MatthewVernon) FWIW, I use `sudo bash ; . /etc/swift/accountfile.env`, but yes. Those commands will take so...
[08:37:50] <hashar>	 I am running the train NOW
[08:38:36] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host doh7004.wikimedia.org
[08:38:38] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[08:38:41] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs5005.eqsin.wmnet
[08:38:42] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs5005.eqsin.wmnet
[08:38:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P78322 and previous config saved to /var/cache/conftool/dbconfig/20250618-083847-marostegui.json
[08:39:40] <wikibugs>	 (03PS4) 10Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892)
[08:39:59] <logmsgbot>	 !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-coord1003.eqiad.wmnet with reason: Upgrading SSD firmware
[08:40:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10926952 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5ea14561-c44c-4cc5-b656-024e47b3bc03) set by btullis@cumin1003 for 1...
[08:40:21] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet
[08:40:28] <wikibugs>	 (PS1) Vgutierrez: hiera: Repool lvs5005 with katran [puppet] - https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561)
[08:40:33] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.45.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1160694 (https://phabricator.wikimedia.org/T392176)
[08:40:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.45.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1160694 (https://phabricator.wikimedia.org/T392176) (owner: TrainBranchBot)
[08:40:51] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[08:40:53] <wikibugs>	 (CR) CI reject: [V:-1] hiera: Repool lvs5005 with katran [puppet] - https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[08:40:58] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: Jcrespo)
[08:41:05] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet
[08:41:24] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.45.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1160694 (https://phabricator.wikimedia.org/T392176) (owner: TrainBranchBot)
[08:41:55] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet
[08:41:56] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7004.wikimedia.org - jmm@cumin1003"
[08:42:00] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7004.wikimedia.org - jmm@cumin1003"
[08:42:01] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:42:01] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache doh7004.wikimedia.org on all recursors
[08:42:04] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7004.wikimedia.org on all recursors
[08:42:16] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading clouddbs T394372
[08:42:17] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet
[08:42:20] <stashbot>	 T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372
[08:42:26] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet
[08:42:33] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7004.wikimedia.org - jmm@cumin1003"
[08:42:37] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7004.wikimedia.org - jmm@cumin1003"
[08:43:42] <wikibugs>	 (CR) Brouberol: [C:+1] Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: Btullis)
[08:45:09] <wikibugs>	 (PS2) Vgutierrez: hiera: Repool lvs5005 with katran [puppet] - https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561)
[08:45:38] <logmsgbot>	 jmm@cumin1003 makevm (PID 2085372) is awaiting input
[08:46:07] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host doh7004.wikimedia.org with OS bookworm
[08:46:54] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[08:47:15] <wikibugs>	 (CR) FNegri: [C:+2] clouddb1016: upgrade to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1154808 (https://phabricator.wikimedia.org/T394372) (owner: FNegri)
[08:49:50] <wikibugs>	 (PS3) Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852)
[08:50:56] <logmsgbot>	 !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.6  refs T392176
[08:51:00] <stashbot>	 T392176: 1.45.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T392176
[08:51:14] <wikibugs>	 (PS5) Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892)
[08:51:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10927000 (BTullis) Hi @RobH the cookbook failed for an-coord1003 with the following error: ` btullis@cumin1003:~$ sudo cookbook sre.hardware.up...
[08:51:55] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Mark HTTP(S) traffic from dumps with low-priority QoS mark [puppet] - https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) (owner: Cathal Mooney)
[08:53:29] <wikibugs>	 (PS6) Jcrespo: bacula: Migrate backup1001's director role to backup1014 [puppet] - https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892)
[08:53:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T396130)', diff saved to https://phabricator.wikimedia.org/P78324 and previous config saved to /var/cache/conftool/dbconfig/20250618-085354-marostegui.json
[08:54:00] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[08:54:10] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1195.eqiad.wmnet with reason: Maintenance
[08:54:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T396130)', diff saved to https://phabricator.wikimedia.org/P78325 and previous config saved to /var/cache/conftool/dbconfig/20250618-085417-marostegui.json
[08:54:48] <wikibugs>	 (CR) Giuseppe Lavagetto: "I've made a patch integrating some suggestions in the HIDDEPARMA repository, and merged it." [puppet] - https://gerrit.wikimedia.org/r/1160476 (owner: Giuseppe Lavagetto)
[08:55:33] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: Jcrespo)
[08:57:48] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Repool lvs5005 with katran [puppet] - https://gerrit.wikimedia.org/r/1160693 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[08:58:06] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] shellbox: migrate to bookworm-based httpd image [deployment-charts] - https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[08:59:16] <wikibugs>	 (PS1) Cathal Mooney: Mark outbound rsync traffic from clouddumps as low-prio for qos [puppet] - https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153)
[09:00:50] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet
[09:01:02] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet
[09:01:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10927019 (cmooney) >>! In T397153#10925689, @xcollazo wrote: > Should we also mark rsync traffic as low-priority then?  Hmm yeah it might not be a...
[09:01:51] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10927020 (cmooney) FWIW the change to mark the HTTP traffic is in place and working ` cmooney@clouddumps1002:~$ sudo iptables -v -n -t mangle -L P...
[09:02:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:02:55] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet
[09:03:05] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet
[09:03:16] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet
[09:03:32] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet
[09:04:35] <vgutierrez>	 !log repool lvs5005 (upload) using katran - T396561
[09:04:39] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5005.eqsin.wmnet} and A:liberica (T396561)
[09:04:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:40] <stashbot>	 T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561
[09:04:58] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5005.eqsin.wmnet} and A:liberica (T396561)
[09:04:59] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[09:05:46] <elukey>	 vgutierrez: \o/
[09:05:57] <vgutierrez>	 elukey: <3
[09:07:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:08:28] <wikibugs>	 (PS2) Giuseppe Lavagetto: requestctl: switch CLI from native client to API client [puppet] - https://gerrit.wikimedia.org/r/1160476
[09:09:12] <wikibugs>	 (CR) FNegri: [C:+2] clouddb1020: upgrade to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1154809 (https://phabricator.wikimedia.org/T394372) (owner: FNegri)
[09:10:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.458s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:11:21] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox
[09:11:24] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors
[09:11:27] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors
[09:11:28] <wikibugs>	 (PS1) Hnowlan: changeprop: bump concurrency for pcs_rerender_native_on_null [deployment-charts] - https://gerrit.wikimedia.org/r/1160704 (https://phabricator.wikimedia.org/T397072)
[09:11:56] <wikibugs>	 (PS4) Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852)
[09:12:28] <wikibugs>	 (CR) Elukey: [C:+2] "Added it, makes total sense!" [puppet] - https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: Elukey)
[09:12:56] <wikibugs>	 (CR) Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: Elukey)
[09:13:58] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on doh7004.wikimedia.org with reason: host reimage
[09:15:12] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[09:15:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.053s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:15:40] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors
[09:15:43] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors
[09:17:27] <wikibugs>	 (PS3) Giuseppe Lavagetto: requestctl: switch CLI from native client to API client [puppet] - https://gerrit.wikimedia.org/r/1160476
[09:17:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T396130)', diff saved to https://phabricator.wikimedia.org/P78327 and previous config saved to /var/cache/conftool/dbconfig/20250618-091738-marostegui.json
[09:17:43] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[09:17:49] <wikibugs>	 (CR) FNegri: [C:+1] "SGTM, but I have a limited understanding on the actual usage of these hosts. It would be interesting to monitor if the bulk of the traffic" [puppet] - https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) (owner: Cathal Mooney)
[09:18:07] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7004.wikimedia.org with reason: host reimage
[09:18:28] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:32] <wikibugs>	 (PS1) Giuseppe Lavagetto: Use an actual user for the fake api tokens [labs/private] - https://gerrit.wikimedia.org/r/1160706
[09:18:42] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Use an actual user for the fake api tokens [labs/private] - https://gerrit.wikimedia.org/r/1160706 (owner: Giuseppe Lavagetto)
[09:18:53] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading clouddbs T394372
[09:18:58] <stashbot>	 T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372
[09:19:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:19:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:19:17] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:19:48] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6011/co" [puppet] - https://gerrit.wikimedia.org/r/1160476 (owner: Giuseppe Lavagetto)
[09:20:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:21:11] <wikibugs>	 SRE, SRE-SLO, Observability-Metrics, Patch-For-Review: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10927063 (elukey) >>! In T391852#10922168, @Mvolz wrote: >>>! In T391852#10919212, @elukey wrote: >> I am reopening this t...
[09:21:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:21:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:22:45] <wikibugs>	 (CR) Jcrespo: "Alex, can I ask you for a review? The director code will need a cleanup afterwards, but I want to first do the migration and remove backup" [puppet] - https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: Jcrespo)
[09:22:55] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:24:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.209s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:24:20] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet
[09:24:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:26:30] <wikibugs>	 (CR) Jgiannelos: [C:+1] changeprop: bump concurrency for pcs_rerender_native_on_null [deployment-charts] - https://gerrit.wikimedia.org/r/1160704 (https://phabricator.wikimedia.org/T397072) (owner: Hnowlan)
[09:28:14] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox
[09:29:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.206s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:29:27] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10927084 (MoritzMuehlenhoff)
[09:29:33] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[09:29:43] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10927085 (MoritzMuehlenhoff)
[09:30:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.07%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[09:32:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P78329 and previous config saved to /var/cache/conftool/dbconfig/20250618-093245-marostegui.json
[09:34:26] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7004.wikimedia.org with OS bookworm
[09:34:26] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7004.wikimedia.org
[09:35:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.07%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[09:38:47] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160171 (owner: Muehlenhoff)
[09:39:39] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[09:40:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:30] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good (based on https://phabricator.wikimedia.org/T396584#10922024)" [puppet] - https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) (owner: Elukey)
[09:40:48] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netboxdb2003.codfw.wmnet
[09:44:34] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2003.codfw.wmnet
[09:46:08] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[09:47:33] <wikibugs>	 SRE-tools, DC-Ops, Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927145 (BTullis) Hello, just to let you know, I'm now trying the same operation on an-coord1003 T394499#10927000 and getting the same error as @RobH ab...
[09:47:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P78331 and previous config saved to /var/cache/conftool/dbconfig/20250618-094752-marostegui.json
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:53:41] <wikibugs>	 (PS5) JMeybohm: kind.sh can bootstrap a wikikube like cluster with kind [deployment-charts] - https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107)
[09:54:08] <wikibugs>	 (CR) JMeybohm: "Unfortunately it's not, due to https://github.com/helmfile/helmfile/issues/2084" [deployment-charts] - https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: JMeybohm)
[09:55:13] <wikibugs>	 (CR) Kosta Harlan: [C:+1] Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: Mimurawil)
[09:55:20] <hnowlan>	 jouncebot: nowandnext
[09:55:21] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T0800)
[09:55:21] <jouncebot>	 In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1000)
[09:55:40] <claime>	 hnowlan: infra window will include codfw depool of wikikube btw
[09:56:17] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[09:56:24] <wikibugs>	 (CR) Elukey: Change log format to get name of image being built (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1156315 (owner: Hashar)
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:41] <wikibugs>	 (CR) Hnowlan: [C:+2] changeprop: bump concurrency for pcs_rerender_native_on_null [deployment-charts] - https://gerrit.wikimedia.org/r/1160704 (https://phabricator.wikimedia.org/T397072) (owner: Hnowlan)
[09:57:44] <wikibugs>	 (CR) Elukey: "I don't have a strong preference, but what is the advantage of having it in the new version (to help me understanding the change better) ?" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1156733 (owner: Hashar)
[09:58:27] <wikibugs>	 (Merged) jenkins-bot: changeprop: bump concurrency for pcs_rerender_native_on_null [deployment-charts] - https://gerrit.wikimedia.org/r/1160704 (https://phabricator.wikimedia.org/T397072) (owner: Hnowlan)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1000)
[10:00:05] <jouncebot>	 jayme, Raine, and claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:02:42] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:02:51] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:02:55] <wikibugs>	 (PS1) Effie Mouzeli: mediawiki-common: remove disk-type affinity [deployment-charts] - https://gerrit.wikimedia.org/r/1160719
[10:03:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T396130)', diff saved to https://phabricator.wikimedia.org/P78333 and previous config saved to /var/cache/conftool/dbconfig/20250618-100300-marostegui.json
[10:03:05] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[10:03:16] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[10:03:22] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:03:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T396130)', diff saved to https://phabricator.wikimedia.org/P78335 and previous config saved to /var/cache/conftool/dbconfig/20250618-100329-marostegui.json
[10:04:14] <jayme>	 topranks: _joe_: We're going to depool wikikube codfw for around an hour as a precautionary test for the upcoming kubernetes upgrade
[10:04:17] <wikibugs>	 (PS1) Marostegui: db2191: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160722 (https://phabricator.wikimedia.org/T397279)
[10:04:35] <elukey>	 jayme: wow we are upgrading?
[10:04:49] <wikibugs>	 (PS2) Effie Mouzeli: mediawiki-common: remove disk-type affinity [deployment-charts] - https://gerrit.wikimedia.org/r/1160719
[10:04:54] <topranks>	 jayme: ok thanks, is that different from the depool c.laime mentioned in -sre ?
[10:05:07] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[10:05:12] <jayme>	 topranks: lol, no - sorry
[10:05:21] <claime>	 x)
[10:05:27] <topranks>	 ha no worries it sounded the same I was just double-checking 
[10:05:35] <topranks>	 thanks for letting us know :) 
[10:05:36] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[10:06:11] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[10:06:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[10:07:45] <hnowlan>	 I'm done with changeprop, go ahead 
[10:07:48] <wikibugs>	 (PS1) Vgutierrez: hiera: Switch lvs5004 to katran [puppet] - https://gerrit.wikimedia.org/r/1160723 (https://phabricator.wikimedia.org/T396561)
[10:08:15] <jayme>	 hnowlan: ack, thanks
[10:08:51] <wikibugs>	 (CR) Clément Goubert: [C:+1] mediawiki-common: remove disk-type affinity [deployment-charts] - https://gerrit.wikimedia.org/r/1160719 (owner: Effie Mouzeli)
[10:09:20] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160723 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[10:10:16] <wikibugs>	 (PS1) Slyngshede: Update requirements for netbox 4.0.11 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1160724
[10:10:33] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-cluster depool 44 services in codfw/codfw: pre-upgrade-test
[10:10:33] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-cluster (exit_code=99) depool 44 services in codfw/codfw: pre-upgrade-test
[10:12:38] <wikibugs>	 (PS2) Slyngshede: Update requirements for netbox 4.0.11 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1160724 (https://phabricator.wikimedia.org/T397300)
[10:13:32] <wikibugs>	 (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1159470 (owner: PipelineBot)
[10:14:49] <jynus>	 !log starting backup director migration backup1001 -> backup1014 T387892
[10:14:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:53] <stashbot>	 T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892
[10:14:57] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] bacula: Migrate backup1001's director role to backup1014 [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[10:16:06] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159470 (owner: 10PipelineBot)
[10:16:28] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netboxdb1003.eqiad.wmnet
[10:18:27] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2212* slowly with 10 steps - Pooling in
[10:18:36] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10927316 (10MatthewVernon) Silly question while I'm here - do you need 2 buckets, each of which ends up replicated cros...
[10:20:22] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1003.eqiad.wmnet
[10:20:27] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup[1001,1014].eqiad.wmnet with reason: Backup director migration
[10:21:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:24] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[10:26:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T396130)', diff saved to https://phabricator.wikimedia.org/P78337 and previous config saved to /var/cache/conftool/dbconfig/20250618-102655-marostegui.json
[10:27:01] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[10:28:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:29:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good (once the NDA is completed)" [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) (owner: 10Herron)
[10:31:01] <logmsgbot>	 !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2024.codfw.wmnet with reason: remove for decom
[10:32:54] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927411 (10Volans) @BTullis the SSD upgrade is a type of its own, not STORAGE, so the files must be in `/srv/firmware/poweredge-r440/SSD`. If you use that...
[10:33:51] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1160723 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[10:35:28] <Reedy>	 jouncebot: nowandnext
[10:35:28] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1000)
[10:35:28] <jouncebot>	 In 0 hour(s) and 24 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1100)
[10:35:42] <wikibugs>	 (03PS5) 10Reedy: composer: Various updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160144
[10:35:50] <wikibugs>	 (03CR) 10Reedy: [C:03+2] composer: Various updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160144 (owner: 10Reedy)
[10:35:55] <wikibugs>	 (03PS4) 10Reedy: Setup json linting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160151 (https://phabricator.wikimedia.org/T397191)
[10:36:01] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Setup json linting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160151 (https://phabricator.wikimedia.org/T397191) (owner: 10Reedy)
[10:36:40] <wikibugs>	 (03Merged) 10jenkins-bot: composer: Various updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160144 (owner: 10Reedy)
[10:36:51] <wikibugs>	 (03Merged) 10jenkins-bot: Setup json linting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160151 (https://phabricator.wikimedia.org/T397191) (owner: 10Reedy)
[10:37:10] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Sync firmwares directory between the cumin hosts - https://phabricator.wikimedia.org/T397306 (10Volans) 03NEW p:05Triage→03Medium
[10:37:16] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927437 (10Volans) Created T397306
[10:37:29] <wikibugs>	 (03PS1) 10Hnowlan: changeprop: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160727 (https://phabricator.wikimedia.org/T397072)
[10:40:12] <wikibugs>	 (03PS12) 10Reedy: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender)
[10:40:31] <logmsgbot>	 !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-coord1003.eqiad.wmnet with reason: Upgrading SSD firmware
[10:40:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2191', diff saved to https://phabricator.wikimedia.org/P78338 and previous config saved to /var/cache/conftool/dbconfig/20250618-104033-root.json
[10:40:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10927450 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=49f11e46-f52a-4db8-a2cc-7688a3599023) set by btullis@cumin1003 for 1...
[10:40:45] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender)
[10:40:50] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet
[10:40:57] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2191.codfw.wmnet with reason: Maintenance
[10:41:01] <icinga-wm>	 RECOVERY - mailman3_queue_size on lists1004 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[10:41:49] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Sync firmwares directory between the cumin hosts - https://phabricator.wikimedia.org/T397306#10927452 (10MoritzMuehlenhoff) We could also have one seedhost on a single designated Cumin host where dc ops can write to. And then set up an rsync which syncs th...
[10:41:49] <wikibugs>	 (03Merged) 10jenkins-bot: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender)
[10:41:50] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2191: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160722 (https://phabricator.wikimedia.org/T397279) (owner: 10Marostegui)
[10:42:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P78339 and previous config saved to /var/cache/conftool/dbconfig/20250618-104203-marostegui.json
[10:43:26] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-coord1003.eqiad.wmnet
[10:43:35] <jinxer-wm>	 RESOLVED: MailmanBounceQueueHigh: Mailman bounce queue on lists1004:9100 has more than 50 messages - https://wikitech.wikimedia.org/wiki/Mailman/Runbooks#MailmanBounceQueueHigh - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=2 - https://alerts.wikimedia.org/?q=alertname%3DMailmanBounceQueueHigh
[10:45:44] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10927481 (10MoritzMuehlenhoff)
[10:46:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78340 and previous config saved to /var/cache/conftool/dbconfig/20250618-104609-root.json
[10:47:44] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
[10:48:41] <logmsgbot>	 !log root@cumin1002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for backup1009.eqiad.wmnet: Renew puppet certificate - root@cumin1002
[10:48:52] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] changeprop: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160727 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan)
[10:48:59] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1160144|composer: Various updates]], [[gerrit:1160151|Setup json linting (T397191)]], [[gerrit:1130201|Improve function and property documentation for php code (T171115)]]
[10:49:04] <stashbot>	 T397191: Add JSON syntax check to mediawiki-config CI - https://phabricator.wikimedia.org/T397191
[10:49:05] <stashbot>	 T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115
[10:49:23] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Sync firmwares directory between the cumin hosts - https://phabricator.wikimedia.org/T397306#10927495 (10Volans) That's an interesting idea that would work right now because the auto-download from the Dell website is broken, but if we fix that then any cum...
[10:49:49] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mediawiki_experimental: $kubernetes_release dir fix [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[10:49:53] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[10:50:53] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160724 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede)
[10:51:14] <logmsgbot>	 !log reedy@deploy1003 umherirrender, reedy: Backport for [[gerrit:1160144|composer: Various updates]], [[gerrit:1160151|Setup json linting (T397191)]], [[gerrit:1130201|Improve function and property documentation for php code (T171115)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:51:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:52:03] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Update wrong role for backup2009 [puppet] - 10https://gerrit.wikimedia.org/r/1160730 (https://phabricator.wikimedia.org/T387892)
[10:52:37] <logmsgbot>	 !log reedy@deploy1003 umherirrender, reedy: Continuing with sync
[10:52:49] <logmsgbot>	 !log root@cumin1002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for backup1009.eqiad.wmnet: Renew puppet certificate - root@cumin1002
[10:53:26] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] bacula: Update wrong role for backup2009 [puppet] - 10https://gerrit.wikimedia.org/r/1160730 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[10:54:10] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1003.eqiad.wmnet
[10:54:11] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts an-coord1003.eqiad.wmnet
[10:54:33] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] "Thanks claime!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160727 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan)
[10:56:49] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: bump replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160727 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan)
[10:57:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P78341 and previous config saved to /var/cache/conftool/dbconfig/20250618-105710-marostegui.json
[10:58:34] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[10:58:41] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet
[10:59:20] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160144|composer: Various updates]], [[gerrit:1160151|Setup json linting (T397191)]], [[gerrit:1130201|Improve function and property documentation for php code (T171115)]] (duration: 10m 20s)
[10:59:25] <stashbot>	 T397191: Add JSON syntax check to mediawiki-config CI - https://phabricator.wikimedia.org/T397191
[10:59:26] <stashbot>	 T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115
[10:59:59] <wikibugs>	 (03PS1) 10Btullis: Revert "Failover hive and presto to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1160733
[11:00:04] <jouncebot>	 mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1100).
[11:00:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[11:01:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78342 and previous config saved to /var/cache/conftool/dbconfig/20250618-110114-root.json
[11:01:39] <wikibugs>	 (03PS2) 10Btullis: Revert "Failover hive and presto to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1160733
[11:01:42] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: remove disk-type affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160719 (owner: 10Effie Mouzeli)
[11:02:16] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.puppet.migrate-host for host backup1009.eqiad.wmnet
[11:02:27] <logmsgbot>	 !log root@cumin1002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host backup1009.eqiad.wmnet
[11:02:54] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Failover hive and presto to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1160733 (owner: 10Btullis)
[11:03:16] <logmsgbot>	 !log btullis@dns1004 START - running authdns-update
[11:04:01] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mediawiki_experimental: $kubernetes_release dir fix [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[11:04:10] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki-common: remove disk-type affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160719 (owner: 10Effie Mouzeli)
[11:04:13] <logmsgbot>	 !log btullis@dns1004 END - running authdns-update
[11:04:33] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Force puppet7 on backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/1160735 (https://phabricator.wikimedia.org/T387892)
[11:05:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[11:06:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[11:06:10] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[11:07:03] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.puppet.migrate-host for host backup1009.eqiad.wmnet
[11:07:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti2023/ganeti2024 as Ganeti servers [puppet] - 10https://gerrit.wikimedia.org/r/1159937 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff)
[11:07:14] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] bacula: Force puppet7 on backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/1160735 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[11:07:52] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[11:09:56] <logmsgbot>	 !log root@cumin1002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host backup1009.eqiad.wmnet
[11:12:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T396130)', diff saved to https://phabricator.wikimedia.org/P78343 and previous config saved to /var/cache/conftool/dbconfig/20250618-111217-marostegui.json
[11:12:23] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[11:12:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[11:12:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78344 and previous config saved to /var/cache/conftool/dbconfig/20250618-111239-marostegui.json
[11:13:32] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[11:16:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78345 and previous config saved to /var/cache/conftool/dbconfig/20250618-111620-root.json
[11:18:10] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[11:18:10] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927588 (10BTullis) >>! In T394543#10927411, @Volans wrote: > @BTullis the SSD upgrade is a type of its own, not STORAGE, so the files must be in `/srv/fi...
[11:19:23] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-experimental: scale down resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160738
[11:19:42] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply
[11:20:16] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:21:20] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:21:48] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:22:31] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mw-experimental: scale down resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160738 (owner: 10Effie Mouzeli)
[11:22:45] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: Failed power supply on es1045 - https://phabricator.wikimedia.org/T397310 (10FCeratto-WMF) 03NEW
[11:22:49] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mw-experimental: scale down resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160738 (owner: 10Effie Mouzeli)
[11:24:02] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10927618 (10MoritzMuehlenhoff)
[11:24:31] <wikibugs>	 (03Merged) 10jenkins-bot: mw-experimental: scale down resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160738 (owner: 10Effie Mouzeli)
[11:24:33] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: Failed power supply on es1045 - https://phabricator.wikimedia.org/T397310#10927627 (10Marostegui) p:05Triage→03Medium
[11:24:45] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:25:12] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:26:21] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10927634 (10taavi)
[11:26:47] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[11:27:34] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[11:28:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:30:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[11:30:41] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10927696 (10MoritzMuehlenhoff)
[11:31:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78347 and previous config saved to /var/cache/conftool/dbconfig/20250618-113103-marostegui.json
[11:31:08] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[11:31:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78348 and previous config saved to /var/cache/conftool/dbconfig/20250618-113125-root.json
[11:35:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[11:43:37] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Migrate and create general properties for the new/renamed roles [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892)
[11:44:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] otel: add tolerations for mw-experimental hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160173 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[11:45:11] <wikibugs>	 (03PS1) 10KartikMistry: Enable the Contribute menu in 8th group of Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160740 (https://phabricator.wikimedia.org/T395084)
[11:46:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P78349 and previous config saved to /var/cache/conftool/dbconfig/20250618-114610-marostegui.json
[11:46:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160740 (https://phabricator.wikimedia.org/T395084) (owner: 10KartikMistry)
[11:47:18] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Migrate and create general properties for the new/renamed roles [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892)
[11:47:46] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[11:50:31] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160742
[11:51:34] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160742 (owner: 10Hnowlan)
[11:52:02] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "this looks good to me now and spins up a working environment in kind! Also the dependencies between the admin are correct, no second `helm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm)
[11:52:22] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160742 (owner: 10Hnowlan)
[11:53:58] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160742 (owner: 10Hnowlan)
[11:54:46] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[11:54:47] <wikibugs>	 (03PS1) 10Volans: sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743
[11:55:04] <wikibugs>	 (03PS3) 10Jcrespo: bacula: Migrate and create general properties for the new/renamed roles [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892)
[11:55:07] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Update requirements for netbox 4.0.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160724 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede)
[11:55:11] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[11:55:11] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Update requirements for netbox 4.0.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160724 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede)
[11:55:17] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[11:55:20] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[11:55:21] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[11:55:26] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[11:55:39] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10927811 (10Volans) The cookbook exited with that code because it had a failure, unfortunately was missing a useful logging message at the right point. I'm...
[11:56:34] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[11:58:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on aux-k8s-worker1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:58:58] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] bacula: Migrate and create general properties for the new/renamed roles [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[12:00:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[12:01:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P78350 and previous config saved to /var/cache/conftool/dbconfig/20250618-120117-marostegui.json
[12:02:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1160739 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[12:04:04] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10927851 (10brouberol) 05In progress→03Resolved This ^ message ^ was posted by a rogue reimage cookbook that had bee...
[12:05:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[12:05:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:08:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Handle dnsutils/bind9-dnsutils correctly across all OSes [puppet] - 10https://gerrit.wikimedia.org/r/1152246 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff)
[12:08:33] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab-runner: upgrade default image to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1160119 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto)
[12:09:07] <icinga-wm>	 PROBLEM - Host ms-fe1016 is DOWN: PING CRITICAL - Packet loss = 100%
[12:09:31] <jinxer-wm>	 FIRING: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:10:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:11:10] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: increase CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160747
[12:11:35] <icinga-wm>	 RECOVERY - Host ms-fe1016 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[12:12:15] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mobileapps: increase CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160747 (owner: 10Hnowlan)
[12:12:27] <wikibugs>	 (03PS2) 10Filippo Giunchedi: thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318)
[12:12:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: force query-frontend query stats [puppet] - 10https://gerrit.wikimedia.org/r/1160748 (https://phabricator.wikimedia.org/T394318)
[12:12:28] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: enable snappy compression for grpc in query [puppet] - 10https://gerrit.wikimedia.org/r/1160749 (https://phabricator.wikimedia.org/T394318)
[12:12:30] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos limits in common [puppet] - 10https://gerrit.wikimedia.org/r/1160750
[12:13:06] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: increase CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160747 (owner: 10Hnowlan)
[12:13:29] <jinxer-wm>	 RESOLVED: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:14:20] <hnowlan>	 working to fix that ^
[12:14:54] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: increase CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160747 (owner: 10Hnowlan)
[12:14:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[12:15:00] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:15:32] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318)
[12:15:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Failed power supply on es1045 - https://phabricator.wikimedia.org/T397310#10927872 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable idrac shows healthy
[12:15:46] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:16:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78351 and previous config saved to /var/cache/conftool/dbconfig/20250618-121624-marostegui.json
[12:16:29] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[12:16:40] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[12:16:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T396130)', diff saved to https://phabricator.wikimedia.org/P78352 and previous config saved to /var/cache/conftool/dbconfig/20250618-121646-marostegui.json
[12:17:00] <wikibugs>	 (03PS3) 10Filippo Giunchedi: thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318)
[12:17:01] <wikibugs>	 (03PS3) 10Filippo Giunchedi: hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318)
[12:20:51] <jinxer-wm>	 RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:22:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:04-1] "Looks like the hostgroup definition is missing" [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen)
[12:23:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:24:15] <wikibugs>	 (03PS1) 10Jgiannelos: changeprop: Add header with event timestamp for PCS requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160753
[12:24:30] <wikibugs>	 (03PS2) 10Jgiannelos: changeprop: Add header with event timestamp for PCS requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160753 (https://phabricator.wikimedia.org/T397072)
[12:26:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:27:59] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] team-sre: check PoPs for PrometheusDown [alerts] - 10https://gerrit.wikimedia.org/r/1160177 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi)
[12:29:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[12:29:19] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] thanos: force query-frontend query stats [puppet] - 10https://gerrit.wikimedia.org/r/1160748 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[12:29:32] <jinxer-wm>	 FIRING: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:29:58] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Degraded RAID on analytics1073 - https://phabricator.wikimedia.org/T397231#10927991 (10Jclark-ctr) Hey @btullis  will we be swapping this drive or is this server due to be decom?  7 years old  i dont believe i have any 120gb drives. but si...
[12:30:18] <Amir1>	 jouncebot: nowandnext
[12:30:18] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[12:30:18] <jouncebot>	 In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1300)
[12:31:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:31:59] <jnuche>	 @Amir1: I was going to deploy a new version of scap, that ok from your side?
[12:32:11] <Amir1>	 yeah, I actually changed my mind
[12:32:13] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] role::maps::master: fix Tegola container name [puppet] - 10https://gerrit.wikimedia.org/r/1160688 (https://phabricator.wikimedia.org/T396584) (owner: 10Elukey)
[12:32:17] <jnuche>	 ack
[12:32:44] <logmsgbot>	 !log jnuche@deploy1003 Installing scap version "4.178.0" for 183 host(s)
[12:32:56] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755
[12:33:29] <jinxer-wm>	 RESOLVED: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:33:33] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[12:34:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[12:34:40] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2161.codfw.wmnet with reason: Maintenance
[12:35:04] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156325 (owner: 10PipelineBot)
[12:35:08] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160232 (owner: 10PipelineBot)
[12:36:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] team-sre: check PoPs for PrometheusDown [alerts] - 10https://gerrit.wikimedia.org/r/1160177 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi)
[12:36:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T396130)', diff saved to https://phabricator.wikimedia.org/P78353 and previous config saved to /var/cache/conftool/dbconfig/20250618-123658-marostegui.json
[12:37:03] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[12:37:30] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] thanos: enable snappy compression for grpc in query [puppet] - 10https://gerrit.wikimedia.org/r/1160749 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[12:37:46] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892)
[12:38:33] <wikibugs>	 (03CR) 10Volans: "Additional context in https://phabricator.wikimedia.org/T394543#10927588" [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans)
[12:38:34] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892)
[12:38:38] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[12:40:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[12:41:05] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-06-09-163022 to 2025-06-17-205547 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160759 (https://phabricator.wikimedia.org/T394401)
[12:41:08] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-10-144243 to 2025-06-17-204731 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160760 (https://phabricator.wikimedia.org/T390550)
[12:41:19] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Enable memcached-based batching for ZObjects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550)
[12:41:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:41:51] <wikibugs>	 (03PS3) 10Jgiannelos: changeprop: Add header with event timestamp for PCS requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160753 (https://phabricator.wikimedia.org/T397072)
[12:43:27] <elukey>	 !log drop old Thanos Swift's Tegola tile cache containers - T396584
[12:43:27] <wikibugs>	 (03PS3) 10Jcrespo: bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892)
[12:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:31] <stashbot>	 T396584: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584
[12:43:32] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[12:43:44] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1048.eqiad.wmnet
[12:45:34] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[12:45:38] <wikibugs>	 (03PS4) 10Jcrespo: bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892)
[12:46:00] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] bacula: Discourage the usage of backup1001 as director [puppet] - 10https://gerrit.wikimedia.org/r/1160756 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[12:48:10] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[12:48:54] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[12:49:12] <wikibugs>	 (03PS2) 10Hnowlan: mobileapps: increase replicas, drop CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755
[12:49:48] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] mobileapps: increase replicas, drop CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755 (owner: 10Hnowlan)
[12:51:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:51:54] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: increase replicas, drop CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755 (owner: 10Hnowlan)
[12:52:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P78354 and previous config saved to /var/cache/conftool/dbconfig/20250618-125206-marostegui.json
[12:53:42] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: increase replicas, drop CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160755 (owner: 10Hnowlan)
[12:54:11] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:55:12] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:56:51] <jinxer-wm>	 RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:57:44] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] Mark outbound rsync traffic from clouddumps as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney)
[12:59:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.05%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1300).
[13:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:16] <Lucas_WMDE>	 o/
[13:00:40] <kart_>	 alright here
[13:00:40] <jynus>	 !log bacula director migration finalized, backup1014 is the new bacula director. backup1001 should no longer be used. T387892
[13:00:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:44] <stashbot>	 T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892
[13:00:46] <kart_>	 Lucas_WMDE: I can deploy
[13:00:59] <jnuche>	 hi there, please stand by for backports for a bit
[13:01:01] <wikibugs>	 (03CR) 10Elukey: [C:03+1] sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans)
[13:01:02] <jnuche>	 I need to deploy a scap fix
[13:01:17] <kart_>	 jnuche: sure. let me know.
[13:01:48] <Lucas_WMDE>	 kart_, jnuche: go ahead, I’m a bit busy right now anyway :)
[13:02:09] <kart_>	 :)
[13:03:10] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks!  FWIW we can see the distribution in our netflow data:" [puppet] - 10https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney)
[13:03:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Mark outbound rsync traffic from clouddumps as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1160699 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney)
[13:03:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:04:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.05%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[13:05:44] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763
[13:06:25] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303)
[13:07:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P78355 and previous config saved to /var/cache/conftool/dbconfig/20250618-130713-marostegui.json
[13:07:31] <logmsgbot>	 !log jnuche@deploy1003 Installing scap version "4.178.1" for 4 host(s)
[13:07:33] <godog>	 jouncebot: now and next
[13:07:33] <jouncebot>	 For the next 0 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1300)
[13:08:20] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:esams or A:drmrs and A:cp - 9.2.10 upgrade (T390912)
[13:08:23] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10928125 (10elukey) >>! In T396584#10927316, @MatthewVernon wrote: > Silly question while I'm here - do you need 2 buck...
[13:08:24] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[13:08:35] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160765 (https://phabricator.wikimedia.org/T397303)
[13:09:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160765 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez)
[13:09:40] <wikibugs>	 (03Abandoned) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160765 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez)
[13:09:44] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 (owner: 10Hnowlan)
[13:10:22] <logmsgbot>	 !log jnuche@deploy1003 Installation of scap version "4.178.1" completed for 4 hosts
[13:10:55] <jnuche>	 scap updated, need a minute to verify
[13:10:57] <wikibugs>	 (03PS2) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303)
[13:11:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] memcached: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156269 (owner: 10Muehlenhoff)
[13:11:52] <wikibugs>	 (03PS2) 10Hnowlan: mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763
[13:12:18] <jnuche>	 kart_: all good, you can go ahead, thanks for your patience!
[13:12:19] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 (owner: 10Hnowlan)
[13:12:37] <kart_>	 jnuche: Sure. Thanks!
[13:12:41] <wikibugs>	 (03PS1) 10Ssingh: Release 9.2.11-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1160766 (https://phabricator.wikimedia.org/T397308)
[13:13:05] <hnowlan>	 I am going to apply a change for mobileapps - backport window can continue alongside it but it's critical my change is applied
[13:13:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160740 (https://phabricator.wikimedia.org/T395084) (owner: 10KartikMistry)
[13:13:41] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5004.eqsin.wmnet} and A:liberica (T396561)
[13:13:46] <stashbot>	 T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561
[13:13:47] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 (owner: 10Hnowlan)
[13:13:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:13:58] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the Contribute menu in 8th group of Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160740 (https://phabricator.wikimedia.org/T395084) (owner: 10KartikMistry)
[13:14:06] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5004.eqsin.wmnet} and A:liberica (T396561)
[13:14:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:14:23] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1160740|Enable the Contribute menu in 8th group of Wikipedias (T395084)]]
[13:14:28] <stashbot>	 T395084: Enable the Contribute menu in 8th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T395084
[13:14:33] <wikibugs>	 (03CR) 10Btullis: [C:03+1] sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans)
[13:14:41] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs5004.eqsin.wmnet with reason: switching to katran
[13:14:42] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs5004 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1160723 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[13:15:32] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: drop memory usage, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160763 (owner: 10Hnowlan)
[13:15:44] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[13:15:47] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:16:41] <logmsgbot>	 !log kartik@deploy1003 kartik: Backport for [[gerrit:1160740|Enable the Contribute menu in 8th group of Wikipedias (T395084)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:18:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:18:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:18:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:19:21] <logmsgbot>	 !log kartik@deploy1003 kartik: Continuing with sync
[13:19:36] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[13:19:40] <wikibugs>	 (03CR) 10Volans: [C:03+2] sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans)
[13:19:44] <wikibugs>	 (03PS1) 10Hnowlan: admin_ng: increase limits for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160767
[13:19:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:19:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[13:19:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:21:35] <logmsgbot>	 jhancock@cumin2002 provision (PID 63771) is awaiting input
[13:22:15] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Update evaluators from 2025-06-09-163022 to 2025-06-17-205547 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160759 (https://phabricator.wikimedia.org/T394401)
[13:22:15] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-10-144243 to 2025-06-18-130945 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160760 (https://phabricator.wikimedia.org/T390550)
[13:22:15] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Enable memcached-based batching for ZObjects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550)
[13:22:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T396130)', diff saved to https://phabricator.wikimedia.org/P78356 and previous config saved to /var/cache/conftool/dbconfig/20250618-132220-marostegui.json
[13:22:25] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[13:22:36] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[13:22:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78357 and previous config saved to /var/cache/conftool/dbconfig/20250618-132242-marostegui.json
[13:23:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:24:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[13:24:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[13:24:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:26:20] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160740|Enable the Contribute menu in 8th group of Wikipedias (T395084)]] (duration: 11m 57s)
[13:26:25] <stashbot>	 T395084: Enable the Contribute menu in 8th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T395084
[13:26:55] <kart_>	 Done.
[13:27:02] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: improve SSD logging [cookbooks] - 10https://gerrit.wikimedia.org/r/1160743 (owner: 10Volans)
[13:27:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: increase limits for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160767 (owner: 10Hnowlan)
[13:29:09] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Repool lvs5004 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160770 (https://phabricator.wikimedia.org/T396561)
[13:29:28] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160770 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[13:29:32] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:29:35] <wikibugs>	 (03CR) 10Herron: [C:03+1] profile::pyrra: add SLO ratio for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey)
[13:29:51] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Repool lvs5004 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160770 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[13:29:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:30:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.09%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[13:30:45] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs5004.eqsin.wmnet
[13:30:45] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs5004.eqsin.wmnet
[13:31:17] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs5004 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1160770 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[13:31:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10928195 (10cmooney) 05Open→03Resolved a:03cmooney
[13:31:50] <wikibugs>	 (03CR) 10Herron: [C:03+1] thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[13:31:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] memcached/gutter: Switch to firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156659 (owner: 10Muehlenhoff)
[13:32:13] <wikibugs>	 (03CR) 10Herron: [C:03+1] hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[13:33:12] <wikibugs>	 (03PS1) 10Elukey: profile::docker::reporter: add another not supported image [puppet] - 10https://gerrit.wikimedia.org/r/1160771
[13:35:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.14%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[13:36:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.636s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:36:25] <logmsgbot>	 !log jnuche@deploy1003 Installing scap version "4.178.2" for 4 host(s)
[13:37:09] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5004.eqsin.wmnet} and A:liberica (T396561)
[13:37:10] <vgutierrez>	 !log repool lvs5004 (text) using katran - T396561
[13:37:14] <stashbot>	 T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561
[13:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:28] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5004.eqsin.wmnet} and A:liberica (T396561)
[13:39:15] <logmsgbot>	 !log jnuche@deploy1003 Installation of scap version "4.178.2" completed for 4 hosts
[13:40:17] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: bump replicas significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160773
[13:40:21] <wikibugs>	 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10928238 (10MatthewVernon)
[13:40:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:41:04] <wikibugs>	 (03CR) 10Cathal Mooney: "LGTM overall, one nit/potential typo in line." [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[13:41:15] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.007s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:42:30] <moritzm>	 !log installing net-tools regression updates on Bullseye
[13:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78358 and previous config saved to /var/cache/conftool/dbconfig/20250618-134307-marostegui.json
[13:43:12] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[13:47:07] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:47:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:48:12] <jinxer-wm>	 FIRING: SLOMetricAbsent: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:48:53] <wikibugs>	 (03PS1) 10Bking: cirrus-streaming-updater: raise jobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160780 (https://phabricator.wikimedia.org/T397335)
[13:49:46] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] mobileapps: bump replicas significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160773 (owner: 10Hnowlan)
[13:49:53] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] cirrus-streaming-updater: raise jobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160780 (https://phabricator.wikimedia.org/T397335) (owner: 10Bking)
[13:50:09] <wikibugs>	 (03PS2) 10Bking: cirrus-streaming-updater: raise jobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160780 (https://phabricator.wikimedia.org/T397335)
[13:53:12] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:54:09] <wikibugs>	 (03CR) 10Hashar: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar)
[13:54:17] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[13:54:31] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[13:56:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.263s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:56:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:57:07] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "+1 deploy at will <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160173 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[13:58:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P78359 and previous config saved to /var/cache/conftool/dbconfig/20250618-135814-marostegui.json
[13:59:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[13:59:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2005']
[13:59:37] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2005']
[13:59:59] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:00:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2005.codfw.wmnet with OS bullseye
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1400)
[14:00:13] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928384 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2005.codfw.wmnet with OS bullseye
[14:00:53] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-06-09-163022 to 2025-06-17-205547 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160759 (https://phabricator.wikimedia.org/T394401) (owner: 10Jforrester)
[14:01:08] <wikibugs>	 (03CR) 10Elukey: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar)
[14:02:33] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-06-09-163022 to 2025-06-17-205547 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160759 (https://phabricator.wikimedia.org/T394401) (owner: 10Jforrester)
[14:03:21] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:03:54] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:04:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:04:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[14:05:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2006']
[14:05:41] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2006']
[14:06:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2006.codfw.wmnet with OS bullseye
[14:06:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928429 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2006.codfw.wmnet with OS bullseye
[14:08:16] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:08:56] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:09:00] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:09:41] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:10:29] <logmsgbot>	 !log jnuche@deploy1003 Installing scap version "4.178.3" for 4 host(s)
[14:10:46] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-06-10-144243 to 2025-06-18-130945 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160760 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester)
[14:11:29] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2025/2026-Q1): Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#10928440 (10lmata)
[14:12:21] <wikibugs>	 (03PS3) 10Jgreen: nsca_frack.cfg.erb break out trino hostgroup, add trino API check [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259)
[14:12:54] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-06-10-144243 to 2025-06-18-130945 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160760 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester)
[14:13:21] <logmsgbot>	 !log jnuche@deploy1003 Installation of scap version "4.178.3" completed for 4 hosts
[14:13:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P78360 and previous config saved to /var/cache/conftool/dbconfig/20250618-141322-marostegui.json
[14:13:46] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:15:24] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:15:48] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrus-streaming-updater: raise jobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160780 (https://phabricator.wikimedia.org/T397335) (owner: 10Bking)
[14:16:08] <wikibugs>	 (03PS1) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395823)
[14:17:19] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:17:29] <logmsgbot>	 jhancock@cumin2002 provision (PID 76583) is awaiting input
[14:17:51] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:18:04] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:18:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:18:51] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:19:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2007']
[14:19:33] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2007']
[14:19:36] <wikibugs>	 (03CR) 10Jforrester: [C:04-1] wikifunctions: Enable memcached-based batching for ZObjects [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester)
[14:20:17] <wikibugs>	 (03CR) 10Jforrester: [C:04-1] "Not deploying right now as there's an issue making it hard for us to inspect staging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160761 (https://phabricator.wikimedia.org/T390550) (owner: 10Jforrester)
[14:20:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007.codfw.wmnet with OS bullseye
[14:20:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928484 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2007.codfw.wmnet with OS bullseye
[14:21:26] <wikibugs>	 (03PS1) 10Dbrant: Add 'wikipedia:' to list of recognized protocols. [core] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160802 (https://phabricator.wikimedia.org/T386004)
[14:21:44] <logmsgbot>	 !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[14:21:59] <logmsgbot>	 !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:24:14] <logmsgbot>	 !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[14:24:22] <logmsgbot>	 !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:24:32] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:25:18] <logmsgbot>	 jhancock@cumin2002 reimage (PID 78485) is awaiting input
[14:26:03] <logmsgbot>	 jmm@cumin1003 drain-node (PID 2111453) is awaiting input
[14:26:54] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:28:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78363 and previous config saved to /var/cache/conftool/dbconfig/20250618-142829-marostegui.json
[14:28:35] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[14:28:45] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[14:28:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T396130)', diff saved to https://phabricator.wikimedia.org/P78364 and previous config saved to /var/cache/conftool/dbconfig/20250618-142852-marostegui.json
[14:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1400)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1430)
[14:30:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.07%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[14:35:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.07%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[14:40:22] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813
[14:40:26] <James_F>	 !log Running `mwscript-k8s --php_version=8.1 -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --cache --verbose --zType Z8` for T396449
[14:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:30] <stashbot>	 T396449: WikifunctionsPFragmentHandler::fetchFunctionFromCache cache miss while fetching Z20744 for empty argument Z20744K1 - https://phabricator.wikimedia.org/T396449
[14:41:14] <logmsgbot>	 jhancock@cumin2002 reimage (PID 79274) is awaiting input
[14:41:33] <logmsgbot>	 !log bking@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:41:41] <logmsgbot>	 !log bking@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:41:42] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] profile::docker::reporter: add another not supported image [puppet] - 10https://gerrit.wikimedia.org/r/1160771 (owner: 10Elukey)
[14:42:03] <wikibugs>	 (03PS1) 10JMeybohm: k8s.pool-depool-cluster: Black format [cookbooks] - 10https://gerrit.wikimedia.org/r/1160816 (https://phabricator.wikimedia.org/T397148)
[14:42:05] <wikibugs>	 (03PS1) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817
[14:45:19] <logmsgbot>	 !log bking@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:45:24] <logmsgbot>	 !log bking@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:45:41] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] No longer use mirrors.debian.org on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1160171 (owner: 10Muehlenhoff)
[14:47:05] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1048.eqiad.wmnet
[14:48:13] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[14:49:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T396130)', diff saved to https://phabricator.wikimedia.org/P78365 and previous config saved to /var/cache/conftool/dbconfig/20250618-144903-marostegui.json
[14:49:08] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[14:49:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (owner: 10JMeybohm)
[14:50:19] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add another not supported image [puppet] - 10https://gerrit.wikimedia.org/r/1160771 (owner: 10Elukey)
[14:50:25] <swfrench-wmf>	 !log reprepro included conftool 5.3.0 in apt.wikimedia.org - T395696
[14:50:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:30] <stashbot>	 T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696
[14:50:47] <elukey>	 moritzm: ok to merge?
[14:51:33] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir7002.magru.wmnet
[14:51:37] <logmsgbot>	 !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir7002.magru.wmnet
[14:51:57] <moritzm>	 elukey: give me 30 seconds
[14:52:23] <moritzm>	 elukey: first needed to disable puppet, can be merged now
[14:53:15] <elukey>	 running it
[14:54:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] nsca_frack.cfg.erb break out trino hostgroup, add trino API check [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen)
[14:56:29] <wikibugs>	 (03PS1) 10MVernon: thanos: remove drained thanos-be100[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160824 (https://phabricator.wikimedia.org/T391352)
[14:57:37] <wikibugs>	 (03PS2) 10MVernon: thanos: remove drained thanos-be100[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160824 (https://phabricator.wikimedia.org/T391352)
[14:59:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[15:00:09] <dancy>	 jouncebot nowandnext
[15:00:09] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 59 minute(s)
[15:00:09] <jouncebot>	 In 1 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1700)
[15:01:13] <logmsgbot>	 !log dancy@deploy1003 Started scap sync-world: Testing T396166
[15:01:19] <stashbot>	 T396166: Are `php_fpm`/`php_version` inside `scap.cfg` used anymore? - https://phabricator.wikimedia.org/T396166
[15:03:08] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet
[15:04:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[15:04:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P78366 and previous config saved to /var/cache/conftool/dbconfig/20250618-150410-marostegui.json
[15:06:24] <logmsgbot>	 btullis@cumin1003 upgrade-firmware (PID 2125299) is awaiting input
[15:06:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:50] <logmsgbot>	 !log dancy@deploy1003 Finished scap sync-world: Testing T396166 (duration: 08m 37s)
[15:09:56] <stashbot>	 T396166: Are `php_fpm`/`php_version` inside `scap.cfg` used anymore? - https://phabricator.wikimedia.org/T396166
[15:12:24] <wikibugs>	 (03PS13) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794)
[15:13:19] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM! Nice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis)
[15:15:56] <wikibugs>	 (03PS14) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794)
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:17:11] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sde) failed in ms-be1071 - https://phabricator.wikimedia.org/T397343 (10MatthewVernon) 03NEW
[15:17:36] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis)
[15:17:48] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sde) failed in ms-be1071 - https://phabricator.wikimedia.org/T397343#10928670 (10MatthewVernon) p:05Triage→03High [this is blocking ongoing load/drain operations for the eqiad ms cluster]
[15:18:42] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sde) failed in ms-be1071 - https://phabricator.wikimedia.org/T397343#10928673 (10MatthewVernon)
[15:19:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P78367 and previous config saved to /var/cache/conftool/dbconfig/20250618-151918-marostegui.json
[15:19:27] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 (owner: 10Hnowlan)
[15:19:33] <wikibugs>	 (03CR) 10AOkoth: miscweb: add os-reports update mechanism (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[15:23:38] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160842 (https://phabricator.wikimedia.org/T216601)
[15:23:44] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 (owner: 10Hnowlan)
[15:24:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.889s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:24:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10928689 (10MoritzMuehlenhoff) >>! In T396660#10923171, @MoritzMuehlenhoff wrote: >> While reviewing `/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` I noticed that we have a mixtu...
[15:24:57] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 (owner: 10Hnowlan)
[15:25:20] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Remove resource limits for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160813 (owner: 10Hnowlan)
[15:26:04] <wikibugs>	 (03CR) 10Jakob: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160842 (https://phabricator.wikimedia.org/T216601) (owner: 10Lucas Werkmeister (WMDE))
[15:26:20] <Lucas_WMDE>	 jouncebot: now
[15:26:20] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 33 minute(s)
[15:26:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "deploying" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160842 (https://phabricator.wikimedia.org/T216601) (owner: 10Lucas Werkmeister (WMDE))
[15:27:24] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:28:15] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160842 (https://phabricator.wikimedia.org/T216601) (owner: 10Lucas Werkmeister (WMDE))
[15:29:01] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:29:28] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:29:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[15:30:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[15:30:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[15:30:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[15:31:23] <wikibugs>	 (03PS2) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817
[15:31:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[15:31:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[15:34:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.914s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:34:24] <wikibugs>	 (03PS2) 10Tchanders: Configure event stream for IP auto-reveal instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155725 (https://phabricator.wikimedia.org/T387600)
[15:34:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T396130)', diff saved to https://phabricator.wikimedia.org/P78368 and previous config saved to /var/cache/conftool/dbconfig/20250618-153425-marostegui.json
[15:34:32] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[15:34:42] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[15:34:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T396130)', diff saved to https://phabricator.wikimedia.org/P78369 and previous config saved to /var/cache/conftool/dbconfig/20250618-153448-marostegui.json
[15:35:03] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] thanos: remove drained thanos-be100[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160824 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon)
[15:35:15] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2006.codfw.wmnet with OS bullseye
[15:35:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928756 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2006.codfw.wmnet with OS bullseye ex...
[15:35:20] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Release 9.2.11-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1160766 (https://phabricator.wikimedia.org/T397308) (owner: 10Ssingh)
[15:35:58] <wikibugs>	 (03CR) 10MVernon: [C:03+2] thanos: remove drained thanos-be100[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160824 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon)
[15:37:02] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007.codfw.wmnet with OS bullseye
[15:37:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928762 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2007.codfw.wmnet with OS bullseye ex...
[15:38:18] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007.codfw.wmnet with OS bullseye
[15:38:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2007.codfw.wmnet with OS bullseye
[15:39:33] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:41:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-codfw and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581)
[15:41:07] <stashbot>	 T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581
[15:44:30] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2005.codfw.wmnet with OS bullseye
[15:44:41] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928800 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2005.codfw.wmnet with OS bullseye ex...
[15:45:27] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10928801 (10Jhancock.wm) @Fabfur we will unfortunately have to use UEFI on these machines. Could you update partman to make those changes. Then i can proceed. I'm working...
[15:45:42] <brett>	 !log Depooling cp7001 for firmware upgrades re: thermal support ticket - T386959
[15:45:45] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.*
[15:45:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:47] <stashbot>	 T386959: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959
[15:46:06] <wikibugs>	 (03PS1) 10MVernon: thanos: add new backends, remove old ones gone from rings [puppet] - 10https://gerrit.wikimedia.org/r/1160855 (https://phabricator.wikimedia.org/T391352)
[15:46:10] <wikibugs>	 (03PS1) 10MVernon: thanos: add new nodes to ring, drain old ones [puppet] - 10https://gerrit.wikimedia.org/r/1160856 (https://phabricator.wikimedia.org/T392908)
[15:46:44] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Revert "Enable new mobile search experience everywhere (not including empty search recommendations)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858
[15:47:46] <wikibugs>	 (03PS2) 10Kimberly Sarabia: Revert "Enable new mobile search experience everywhere (not including empty search recommendations)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858
[15:48:22] <wikibugs>	 (03PS1) 10Hnowlan: admin_ng: remove quotas and ranges for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160860
[15:48:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928813 (10Jhancock.wm)
[15:52:44] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:52:54] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] admin_ng: remove quotas and ranges for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160860 (owner: 10Hnowlan)
[15:52:58] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Release 9.2.11-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1160766 (https://phabricator.wikimedia.org/T397308) (owner: 10Ssingh)
[15:54:54] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 (owner: 10Kimberly Sarabia)
[15:54:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T396130)', diff saved to https://phabricator.wikimedia.org/P78370 and previous config saved to /var/cache/conftool/dbconfig/20250618-155455-marostegui.json
[15:55:01] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[15:56:11] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] admin_ng: remove quotas and ranges for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160860 (owner: 10Hnowlan)
[15:57:58] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:58:12] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet
[15:58:59] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:59:31] <swfrench-wmf>	 !log deployed conftool 5.3.0 to all bullseye and bookworm hosts - T395696
[15:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:36] <stashbot>	 T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696
[16:00:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.05%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[16:01:06] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10928851 (10BTullis) >>! In T394543#10927811, @Volans wrote: > If you try to re-run it it does tell you there is nothing to upgrade right?  I can confirm t...
[16:01:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10928854 (10BTullis) 05Open→03Resolved
[16:02:49] <logmsgbot>	 !log btullis@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-mariadb1002.eqiad.wmnet with reason: Upgrading SSD firmware
[16:02:53] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10928859 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fe78ceb4-644b-4a5a-a80d-c1b0a1c98616) set by btullis@cumin1003 for 1:00:00...
[16:03:07] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: remove quotas and ranges for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160860 (owner: 10Hnowlan)
[16:03:11] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:03:32] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-mariadb1002.eqiad.wmnet
[16:05:03] <wikibugs>	 (03CR) 10LorenMora: [C:03+1] Revert "Enable new mobile search experience everywhere (not including empty search recommendations)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 (owner: 10Kimberly Sarabia)
[16:05:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.08%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[16:07:15] <logmsgbot>	 btullis@cumin1003 upgrade-firmware (PID 2131701) is awaiting input
[16:07:46] <wikibugs>	 (03CR) 10Hashar: "I had the issue with docker-pkg for quite a while and I came to fix it as I went to address I6f1a443473ae92f24651fd9879b8c156d5adb2c5" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156733 (owner: 10Hashar)
[16:08:23] <logmsgbot>	 jhancock@cumin1003 provision (PID 2131666) is awaiting input
[16:09:24] <wikibugs>	 (03PS1) 10Cwhite: logstash: drop mobileapps detail field [puppet] - 10https://gerrit.wikimedia.org/r/1160875 (https://phabricator.wikimedia.org/T390215)
[16:10:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P78371 and previous config saved to /var/cache/conftool/dbconfig/20250618-161003-marostegui.json
[16:10:17] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp7001.magru.wmnet with reason: BIOS upgrades
[16:10:28] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:10:34] <wikibugs>	 (03PS2) 10Cwhite: logstash: drop mobileapps detail field [puppet] - 10https://gerrit.wikimedia.org/r/1160875 (https://phabricator.wikimedia.org/T390215)
[16:10:42] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet
[16:10:45] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:11:43] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:11:51] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10928876 (10BTullis)
[16:12:18] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10928879 (10BTullis)
[16:13:09] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: drop mobileapps detail field [puppet] - 10https://gerrit.wikimedia.org/r/1160875 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite)
[16:13:43] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10928884 (10Volans) yes if you pick the same version (option 0 above) it would just tell you that there is nothing to do because already at the same versio...
[16:15:47] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:16:48] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:18:04] <wikibugs>	 (03PS3) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817
[16:18:13] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10928900 (10Jhancock.wm) also looks like i'm gonna need to drag @elukey into this.  I manually set the ip and the user password for these servers but i still can't get a...
[16:19:39] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:19:56] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005.codfw.wmnet with OS bullseye
[16:20:06] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet
[16:20:08] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts an-mariadb1002.eqiad.wmnet
[16:20:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10928914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2005.codfw.wmnet with OS bullseye
[16:20:47] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:20:49] <wikibugs>	 (03CR) 10Hashar: Change log format to get name of image being built (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar)
[16:21:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:21:52] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:esams or A:drmrs and A:cp - 9.2.10 upgrade (T390912)
[16:21:57] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[16:23:47] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:25:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P78372 and previous config saved to /var/cache/conftool/dbconfig/20250618-162511-marostegui.json
[16:26:55] <wikibugs>	 (03PS4) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817
[16:27:07] <wikibugs>	 (03Abandoned) 10Aqu: Bump MW Page content change app version [deployment-charts] - 10https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: 10Aqu)
[16:28:01] <wikibugs>	 (03PS1) 10Btullis: Prepare for renaming kafka-stretc200[1-2] to dse-k8s-worker200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789)
[16:28:25] <wikibugs>	 (03PS2) 10Btullis: Prepare for renaming kafka-stretch200[1-2] to dse-k8s-worker200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789)
[16:29:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[16:29:53] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] phabricator::migration: ensure /srv/phab is the correct symlink [puppet] - 10https://gerrit.wikimedia.org/r/1160310 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[16:30:33] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:31:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:33:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:33:46] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:34:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[16:34:10] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:34:30] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[16:34:43] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[16:34:43] <hnowlan>	 jouncebot: nowandnext
[16:34:43] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 25 minute(s)
[16:34:43] <jouncebot>	 In 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1700)
[16:35:30] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[16:37:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10929005 (10Jhancock.wm) @Andrew i got these to the point where the image is on them, but for some reason it's not syncing with the puppetdb. Could you chec...
[16:37:55] <wikibugs>	 (03Merged) 10jenkins-bot: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[16:38:26] <wikibugs>	 (03PS1) 10Jgiannelos: RB sunset: Abandon event processing for PCS events older than cache TTL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897
[16:39:21] <wikibugs>	 (03PS2) 10Jgiannelos: RB sunset: Abandon event processing for PCS events older than cache TTL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072)
[16:40:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T396130)', diff saved to https://phabricator.wikimedia.org/P78373 and previous config saved to /var/cache/conftool/dbconfig/20250618-164019-marostegui.json
[16:40:24] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[16:40:30] <wikibugs>	 (03PS3) 10Jgiannelos: RB sunset: Abandon event processing for PCS events older than cache TTL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072)
[16:40:34] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[16:40:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T396130)', diff saved to https://phabricator.wikimedia.org/P78374 and previous config saved to /var/cache/conftool/dbconfig/20250618-164041-marostegui.json
[16:41:00] <logmsgbot>	 jmm@cumin1003 drain-node (PID 2122878) is awaiting input
[16:41:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] "Couple of inline comments. Overall, this should work (probably does, I see it is merged already, however I am replying to the review reque" [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[16:41:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] RB sunset: Abandon event processing for PCS events older than cache TTL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos)
[16:41:54] <wikibugs>	 (03PS4) 10Jgiannelos: RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072)
[16:43:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos)
[16:44:19] <wikibugs>	 (03CR) 10Hnowlan: [C:04-1] RB sunset: Configure claim TTL for PCS related endpoints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos)
[16:45:38] <wikibugs>	 (03PS5) 10Jgiannelos: RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072)
[16:47:01] <ChrisDobbins901_>	 !log cdobbins@cumin2002:~$ sudo -i cookbook sre.cdn.roll-upgrade-ats --query 'A:cp-eqsin' --task-id T390912 --reason '9.2.10 upgrade'
[16:47:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos)
[16:47:03] <wikibugs>	 (03PS5) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148)
[16:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:06] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[16:47:26] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-eqsin and A:cp - 9.2.10 upgrade (T390912)
[16:48:25] <dancy>	 jouncebot nowandnext
[16:48:26] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 11 minute(s)
[16:48:26] <jouncebot>	 In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1700)
[16:49:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136044 (https://phabricator.wikimedia.org/T364694) (owner: 10Aklapper)
[16:50:01] <wikibugs>	 (03Merged) 10jenkins-bot: Update entries on https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136044 (https://phabricator.wikimedia.org/T364694) (owner: 10Aklapper)
[16:50:04] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2005.codfw.wmnet with OS bullseye
[16:50:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10929057 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2005.codfw.wmnet with OS bullseye ex...
[16:50:31] <logmsgbot>	 !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1136044|Update entries on https://www.mediawiki.org/keys/keys.html (T364694)]]
[16:50:36] <stashbot>	 T364694: https://www.mediawiki.org/keys/ needs update - https://phabricator.wikimedia.org/T364694
[16:52:46] <logmsgbot>	 !log dancy@deploy1003 dancy, aklapper: Backport for [[gerrit:1136044|Update entries on https://www.mediawiki.org/keys/keys.html (T364694)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:53:41] <logmsgbot>	 !log dancy@deploy1003 dancy, aklapper: Continuing with sync
[16:55:50] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[16:56:51] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007.codfw.wmnet with OS bullseye
[16:57:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10929065 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2007.codfw.wmnet with OS bullseye ex...
[16:59:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[16:59:20] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:59:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1700)
[17:00:41] <logmsgbot>	 !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136044|Update entries on https://www.mediawiki.org/keys/keys.html (T364694)]] (duration: 10m 09s)
[17:00:43] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb2011 to codfw - jhancock@cumin1003"
[17:00:46] <stashbot>	 T364694: https://www.mediawiki.org/keys/ needs update - https://phabricator.wikimedia.org/T364694
[17:01:00] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding rdb2011 to codfw - jhancock@cumin1003"
[17:01:00] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:01:04] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker2006
[17:01:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T396130)', diff saved to https://phabricator.wikimedia.org/P78375 and previous config saved to /var/cache/conftool/dbconfig/20250618-170109-marostegui.json
[17:01:15] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[17:01:16] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker2006
[17:01:19] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker2007
[17:01:28] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker2007
[17:01:31] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker2008
[17:01:41] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker2008
[17:01:43] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker2009
[17:01:54] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker2009
[17:01:56] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host build2003
[17:02:06] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host build2003
[17:02:11] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host rdb2011
[17:02:20] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb2011
[17:02:26] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host rdb2012
[17:02:41] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host rdb2012
[17:02:43] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2003
[17:02:52] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2003
[17:03:14] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2004
[17:03:24] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2004
[17:03:27] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2006
[17:03:42] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2006
[17:03:45] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host sretest2009
[17:03:54] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2009
[17:04:07] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[17:04:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[17:04:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:04:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:06:41] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:07:12] <wikibugs>	 (PS1) Hnowlan: Revert "mobileapps: Remove resource limits for canaries" [deployment-charts] - https://gerrit.wikimedia.org/r/1160914
[17:09:20] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:09:39] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:09:40] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:11:17] <wikibugs>	 (CR) Hnowlan: [C:+2] Revert "mobileapps: Remove resource limits for canaries" [deployment-charts] - https://gerrit.wikimedia.org/r/1160914 (owner: Hnowlan)
[17:11:38] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:12:35] <logmsgbot>	 jhancock@cumin1003 provision (PID 2138389) is awaiting input
[17:13:01] <logmsgbot>	 jhancock@cumin1003 provision (PID 2138412) is awaiting input
[17:13:08] <wikibugs>	 (Merged) jenkins-bot: Revert "mobileapps: Remove resource limits for canaries" [deployment-charts] - https://gerrit.wikimedia.org/r/1160914 (owner: Hnowlan)
[17:13:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.334s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:15:51] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[17:16:09] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[17:16:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P78376 and previous config saved to /var/cache/conftool/dbconfig/20250618-171617-marostegui.json
[17:17:07] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[17:17:39] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[17:18:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.111s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:22:09] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:24:46] <wikibugs>	 (CR) Scott French: "Thanks for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[17:24:51] <wikibugs>	 (CR) Scott French: [C:+2] shellbox: migrate to bookworm-based httpd image [deployment-charts] - https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[17:27:00] <wikibugs>	 (Merged) jenkins-bot: shellbox: migrate to bookworm-based httpd image [deployment-charts] - https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[17:27:53] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply
[17:28:19] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[17:28:20] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[17:28:30] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-eqiad and A:cp - 9.2.10 upgrade (T390912)
[17:28:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[17:28:34] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[17:28:35] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[17:28:45] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[17:28:46] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:28:54] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:28:56] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[17:29:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[17:29:14] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[17:29:15] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[17:29:37] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[17:31:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P78377 and previous config saved to /var/cache/conftool/dbconfig/20250618-173124-marostegui.json
[17:31:43] <wikibugs>	 (PS1) Majavah: hieradata: Fix Cloud VPS radosgw image CSP [puppet] - https://gerrit.wikimedia.org/r/1160934 (https://phabricator.wikimedia.org/T397351)
[17:31:56] <wikibugs>	 (PS1) Hashar: cloudlb: remove erroneous CSP policy [puppet] - https://gerrit.wikimedia.org/r/1160935 (https://phabricator.wikimedia.org/T397351)
[17:32:41] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply
[17:33:27] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[17:33:47] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10929205 (Andrew) @Jhancock.wm I will have a look. I've also just noticed that the names for these servers is wrong, everything should be -dev. I'll updat...
[17:33:58] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[17:34:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[17:34:17] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[17:34:49] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[17:35:05] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[17:35:26] <wikibugs>	 (PS1) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197)
[17:35:34] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.*
[17:35:36] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:35:39] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:36:10] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[17:36:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[17:36:55] <wikibugs>	 (CR) CI reject: [V:-1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis)
[17:37:05] <wikibugs>	 (PS1) Hashar: cloudlb: allow inline data in Object Storage content page [puppet] - https://gerrit.wikimedia.org/r/1160941 (https://phabricator.wikimedia.org/T397351)
[17:37:05] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[17:37:31] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[17:37:41] <wikibugs>	 (Abandoned) Majavah: hieradata: Fix Cloud VPS radosgw image CSP [puppet] - https://gerrit.wikimedia.org/r/1160934 (https://phabricator.wikimedia.org/T397351) (owner: Majavah)
[17:38:25] <wikibugs>	 (CR) Majavah: [C:+2] cloudlb: remove erroneous CSP policy [puppet] - https://gerrit.wikimedia.org/r/1160935 (https://phabricator.wikimedia.org/T397351) (owner: Hashar)
[17:38:36] <wikibugs>	 (CR) Majavah: [C:+2] cloudlb: allow inline data in Object Storage content page [puppet] - https://gerrit.wikimedia.org/r/1160941 (https://phabricator.wikimedia.org/T397351) (owner: Hashar)
[17:39:31] <swfrench-wmf>	 !log migrated all shellbox instances to bookworm-based httpd images in codfw - T378128
[17:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:35] <stashbot>	 T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128
[17:39:39] <wikibugs>	 (CR) Eevans: [C:+1] thanos: add new backends, remove old ones gone from rings [puppet] - https://gerrit.wikimedia.org/r/1160855 (https://phabricator.wikimedia.org/T391352) (owner: MVernon)
[17:39:55] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:40:09] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10929229 (Andrew)
[17:40:26] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:40:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:42:21] <wikibugs>	 (PS2) Ladsgroup: conftool: Allow es6 and es7 being set to read only via dbctl [puppet] - https://gerrit.wikimedia.org/r/1153723 (https://phabricator.wikimedia.org/T395696)
[17:42:31] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] conftool: Allow es6 and es7 being set to read only via dbctl [puppet] - https://gerrit.wikimedia.org/r/1153723 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup)
[17:43:35] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:43:49] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:44:00] <wikibugs>	 (PS2) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197)
[17:44:14] <wikibugs>	 (CR) Eevans: [C:+1] thanos: add new nodes to ring, drain old ones [puppet] - https://gerrit.wikimedia.org/r/1160856 (https://phabricator.wikimedia.org/T392908) (owner: MVernon)
[17:44:47] <wikibugs>	 (PS3) Btullis: Prepare for renaming kafka-stretch200[1-2] to dse-k8s-worker200[1-2] [puppet] - https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789)
[17:46:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T396130)', diff saved to https://phabricator.wikimedia.org/P78378 and previous config saved to /var/cache/conftool/dbconfig/20250618-174632-marostegui.json
[17:46:37] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[17:46:47] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[17:48:58] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:49:16] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:49:16] <wikibugs>	 SRE, Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10929269 (BCornwall) @siebrand was able to disable dnssec - once that's propagated we should hopefully be golden.
[17:49:38] <wikibugs>	 (PS3) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197)
[17:49:52] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:50:06] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:50:20] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:50:28] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host rdb2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:50:32] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10929271 (Jhancock.wm) a:akosiaris
[17:51:00] <wikibugs>	 (CR) CI reject: [V:-1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis)
[17:51:18] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install rdb201[12] - https://phabricator.wikimedia.org/T393121#10929274 (Jhancock.wm) @akosiaris can you add these two servers to site.pp for me please? i saw they're already covered in preseed. Should be able to hand these over to you pretty quic...
[17:51:43] <wikibugs>	 (PS4) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197)
[17:52:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db2155 for queries (T385167)', diff saved to https://phabricator.wikimedia.org/P78379 and previous config saved to /var/cache/conftool/dbconfig/20250618-175206-ladsgroup.json
[17:52:12] <stashbot>	 T385167: Run data migration script for file migration - https://phabricator.wikimedia.org/T385167
[17:53:19] <wikibugs>	 (CR) CI reject: [V:-1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis)
[17:53:27] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[17:54:11] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Running queries (T385167)
[17:54:12] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[17:54:44] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[17:54:59] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[17:55:30] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[17:55:45] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[17:56:16] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[17:56:18] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[17:56:50] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[17:57:13] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[17:57:44] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[17:57:45] <wikibugs>	 (PS5) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197)
[17:58:15] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[17:58:53] <swfrench-wmf>	 !log migrated all shellbox instances to bookworm-based httpd images in eqiad - T378128
[17:58:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:58] <stashbot>	 T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128
[17:59:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.04%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[17:59:12] <wikibugs>	 (CR) CI reject: [V:-1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis)
[17:59:40] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1048.eqiad.wmnet
[18:00:05] <jouncebot>	 hashar and brennen: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1800)
[18:00:22] <brennen>	 o/
[18:00:25] <wikibugs>	 (PS6) Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197)
[18:00:27] <brennen>	 nothing for this window, afaik.
[18:03:21] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:03:37] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:03:53] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:04:09] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:04:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.16%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[18:05:11] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[18:07:25] <logmsgbot>	 jhancock@cumin1003 provision (PID 2144116) is awaiting input
[18:07:35] <wikibugs>	 (PS5) Ebernhardson: cirrus: Add services for read operations [mediawiki-config] - https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553)
[18:07:35] <wikibugs>	 (PS6) Ebernhardson: Use discovery dns for elasticsearch read traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553)
[18:07:51] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[18:08:04] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[18:13:00] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Deploy arclamp
[18:13:22] <logmsgbot>	 !log ladsgroup@deploy1003 sync-world aborted: Deploy arclamp (duration: 00m 33s)
[18:14:10] <logmsgbot>	 !log ladsgroup@deploy1003 Started deploy [performance/arc-lamp@76afb89]: Deploy arclamp
[18:14:19] <logmsgbot>	 !log ladsgroup@deploy1003 Finished deploy [performance/arc-lamp@76afb89]: Deploy arclamp (duration: 00m 08s)
[18:15:24] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:16:10] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:16:55] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:17:06] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:17:22] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:17:50] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:17:52] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:18:42] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:19:47] <wikibugs>	 (CR) Bking: [C:+1] Prepare for renaming kafka-stretch200[1-2] to dse-k8s-worker200[1-2] [puppet] - https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789) (owner: Btullis)
[18:20:34] <logmsgbot>	 jhancock@cumin1003 provision (PID 2144596) is awaiting input
[18:22:19] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:23:06] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1251.eqiad.wmnet with reason: Maintenance
[18:23:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T396130)', diff saved to https://phabricator.wikimedia.org/P78381 and previous config saved to /var/cache/conftool/dbconfig/20250618-182313-marostegui.json
[18:23:15] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:23:19] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[18:24:05] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:24:15] <Amir1>	 jouncebot: nowandnext
[18:24:16] <jouncebot>	 For the next 1 hour(s) and 35 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T1800)
[18:24:16] <jouncebot>	 In 1 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T2000)
[18:24:53] <Amir1>	 okay, deploying something then
[18:25:58] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:26:14] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:26:21] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:26:22] <brennen>	 yeah, all yours Amir1.
[18:26:34] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:26:37] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:26:42] <Amir1>	 Thanks!
[18:26:58] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:27:05] <wikibugs>	 (CR) Ladsgroup: [C:+2] etcd: Remove ES clusters from "write clusters" if section is RO [mediawiki-config] - https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup)
[18:27:29] <swfrench-wmf>	 \o/
[18:27:43] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:27:55] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:28:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup)
[18:28:19] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[18:28:37] <wikibugs>	 (Merged) jenkins-bot: etcd: Remove ES clusters from "write clusters" if section is RO [mediawiki-config] - https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup)
[18:28:59] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1152853|etcd: Remove ES clusters from "write clusters" if section is RO (T395696)]]
[18:29:03] <stashbot>	 T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696
[18:31:14] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1152853|etcd: Remove ES clusters from "write clusters" if section is RO (T395696)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:38:38] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-eqsin and A:cp - 9.2.10 upgrade (T390912)
[18:38:42] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[18:38:52] <wikibugs>	 (03CR) 10BCornwall: [V:04-1 C:04-1] "Presently this fails varnish tests. Comments inline!" [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins)
[18:43:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Testing T395696', diff saved to https://phabricator.wikimedia.org/P78382 and previous config saved to /var/cache/conftool/dbconfig/20250618-184325-ladsgroup.json
[18:43:31] <stashbot>	 T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696
[18:45:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T396130)', diff saved to https://phabricator.wikimedia.org/P78383 and previous config saved to /var/cache/conftool/dbconfig/20250618-184538-marostegui.json
[18:45:43] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[18:47:59] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-eqiad and A:cp - 9.2.10 upgrade (T390912)
[18:48:04] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[18:49:03] <wikibugs>	 (03PS1) 10Ladsgroup: etcd: Check for array key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160990 (https://phabricator.wikimedia.org/T395696)
[18:49:05] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[18:51:35] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Good catch!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160990 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup)
[18:51:39] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] etcd: Check for array key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160990 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup)
[18:52:33] <wikibugs>	 (03Merged) 10jenkins-bot: etcd: Check for array key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160990 (https://phabricator.wikimedia.org/T395696) (owner: 10Ladsgroup)
[18:55:54] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152853|etcd: Remove ES clusters from "write clusters" if section is RO (T395696)]] (duration: 26m 55s)
[18:55:59] <stashbot>	 T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696
[18:56:18] <logmsgbot>	 !log ryankemper@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on 6 hosts with reason: T395772 hosts not serving production traffic
[18:56:22] <stashbot>	 T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772
[18:57:20] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1160990|etcd: Check for array key (T395696)]]
[18:59:42] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1160990|etcd: Check for array key (T395696)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:00:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P78384 and previous config saved to /var/cache/conftool/dbconfig/20250618-190045-marostegui.json
[19:03:01] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[19:05:16] <ChrisDobbins901_>	 !log cdobbins@cumin2002:~$ sudo -i cookbook sre.cdn.roll-upgrade-ats --query 'A:cp-codfw' --task-id T390912 --reason '9.2.10 upgrade'
[19:05:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:22] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[19:05:23] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-codfw and A:cp - 9.2.10 upgrade (T390912)
[19:09:59] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160990|etcd: Check for array key (T395696)]] (duration: 12m 39s)
[19:10:04] <stashbot>	 T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696
[19:14:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Testing T395696', diff saved to https://phabricator.wikimedia.org/P78385 and previous config saved to /var/cache/conftool/dbconfig/20250618-191440-ladsgroup.json
[19:15:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P78386 and previous config saved to /var/cache/conftool/dbconfig/20250618-191553-marostegui.json
[19:17:54] <wikibugs>	 (03PS3) 10NMW03: Set category collation to "uca-az" for Azerbaijani projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896)
[19:19:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) (owner: 10NMW03)
[19:25:21] <Nemoralis>	 !ping
[19:25:21] <wm-bot>	 pong
[19:27:32] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: fork SLOs for wdqs-main and wdqs-scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper)
[19:30:43] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2006.codfw.wmnet with OS bookworm
[19:30:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host aux-k8s-worker2006.codfw.wmnet with OS bookworm
[19:31:01] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2007.codfw.wmnet with OS bookworm
[19:31:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T396130)', diff saved to https://phabricator.wikimedia.org/P78387 and previous config saved to /var/cache/conftool/dbconfig/20250618-193101-marostegui.json
[19:31:07] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host aux-k8s-worker2007.codfw.wmnet with OS bookworm
[19:31:08] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[19:31:15] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2008.codfw.wmnet with OS bookworm
[19:31:16] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:31:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929510 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host aux-k8s-worker2008.codfw.wmnet with OS bookworm
[19:31:31] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2009.codfw.wmnet with OS bookworm
[19:31:37] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929511 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host aux-k8s-worker2009.codfw.wmnet with OS bookworm
[19:32:45] <ryankemper>	 !log T393966 Ran puppet on `titan1001` following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155335. Puppet looks happy and I see the new recording rules getting created
[19:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:49] <stashbot>	 T393966: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966
[19:40:24] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-scholarly-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[19:43:27] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage
[19:43:33] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: absent old availability metric [puppet] - 10https://gerrit.wikimedia.org/r/1161024 (https://phabricator.wikimedia.org/T393966)
[19:43:34] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs: remove previously-absented slo [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966)
[19:43:41] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage
[19:43:59] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage
[19:44:08] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage
[19:45:24] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[19:47:16] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage
[19:47:33] <wikibugs>	 (03CR) 10Herron: [C:03+1] wdqs: remove previously-absented slo [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper)
[19:50:50] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage
[19:51:22] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove previously-absented slo [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper)
[19:51:34] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] "oops, meant to +2 other patch first" [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper)
[19:51:47] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: absent old availability metric [puppet] - 10https://gerrit.wikimedia.org/r/1161024 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper)
[19:53:33] <wikibugs>	 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10929587 (10RKemper)
[19:54:41] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.179.0" for 2 host(s)
[19:54:52] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage
[19:55:13] <icinga-wm>	 PROBLEM - MD RAID on logstash2035 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:55:14] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on logstash2035 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T397366 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:55:24] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage
[19:55:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on logstash2035 - https://phabricator.wikimedia.org/T397366 (10ops-monitoring-bot) 03NEW
[19:55:49] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash2035 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:56:30] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.179.0" completed for 2 hosts
[19:57:38] <wikibugs>	 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10929603 (10RKemper) New SLOs/SLIs are in place and old ones have been fully absented.  Agreed with @elukey that we should get the SLOs officially approved (&...
[19:58:41] <wikibugs>	 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#10929605 (10RKemper)
[19:59:45] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T2000).
[20:00:06] <jouncebot>	 kimberly_sarabia, ebernhardson, and Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <ebernhardson>	 here
[20:00:15] <Nemoralis>	 o/
[20:00:17] <kimberly_sarabia>	 hey
[20:00:48] <wikibugs>	 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#10929623 (10RKemper)
[20:01:18] <ebernhardson>	 i suppose i can do the deploy
[20:02:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 (owner: 10Kimberly Sarabia)
[20:02:19] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:02:52] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2005-dev
[20:03:02] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2005-dev
[20:03:06] <kimberly_sarabia>	 ty
[20:03:07] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2006-dev
[20:03:09] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[20:03:18] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2006-dev
[20:03:25] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2007-dev
[20:03:33] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove previously-absented slo [puppet] - 10https://gerrit.wikimedia.org/r/1161025 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper)
[20:03:34] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2007-dev
[20:03:42] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[20:03:43] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2006.codfw.wmnet with OS bookworm
[20:03:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host aux-k8s-worker2006.codfw.wmnet with OS bookworm comple...
[20:03:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable new mobile search experience everywhere (not including empty search recommendations)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160858 (owner: 10Kimberly Sarabia)
[20:04:22] <logmsgbot>	 !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1160858|Revert "Enable new mobile search experience everywhere (not including empty search recommendations)"]]
[20:06:24] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[20:06:35] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson, ksarabia: Backport for [[gerrit:1160858|Revert "Enable new mobile search experience everywhere (not including empty search recommendations)"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:07:14] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[20:07:15] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2009.codfw.wmnet with OS bookworm
[20:07:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host aux-k8s-worker2009.codfw.wmnet with OS bookworm comple...
[20:07:25] <ebernhardson>	 kimberly_sarabia: it's already up on the test servers, can you verify?
[20:08:03] <kimberly_sarabia>	 ebernhardson: I see the revert. thank you LGTM
[20:08:14] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson, ksarabia: Continuing with sync
[20:08:17] <ebernhardson>	 alright, continuing
[20:10:42] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[20:11:05] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[20:11:05] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2007.codfw.wmnet with OS bookworm
[20:11:09] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:11:12] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929649 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host aux-k8s-worker2007.codfw.wmnet with OS bookworm comple...
[20:11:46] <dancy>	 ebernhardson: Can I squeeze in a scap update before the next deployment? It will take about 2 minutes.
[20:11:55] <ebernhardson>	 dancy: yea should be ok
[20:11:58] <dancy>	 thx
[20:12:14] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[20:13:11] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[20:13:12] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2008.codfw.wmnet with OS bookworm
[20:13:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host aux-k8s-worker2008.codfw.wmnet with OS bookworm comple...
[20:13:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929662 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[20:13:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#10929666 (10Jhancock.wm) @akosiaris this one is complete
[20:13:54] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[20:14:33] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[20:15:07] <logmsgbot>	 !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160858|Revert "Enable new mobile search experience everywhere (not including empty search recommendations)"]] (duration: 10m 45s)
[20:16:00] <ebernhardson>	 dancy: alright you're up
[20:18:20] <dancy>	 thx
[20:18:30] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.179.1" for 2 host(s)
[20:20:19] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.179.1" completed for 2 hosts
[20:20:42] <dancy>	 ebernhardson: Done!
[20:21:02] <ebernhardson>	 awesome
[20:21:31] <ebernhardson>	 dancy: hmm, it's acting a little odd
[20:21:51] <ebernhardson>	 oh never mind, im looking at wrong thing :P
[20:22:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:22:40] <ebernhardson>	 dancy: hmm, no it's being weird. I tell it patch 838270, and it tells me about patch 838182
[20:22:50] <dancy>	 Taking a look
[20:23:31] <dancy>	 It's mentioning 838182 due to the Depends-On: Ie6dfb586f6b22867a13b8b29d920da8409e94015 in 838270
[20:23:56] <ebernhardson>	 it doesn't like cross-repo depends-on?  The patch is merged
[20:24:08] <ebernhardson>	 i suppose i can remove that from the commit message
[20:24:20] <dancy>	 you can just answer 'y' to the question if you want to proceed.
[20:24:26] <dancy>	 It's just a warning.
[20:24:40] <ebernhardson>	 ahh, i was worried it would do something awkward since it's talking about not finding 'production' in wikiversions
[20:24:42] <dancy>	 If everything is correct (e.g., the depended-on patch is merged and working), it's ok 
[20:24:54] <dancy>	 The message definitely needs improvement.
[20:25:09] <dancy>	 (e.g., it should mention that it's talking about a dependency of one of the changes you supplied)
[20:25:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson)
[20:25:33] <ebernhardson>	 ok sounds good, thanks for the clarification
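Editor's note: the exchange above turns on a `Depends-On:` footer in the commit message of 838270 pointing at another change. As a rough illustration only, the sketch below pulls such footers out of a commit message. It is not scap's actual logic; the function name `depends_on_ids` and the sample message body are hypothetical, and only the Change-Id itself comes from the log.

```python
import re

# Gerrit Change-Ids are "I" followed by 40 hex characters.
# Assumption: one footer per line, in the form "Depends-On: <Change-Id>".
DEPENDS_ON = re.compile(
    r"^Depends-On:\s*(I[0-9a-f]{40})\s*$",
    re.MULTILINE | re.IGNORECASE,
)


def depends_on_ids(commit_message: str) -> list[str]:
    """Return every Change-Id named in a Depends-On footer, in order."""
    return DEPENDS_ON.findall(commit_message)


# Hypothetical commit message; only the Change-Id is taken from the log.
message = """cirrus: Add services for read operations

Bug: T143553
Depends-On: Ie6dfb586f6b22867a13b8b29d920da8409e94015
"""

print(depends_on_ids(message))
```

A deploy tool could look up each returned Change-Id and warn when the dependency is unmerged or lives in another repository, which is the kind of prompt discussed above.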
[20:26:27] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Add services for read operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson)
[20:26:48] <logmsgbot>	 !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]]
[20:26:53] <stashbot>	 T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553
[20:27:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.109s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:29:07] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:31:11] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-codfw and A:cp - 9.2.10 upgrade (T390912)
[20:31:15] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Continuing with sync
[20:31:17] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[20:31:41] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10929741 (10Jhancock.wm) actually, i think that would have been it. i usually only get that error when it's not in the site.pp file. my bad
[20:36:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.61s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:37:21] <hashar>	 !log gerrit: deleted a bunch of obsolete references under `refs/users/*` across all repositories. See T397317 (private)
[20:37:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:59] <logmsgbot>	 !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]] (duration: 11m 11s)
[20:38:04] <stashbot>	 T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553
[20:38:16] <ebernhardson>	 Nemoralis: you're up next
[20:38:23] <Nemoralis>	 i am here
[20:38:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) (owner: 10NMW03)
[20:39:20] <Nemoralis>	 by the way, you need to run a maintenance script for my patch
[20:39:24] <wikibugs>	 (03Merged) 10jenkins-bot: Set category collation to "uca-az" for Azerbaijani projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) (owner: 10NMW03)
[20:39:32] <Nemoralis>	 https://www.mediawiki.org/wiki/Manual:UpdateCollation.php
[20:39:48] <logmsgbot>	 !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1153722|Set category collation to "uca-az" for Azerbaijani projects (T395896)]]
[20:39:54] <stashbot>	 T395896: Set category collation for Azerbaijani projects - https://phabricator.wikimedia.org/T395896
[20:41:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.61s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:42:06] <logmsgbot>	 !log ebernhardson@deploy1003 nmw03, ebernhardson: Backport for [[gerrit:1153722|Set category collation to "uca-az" for Azerbaijani projects (T395896)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:42:24] <ebernhardson>	 Nemoralis: can you verify?
[20:42:55] <Nemoralis>	 sure, but I believe this will work after maintenance script
[20:43:46] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Release v0.10.1 - volans@cumin2002
[20:43:53] <ebernhardson>	 ok, continue with sync i suppose?
[20:44:02] <Nemoralis>	 yep
[20:44:05] <logmsgbot>	 !log ebernhardson@deploy1003 nmw03, ebernhardson: Continuing with sync
[20:44:25] <Nemoralis>	 you will need to run updateCollation for 4 wikis
[20:44:35] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Release v0.10.1 - volans@cumin2002
[20:44:53] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.*
[20:45:33] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:46:00] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:46:33] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:48:32] <wikibugs>	 (03PS1) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889)
[20:49:41] <wikibugs>	 (03PS2) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889)
[20:50:55] <logmsgbot>	 !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153722|Set category collation to "uca-az" for Azerbaijani projects (T395896)]] (duration: 11m 06s)
[20:51:00] <stashbot>	 T395896: Set category collation for Azerbaijani projects - https://phabricator.wikimedia.org/T395896
[20:51:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[20:52:08] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1161042/6014/phab1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[20:52:16] <ebernhardson>	 !log running updateCollation.php for azwikibooks, azwikiquote, azwikisource, and azwiktionary
[20:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson)
[20:53:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudcephosd200[567]-dev service implementation - https://phabricator.wikimedia.org/T397237#10929828 (10Andrew)
[20:54:06] <logmsgbot>	 jhancock@cumin1003 provision (PID 2167210) is awaiting input
[20:54:16] <wikibugs>	 (03Merged) 10jenkins-bot: Use discovery dns for elasticsearch read traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson)
[20:54:36] <logmsgbot>	 jhancock@cumin1003 provision (PID 2167233) is awaiting input
[20:54:38] <logmsgbot>	 !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]]
[20:54:45] <stashbot>	 T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553
[20:56:52] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:57:17] <logmsgbot>	 jhancock@cumin1003 provision (PID 2167269) is awaiting input
[20:57:56] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Continuing with sync
[20:58:44] <wikibugs>	 (03PS3) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889)
[20:59:25] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:59:26] <ebernhardson>	 Nemoralis: maint script is complete on the 4 wikis
[20:59:27] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:59:29] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:59:59] <ebernhardson>	 !log updateCollation.php for azwikibooks, azwikiquote, azwikisource, and azwiktionary completed
[21:00:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:05] <wikibugs>	 (03PS4) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889)
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T2100)
[21:00:15] <Nemoralis>	 ebernhardson: just tested, works fine
[21:00:17] <Nemoralis>	 thanks!
[21:00:22] <ebernhardson>	 awesome!
[21:01:07] <dancy>	 If all deployments are done, I'm going to update scap one more time
[21:01:16] <ebernhardson>	 it's still shipping one more
[21:01:19] <dancy>	 ok
[21:01:19] <ebernhardson>	 but almost done
[21:02:48] <wikibugs>	 (03PS5) 10Dzahn: phabricator::migration: add parameters phabdir,storage_user,deploy_user [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889)
[21:03:10] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[21:03:57] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1161042/6016/phab1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[21:04:32] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:04:52] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "this role is applied only on the "next phab" machine. used for DB upgrade test and PHP8 test." [puppet] - 10https://gerrit.wikimedia.org/r/1161042 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[21:04:52] <logmsgbot>	 !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]] (duration: 10m 14s)
[21:04:57] <stashbot>	 T143553: Switching search traffic between datacenters should be faster - https://phabricator.wikimedia.org/T143553
[21:05:12] <ebernhardson>	 dancy: all yours now
[21:05:16] <dancy>	 Thanks!
[21:05:25] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.180.0" for 2 host(s)
[21:05:40] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:06:20] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:06:50] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:07:14] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.180.0" completed for 2 hosts
[21:07:29] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:12:21] <wikibugs>	 (03CR) 10Bking: [C:03+1] Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis)
[21:13:03] <wikibugs>	 (03CR) 10Ahmon Dancy: "This should wait until https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/827 is done and the latest scap is deployed to beta" [puppet] - 10https://gerrit.wikimedia.org/r/1155318 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy)
[21:17:08] <logmsgbot>	 jhancock@cumin1003 provision (PID 2169849) is awaiting input
[21:17:37] <logmsgbot>	 jhancock@cumin1003 provision (PID 2169899) is awaiting input
[21:18:16] <logmsgbot>	 jhancock@cumin1003 provision (PID 2169942) is awaiting input
[21:19:02] <wikibugs>	 (03PS1) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845)
[21:19:58] <wikibugs>	 (03PS2) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845)
[21:22:24] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:22:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[21:22:30] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:22:42] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:23:55] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005.codfw.wmnet with OS bullseye
[21:24:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10929935 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2005.codfw.wmnet with OS bul...
[21:27:28] <wikibugs>	 (03PS1) 10Dzahn: phabricator::migration: puppetize password for testdb in script-vars [puppet] - 10https://gerrit.wikimedia.org/r/1161048 (https://phabricator.wikimedia.org/T390034)
[21:28:55] <wikibugs>	 (03PS2) 10Dzahn: phabricator::migration: puppetize password for testdb in script-vars [puppet] - 10https://gerrit.wikimedia.org/r/1161048 (https://phabricator.wikimedia.org/T390034)
[21:29:13] <wikibugs>	 (03PS16) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[21:30:49] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[21:31:50] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[21:32:56] <wikibugs>	 (03PS1) 10Clare Ming: xLab: Deploy v0.7.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161050 (https://phabricator.wikimedia.org/T372952)
[21:34:19] <wikibugs>	 (03PS1) 10Dzahn: add fake password for phab test db admin user [labs/private] - 10https://gerrit.wikimedia.org/r/1161051 (https://phabricator.wikimedia.org/T377889)
[21:34:43] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] add fake password for phab test db admin user [labs/private] - 10https://gerrit.wikimedia.org/r/1161051 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[21:34:55] <wikibugs>	 (03PS1) 10Clare Ming: xLab: Deploy v0.7.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161052 (https://phabricator.wikimedia.org/T372952)
[21:35:07] <wikibugs>	 (03PS2) 10Dzahn: add fake password for phab test db admin user [labs/private] - 10https://gerrit.wikimedia.org/r/1161051 (https://phabricator.wikimedia.org/T377889)
[21:35:17] <wikibugs>	 (03CR) 10Dzahn: [V:03+2] add fake password for phab test db admin user [labs/private] - 10https://gerrit.wikimedia.org/r/1161051 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[21:36:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[21:36:17] <wikibugs>	 (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/labs/private/+/1161051" [puppet] - 10https://gerrit.wikimedia.org/r/1161048 (https://phabricator.wikimedia.org/T390034) (owner: 10Dzahn)
[21:36:34] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.*
[21:36:43] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1161048/6018/phab1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1161048 (https://phabricator.wikimedia.org/T390034) (owner: 10Dzahn)
[21:37:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.291s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:39:52] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3072.*
[21:39:54] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3072.*
[21:40:07] <brett>	 !log Depooling cp3072 to upgrade bios
[21:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:44] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp3072.esams.wmnet with reason: BIOS upgrades
[21:42:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.291s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:45:11] <wikibugs>	 (03PS1) 10Dzahn: phabricator::migration: fix variable name used for testdb storage pass [puppet] - 10https://gerrit.wikimedia.org/r/1161053 (https://phabricator.wikimedia.org/T377889)
[21:45:41] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] phabricator::migration: fix variable name used for testdb storage pass [puppet] - 10https://gerrit.wikimedia.org/r/1161053 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[21:46:03] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] phabricator::migration: fix variable name used for testdb storage pass [puppet] - 10https://gerrit.wikimedia.org/r/1161053 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[21:46:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[21:46:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10930043 (10Jclark-ctr) @Marostegui     Looks like the Seed server was delivered Jun 12th to the data center {F62381874}  @VRiley-WMF  this would be the dell that you placed in the new cage.  the P.O o...
[21:49:22] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10930054 (10Jclark-ctr) a:03Jclark-ctr
[21:53:39] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161050 (https://phabricator.wikimedia.org/T372952) (owner: 10Clare Ming)
[21:55:14] <wikibugs>	 (03PS1) 10Dzahn: phabricator::migration: fix /srv/phab symlink, /srv/repos dir [puppet] - 10https://gerrit.wikimedia.org/r/1161059 (https://phabricator.wikimedia.org/T377889)
[21:55:28] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161052 (https://phabricator.wikimedia.org/T372952) (owner: 10Clare Ming)
[21:55:51] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] phabricator::migration: fix /srv/phab symlink, /srv/repos dir [puppet] - 10https://gerrit.wikimedia.org/r/1161059 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[21:56:39] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v0.7.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161050 (https://phabricator.wikimedia.org/T372952) (owner: 10Clare Ming)
[21:57:18] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v0.7.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161052 (https://phabricator.wikimedia.org/T372952) (owner: 10Clare Ming)
[21:59:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:35] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250618T2200)
[22:01:06] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:04:56] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp3072.esams.wmnet
[22:04:57] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp3072.esams.wmnet
[22:05:03] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3072.*
[22:08:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1196 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:09:51] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-codfw and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581)
[22:09:56] <stashbot>	 T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581
[22:12:09] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:12:30] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: no-op deploy to phab1005
[22:12:37] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: no-op deploy to phab1005 (duration: 00m 07s)
[22:14:20] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@6af4bb7]: merge-phorge-2024.35 deploy to phab1005 (T390034)
[22:14:24] <stashbot>	 T390034: Prepare a database test for m3 - https://phabricator.wikimedia.org/T390034
[22:14:46] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@6af4bb7]: merge-phorge-2024.35 deploy to phab1005 (T390034) (duration: 00m 26s)
[22:19:26] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[22:19:51] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[22:23:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:25:27] <Jake_Park>	 Hi team
[22:25:31] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:25:39] <Jake_Park>	 The rename account task seems to be stuck
[22:26:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1196 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:27:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:27:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/0 (Peering: DE-CIX (DXDB:NAS:173434 MAC filter) {#D0067}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:44:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[22:47:32] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for aux-k8s-worker100[6-9] - jclark@cumin1002"
[22:47:36] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for aux-k8s-worker100[6-9] - jclark@cumin1002"
[22:47:36] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:51:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on ms-fe1016:9290 - https://phabricator.wikimedia.org/T397261#10930169 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable
[22:52:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:52:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/0 (Peering: DE-CIX (DXDB:NAS:173434 MAC filter) {#D0067}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:52:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker1006
[22:54:01] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker1006
[22:54:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker1007
[22:55:13] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker1007
[22:55:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker1008
[22:56:36] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker1008
[22:56:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host aux-k8s-worker1009
[22:57:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10930176 (10Jclark-ctr)
[22:57:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host aux-k8s-worker1009
[22:58:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (2001:7f8:36::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:02:54] <jinxer-wm>	 RESOLVED: TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (2001:7f8:36::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDo
[23:18:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:23:57] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:24:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:24:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:25:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:26:15] <swfrench-wmf>	 !incidents
[23:26:16] <sirenbot>	 6365 (UNACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[23:26:30] <swfrench-wmf>	 !ack 6365
[23:26:30] <sirenbot>	 6365 (ACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[23:27:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:27:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:27:14] <swfrench-wmf>	 that bodes well
[23:28:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:38:37] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161074
[23:38:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161074 (owner: 10TrainBranchBot)
[23:45:24] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:50:27] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:50:43] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:51:51] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161074 (owner: 10TrainBranchBot)
[23:57:17] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:57:35] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.380 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring