[00:17:58] <wikibugs>	 (PS1) DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1224833 (https://phabricator.wikimedia.org/T219903)
[00:18:14] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:18:17] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:18:19] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[00:18:22] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[00:18:23] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[00:18:25] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[00:20:16] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1224833 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[00:22:20] <wikibugs>	 (Merged) jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1224833 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[00:22:43] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:22:46] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:22:48] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[00:22:51] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[00:22:52] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[00:22:54] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[00:23:51] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:24:09] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:24:11] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[00:24:27] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[00:24:29] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[00:24:45] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[00:25:33] <wikibugs>	 (CR) Arlolra: [C:+1] Increase PRV percentage on fawiki/kowiki/azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108) (owner: C. Scott Ananian)
[00:40:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224836
[00:40:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224836 (owner: TrainBranchBot)
[00:43:52] <wikibugs>	 (PS1) Arlolra: Support incremental roll out of Parsoid Read Views [extensions/ParserMigration] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1224837 (https://phabricator.wikimedia.org/T391881)
[00:52:14] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224836 (owner: TrainBranchBot)
[01:00:41] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:07:49] <wikibugs>	 (PS1) Aaron Schulz: rest-gateway: changed REST sandbox rerouting to redirection [deployment-charts] - https://gerrit.wikimedia.org/r/1224838 (https://phabricator.wikimedia.org/T396807)
[01:10:23] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224839
[01:10:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224839 (owner: TrainBranchBot)
[01:18:59] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 18m 17s)
[01:23:30] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1223261 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz)
[01:23:41] <wikibugs>	 SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11506021 (Papaul) ` Hi Papaul,  During testing in our lab we noticed that SPT (Spanning Tree Protocol) packets are being counted as “in-error” packe...
[01:24:09] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:33:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224839 (owner: TrainBranchBot)
[01:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:17:11] <wikibugs>	 (PS1) Andrew Bogott: cloudbackup: use default postgres dir for data files [puppet] - https://gerrit.wikimedia.org/r/1224840
[02:22:14] <wikibugs>	 (PS2) Andrew Bogott: cloudbackup: use default postgres dir for data files [puppet] - https://gerrit.wikimedia.org/r/1224840
[02:24:09] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:25:17] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudbackup: use default postgres dir for data files [puppet] - https://gerrit.wikimedia.org/r/1224840 (owner: Andrew Bogott)
[02:25:59] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup1003']
[02:27:28] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie
[02:32:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86895 and previous config saved to /var/cache/conftool/dbconfig/20260109-023246-marostegui.json
[02:32:51] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:32:51] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:42:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P86896 and previous config saved to /var/cache/conftool/dbconfig/20260109-024254-marostegui.json
[02:53:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P86897 and previous config saved to /var/cache/conftool/dbconfig/20260109-025303-marostegui.json
[02:53:55] <logmsgbot>	 andrew@cumin2002 reimage (PID 2869160) is awaiting input
[03:03:06] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup2003.codfw.wmnet with OS trixie
[03:03:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86898 and previous config saved to /var/cache/conftool/dbconfig/20260109-030311-marostegui.json
[03:03:17] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[03:03:17] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[03:03:28] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[03:03:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86899 and previous config saved to /var/cache/conftool/dbconfig/20260109-030336-marostegui.json
[03:03:42] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie
[03:21:40] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage
[03:28:47] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage
[03:45:30] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup2003.codfw.wmnet with OS trixie
[04:19:08] <icinga-wm>	 PROBLEM - Host 195.200.68.37 is DOWN: CRITICAL - Time to live exceeded (195.200.68.37)
[04:19:30] <icinga-wm>	 RECOVERY - Host 195.200.68.37 is UP: PING OK - Packet loss = 0%, RTA = 137.53 ms
[04:28:44] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[04:28:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86900 and previous config saved to /var/cache/conftool/dbconfig/20260109-042845-marostegui.json
[04:28:51] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[04:28:51] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[04:30:06] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:31:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[04:34:10] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:38:54] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P86901 and previous config saved to /var/cache/conftool/dbconfig/20260109-043854-marostegui.json
[04:49:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P86902 and previous config saved to /var/cache/conftool/dbconfig/20260109-044902-marostegui.json
[04:59:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86903 and previous config saved to /var/cache/conftool/dbconfig/20260109-045910-marostegui.json
[04:59:16] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[04:59:16] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[04:59:27] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[04:59:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86904 and previous config saved to /var/cache/conftool/dbconfig/20260109-045935-marostegui.json
[05:09:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:09] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:34:10] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:50:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:51:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:09:09] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:18:28] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[06:23:15] <wikibugs>	 SRE, Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11506171 (Joe) @Ragesoss I see you still get blocked from time to time; I will add an exception, per https://wikitech.wikimedia.org/wiki/Robot_policy#What_to_do_if_thes...
[06:24:10] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:31:46] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance
[06:31:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T413525)', diff saved to https://phabricator.wikimedia.org/P86905 and previous config saved to /var/cache/conftool/dbconfig/20260109-063154-marostegui.json
[06:31:58] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[06:36:24] <wikibugs>	 (PS1) Marostegui: installserver: Do not format db2249 [puppet] - https://gerrit.wikimedia.org/r/1224855 (https://phabricator.wikimedia.org/T411570)
[06:37:02] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1022:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:38:15] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not format db2249 [puppet] - https://gerrit.wikimedia.org/r/1224855 (https://phabricator.wikimedia.org/T411570) (owner: Marostegui)
[06:40:47] <wikibugs>	 (PS1) Marostegui: installserver: Fix duplicate reuse-db-efi.cfg [puppet] - https://gerrit.wikimedia.org/r/1224856 (https://phabricator.wikimedia.org/T411570)
[06:42:02] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:45:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T413525)', diff saved to https://phabricator.wikimedia.org/P86906 and previous config saved to /var/cache/conftool/dbconfig/20260109-064541-marostegui.json
[06:45:45] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[06:47:02] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:52:02] <jinxer-wm>	 FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:52:44] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157 (kimpham) NEW
[06:55:50] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P86907 and previous config saved to /var/cache/conftool/dbconfig/20260109-065549-marostegui.json
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260109T0700)
[07:01:24] <wikibugs>	 (PS1) Dzahn: eventgate-analytics-external: add wikipedia25.org [deployment-charts] - https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592)
[07:03:24] <wikibugs>	 SRE, Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11506217 (Joe) Open→Resolved p:Triage→High a:Joe Exception added. I allowed a generous amount of requests; please let us know if you still run i...
[07:05:58] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P86908 and previous config saved to /var/cache/conftool/dbconfig/20260109-070558-marostegui.json
[07:07:41] <jinxer-wm>	 FIRING: [20x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[07:12:00] <wikibugs>	 (CR) Dzahn: [C:-1] "as a reminder for later - once it's ready - need to define a useful string" [puppet] - https://gerrit.wikimedia.org/r/1224575 (owner: Dzahn)
[07:12:41] <jinxer-wm>	 FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[07:16:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T413525)', diff saved to https://phabricator.wikimedia.org/P86909 and previous config saved to /var/cache/conftool/dbconfig/20260109-071606-marostegui.json
[07:16:12] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[07:16:13] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance
[07:16:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T413525)', diff saved to https://phabricator.wikimedia.org/P86910 and previous config saved to /var/cache/conftool/dbconfig/20260109-071621-marostegui.json
[07:16:22] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11506239 (Dzahn) Hello @kimpham   please send an email from your WMDE address to [[ https://gerrit.wikimedia.org/r/1224858 | Katie Francis ]] of WMF Legal and let her know you would like to start...
[07:17:02] <jinxer-wm>	 FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:17:16] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11506242 (Dzahn) @WMDE-leszek Could you please approve? Thank you.
[07:20:07] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11506244 (Dzahn) Hello @Martyn.ranyard Please send an email from your WMDE address to [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Francis ]] of WMF Legal and let her know you...
[07:24:53] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11506246 (Dzahn) a:DannyS712
[07:30:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T413525)', diff saved to https://phabricator.wikimedia.org/P86911 and previous config saved to /var/cache/conftool/dbconfig/20260109-073008-marostegui.json
[07:30:12] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[07:30:31] <wikibugs>	 (CR) Slyngshede: [C:+2] Account linking: hide message box when linked [software/bitu] - https://gerrit.wikimedia.org/r/1224660 (owner: Slyngshede)
[07:30:38] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11506251 (Dzahn) Since the request is described as "access and update dashboards", Hadoop is not mentioned and per:  https://wikitech.wikimedia.org/wiki/Data_...
[07:33:03] <wikibugs>	 (Merged) jenkins-bot: Account linking: hide message box when linked [software/bitu] - https://gerrit.wikimedia.org/r/1224660 (owner: Slyngshede)
[07:40:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P86912 and previous config saved to /var/cache/conftool/dbconfig/20260109-074017-marostegui.json
[07:42:02] <jinxer-wm>	 FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:44:10] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:50:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P86913 and previous config saved to /var/cache/conftool/dbconfig/20260109-075025-marostegui.json
[07:52:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:54:10] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:55:28] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[07:56:41] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1224856 (https://phabricator.wikimedia.org/T411570) (owner: Marostegui)
[07:57:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:59:10] <jinxer-wm>	 FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260109T0800)
[08:00:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T413525)', diff saved to https://phabricator.wikimedia.org/P86914 and previous config saved to /var/cache/conftool/dbconfig/20260109-080033-marostegui.json
[08:00:37] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:00:50] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance
[08:00:59] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T413525)', diff saved to https://phabricator.wikimedia.org/P86915 and previous config saved to /var/cache/conftool/dbconfig/20260109-080058-marostegui.json
[08:02:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:06:00] <wikibugs>	 (PS2) Muehlenhoff: Remove Puppet 5 settings from late_command.sh [puppet] - https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798)
[08:08:06] <wikibugs>	 (CR) Muehlenhoff: [C:+1] sre.hosts.reimage: remove puppet 5 support and default to 7 (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: Elukey)
[08:10:06] <wikibugs>	 (PS2) Muehlenhoff: Rename stale_certs_exporter and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798)
[08:10:33] <wikibugs>	 (PS1) Dzahn: admin: upgrade tgritschacher to analytics-privatedata without shell [puppet] - https://gerrit.wikimedia.org/r/1224862 (https://phabricator.wikimedia.org/T414061)
[08:12:31] <wikibugs>	 (PS2) Dzahn: admin: upgrade tgritschacher to analytics-privatedata without shell [puppet] - https://gerrit.wikimedia.org/r/1224862 (https://phabricator.wikimedia.org/T414061)
[08:13:19] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:13:57] <wikibugs>	 (PS3) Muehlenhoff: Rename stale_certs_exporter and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798)
[08:15:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:17:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T413525)', diff saved to https://phabricator.wikimedia.org/P86916 and previous config saved to /var/cache/conftool/dbconfig/20260109-081708-marostegui.json
[08:17:12] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:18:18] <wikibugs>	 (CR) Elukey: Remove Puppet 5 settings from late_command.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:22:02] <jinxer-wm>	 FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:24:15] <wikibugs>	 (CR) Elukey: Remove Puppet 5 settings from late_command.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:25:20] <wikibugs>	 (CR) Muehlenhoff: Remove Puppet 5 settings from late_command.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:26:08] <wikibugs>	 (CR) Muehlenhoff: Remove Puppet 5 settings from late_command.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:27:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P86917 and previous config saved to /var/cache/conftool/dbconfig/20260109-082717-marostegui.json
[08:29:44] <wikibugs>	 (03CR) 10Krinkle: Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders)
[08:30:18] <Krinkle>	 Dreamy_Jazz: in case you know ^ :D
[08:30:23] <wikibugs>	 (03CR) 10Elukey: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[08:30:39] <gehel>	 !log restarting blazegraph on wdqs-main@eqiad - high thread count
[08:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:46] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1018 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:32:46] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1018 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:33:34] <wikibugs>	 (03CR) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[08:33:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:34:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:34:10] <jinxer-wm>	 FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:34:10] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:34:21] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[08:35:06] <jinxer-wm>	 FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:35:28] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:37:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:37:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P86918 and previous config saved to /var/cache/conftool/dbconfig/20260109-083725-marostegui.json
[08:38:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:44:10] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:45:06] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:47:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T413525)', diff saved to https://phabricator.wikimedia.org/P86919 and previous config saved to /var/cache/conftool/dbconfig/20260109-084733-marostegui.json
[08:47:37] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:47:50] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2201.codfw.wmnet with reason: Maintenance
[08:48:27] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2211.codfw.wmnet with reason: Maintenance
[08:48:35] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T413525)', diff saved to https://phabricator.wikimedia.org/P86920 and previous config saved to /var/cache/conftool/dbconfig/20260109-084834-marostegui.json
[08:49:10] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:49:54] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798)
[08:52:02] <jinxer-wm>	 FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:52:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:55:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:57:02] <jinxer-wm>	 FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:59:31] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] "Looks fine, let me know when you want to deploy." [puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) (owner: 10Tchanders)
[09:00:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T413525)', diff saved to https://phabricator.wikimedia.org/P86921 and previous config saved to /var/cache/conftool/dbconfig/20260109-090050-marostegui.json
[09:00:54] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[09:05:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:06:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:06:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove dead Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1224889
[09:07:02] <jinxer-wm>	 FIRING: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:07:18] <urbanecm>	 !log [urbanecm@deploy2002 ~]$ kubectl delete job/growthexperiments-updatementeedata-s1-29460615 # T414167
[09:07:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[09:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:21] <stashbot>	 T414167: Do not alert about a failed cron job when logs are already discarded - https://phabricator.wikimedia.org/T414167
[09:07:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove dead Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1224891
[09:09:11] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: Update image for revise-tone-task-generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224721 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou)
[09:10:59] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P86922 and previous config saved to /var/cache/conftool/dbconfig/20260109-091058-marostegui.json
[09:12:52] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506446 (10elukey) Tried to replicate Alex's test with the following:  On registry1004 (not serving live traffic):  - `sudo iptables -A INPUT -p tcp -s 10.192.32.7...
[09:14:10] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:15:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Fix duplicate reuse-db-efi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1224856 (https://phabricator.wikimedia.org/T411570) (owner: 10Marostegui)
[09:19:10] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:21:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P86923 and previous config saved to /var/cache/conftool/dbconfig/20260109-092105-marostegui.json
[09:22:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:24:00] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11506477 (10WMDE-leszek)
[09:24:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:24:10] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:24:16] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11506478 (10WMDE-leszek) I approve this request on WMDE behalf. Thank you
[09:27:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:31:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T413525)', diff saved to https://phabricator.wikimedia.org/P86924 and previous config saved to /var/cache/conftool/dbconfig/20260109-093114-marostegui.json
[09:31:17] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[09:31:30] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2213.codfw.wmnet with reason: Maintenance
[09:31:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2213 (T413525)', diff saved to https://phabricator.wikimedia.org/P86925 and previous config saved to /var/cache/conftool/dbconfig/20260109-093138-marostegui.json
[09:33:47] <wikibugs>	 (03CR) 10Elukey: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[09:35:31] <wikibugs>	 (03PS4) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798)
[09:37:20] <wikibugs>	 (03PS1) 10Gmodena: sup: register rdf updater with wdp [alerts] - 10https://gerrit.wikimedia.org/r/1224893 (https://phabricator.wikimedia.org/T414169)
[09:37:27] <wikibugs>	 (03PS1) 10DCausse: airflow-search: add enterprise extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066)
[09:37:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11506526 (10JMeybohm)
[09:39:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] airflow-search: add enterprise extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse)
[09:39:23] <wikibugs>	 (03CR) 10DCausse: "@Ben/Balthazar: this is mainly up for discussion and I'm not yet clear on how to make use of this secret from the airflow side" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse)
[09:41:40] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 5470.19 ms
[09:44:10] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:44:17] <wikibugs>	 (03PS12) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[09:45:06] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[09:45:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[09:45:59] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T413525)', diff saved to https://phabricator.wikimedia.org/P86926 and previous config saved to /var/cache/conftool/dbconfig/20260109-094558-marostegui.json
[09:46:02] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[09:46:18] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[09:46:20] <wikibugs>	 (03CR) 10Federico Ceratto: "Ok, I updated parsercache logging, see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1215575/12/tests/unit/sre/mysql/parsercache" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[09:46:23] <wikibugs>	 (03CR) 10Trueg: [C:03+2] sup: register rdf updater with wdp [alerts] - 10https://gerrit.wikimedia.org/r/1224893 (https://phabricator.wikimedia.org/T414169) (owner: 10Gmodena)
[09:47:05] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11506544 (10ABran-WMF)
[09:47:54] <wikibugs>	 (03CR) 10Trueg: [V:03+2 C:03+2] sup: register rdf updater with wdp [alerts] - 10https://gerrit.wikimedia.org/r/1224893 (https://phabricator.wikimedia.org/T414169) (owner: 10Gmodena)
[09:48:00] <wikibugs>	 (03Merged) 10jenkins-bot: sup: register rdf updater with wdp [alerts] - 10https://gerrit.wikimedia.org/r/1224893 (https://phabricator.wikimedia.org/T414169) (owner: 10Gmodena)
[09:49:10] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:49:13] <wikibugs>	 (03CR) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[09:50:07] <wikibugs>	 (03CR) 10Marostegui: "Thanks - I am wondering why we force people to run a depool inside of a screen. Isn't it a bit overkill?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[09:50:27] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool es1049: test
[09:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:50:54] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool es1049: test
[09:51:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool es1049: test
[09:51:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11506551 (10cmooney) As you would expect the unmanaged mgmt switches do send STP frames ` A:cmooney@lswtest-d8-eqiad# bash network-instance mgmt tcpdu...
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:51:41] <wikibugs>	 (03CR) 10Marostegui: "Ah I guess it is because it is required for the repool." [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[09:52:10] <wikibugs>	 (03CR) 10Marostegui: "Is it easy to require it only for the repooling and NOT for the depooling?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[09:52:52] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders)
[09:53:30] <Dreamy_Jazz>	 Krinkle: Replied to your comment, let me know if you want to discuss it on IRC to avoid the slower replies that tend to happen on Gerrit :D
[09:56:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P86929 and previous config saved to /var/cache/conftool/dbconfig/20260109-095606-marostegui.json
[09:56:23] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506561 (10elukey) The problem seems to be an exact replica of https://github.com/distribution/distribution/issues/2225, so I tried to add the following snippet to the r...
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:02] <jinxer-wm>	 FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:57:39] <wikibugs>	 (03CR) 10Majavah: [C:03+1] Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:02:19] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove dead Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1224889 (owner: 10Muehlenhoff)
[10:06:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P86930 and previous config saved to /var/cache/conftool/dbconfig/20260109-100614-marostegui.json
[10:08:01] <wikibugs>	 (03PS1) 10Slyngshede: P:cache::haproxy: check existance of mmdb files [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111)
[10:09:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:cache::haproxy: check existance of mmdb files [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111) (owner: 10Slyngshede)
[10:10:09] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506589 (10elukey) The only thing that I found in the docker distribution logs was:  ` Jan 09 09:52:25 registry1004 docker-registry[676]: time="2026-01-09T09:52...
[10:12:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:16:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T413525)', diff saved to https://phabricator.wikimedia.org/P86932 and previous config saved to /var/cache/conftool/dbconfig/20260109-101622-marostegui.json
[10:16:26] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[10:16:40] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2223.codfw.wmnet with reason: Maintenance
[10:16:48] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T413525)', diff saved to https://phabricator.wikimedia.org/P86933 and previous config saved to /var/cache/conftool/dbconfig/20260109-101648-marostegui.json
[10:17:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:19:18] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T413525)', diff saved to https://phabricator.wikimedia.org/P86934 and previous config saved to /var/cache/conftool/dbconfig/20260109-101917-marostegui.json
[10:19:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove dead Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1224889 (owner: 10Muehlenhoff)
[10:19:50] <wikibugs>	 (03PS1) 10Jelto: varnish: add wikipedia25 frontend vcl. [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592)
[10:21:32] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "ooooh! great find:)  that looks like it's needed indeed. another special case because it's not just a wikimedia.org sub" [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto)
[10:24:10] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:24:14] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:24:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] Remove spurious 'diff' file [alerts] - 10https://gerrit.wikimedia.org/r/1224585 (owner: 10Filippo Giunchedi)
[10:29:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P86936 and previous config saved to /var/cache/conftool/dbconfig/20260109-102925-marostegui.json
[10:30:02] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:32:02] <jinxer-wm>	 FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:32:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[10:33:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: README.md: mention Trixie and standalone promtool package [alerts] - 10https://gerrit.wikimedia.org/r/1224907
[10:35:49] <wikibugs>	 (03PS1) 10Dzahn: zookeeper: add ssl.keyStore.passwordPath is TLS is enabled (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119)
[10:36:33] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool es1049: test
[10:36:59] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] varnish: add wikipedia25 frontend vcl. [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto)
[10:38:36] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "especially the part that the else has "return (synth(400, ""));" is convincing that this is the cause:)" [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto)
[10:38:46] <gehel>	 !log depooling / repooling wdqs-main@eqiad servers one by one to allow time to recover and catch up on updates.
[10:38:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:10] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:39:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P86938 and previous config saved to /var/cache/conftool/dbconfig/20260109-103934-marostegui.json
[10:42:02] <jinxer-wm>	 FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:47:41] <jinxer-wm>	 RESOLVED: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:49:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T413525)', diff saved to https://phabricator.wikimedia.org/P86939 and previous config saved to /var/cache/conftool/dbconfig/20260109-104942-marostegui.json
[10:49:46] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[10:49:56] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:50:00] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2228.codfw.wmnet with reason: Maintenance
[10:50:02] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2266.53 ms
[10:50:03] <wikibugs>	 (03CR) 10Federico Ceratto: "Various cookbooks seem to require it by default but it would be easy to do the check only when pooling. The only issue is that even if dep" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[10:50:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T413525)', diff saved to https://phabricator.wikimedia.org/P86940 and previous config saved to /var/cache/conftool/dbconfig/20260109-105008-marostegui.json
[10:50:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[10:51:18] <wikibugs>	 06SRE, 10MediaWiki-Action-API, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506731 (10Xqt)
[10:51:41] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, suggestion inline" [alerts] - 10https://gerrit.wikimedia.org/r/1224907 (owner: 10Filippo Giunchedi)
[10:52:30] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506742 (10elukey) The very interesting thing is that after a few tries I got:  ` elukey@build2001:~$ sudo docker push docker-registry.svc.eqiad.wmnet/test/istio/b...
[10:52:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T413525)', diff saved to https://phabricator.wikimedia.org/P86941 and previous config saved to /var/cache/conftool/dbconfig/20260109-105237-marostegui.json
[10:54:56] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06serviceops, 06Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11506760 (10elukey) To keep archives happy - I am working in T394476 to properly onboard ceph apu...
[10:55:30] <wikibugs>	 (03PS2) 10Filippo Giunchedi: README.md: mention Trixie and standalone promtool package [alerts] - 10https://gerrit.wikimedia.org/r/1224907
[10:55:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: README.md: mention Trixie and standalone promtool package (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1224907 (owner: 10Filippo Giunchedi)
[10:55:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] "Thank you! Added your suggestion" [alerts] - 10https://gerrit.wikimedia.org/r/1224907 (owner: 10Filippo Giunchedi)
[10:55:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] README.md: mention Trixie and standalone promtool package [alerts] - 10https://gerrit.wikimedia.org/r/1224907 (owner: 10Filippo Giunchedi)
[10:59:56] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:01:07] <wikibugs>	 (03PS1) 10Jelto: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224913 (https://phabricator.wikimedia.org/T408592)
[11:02:02] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:02:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[11:02:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P86942 and previous config saved to /var/cache/conftool/dbconfig/20260109-110245-marostegui.json
[11:02:53] <mutante>	 /win/win 8
[11:04:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:04:28] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:07:38] <icinga-wm>	 PROBLEM - Memcached on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached
[11:08:28] <icinga-wm>	 RECOVERY - Memcached on titan1002 is OK: TCP OK - 0.010 second response time on 10.64.48.167 port 11211 https://wikitech.wikimedia.org/wiki/Memcached
[11:09:10] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:09:25] <moritzm>	 !log revoked legacy config-master discovery cert T365798
[11:09:26] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224913 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto)
[11:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:09:29] <stashbot>	 T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798
[11:09:50] <wikibugs>	 (03CR) 10Jelto: [C:03+2] miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224913 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto)
[11:10:06] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:11:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:05] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224913 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto)
[11:12:54] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P86943 and previous config saved to /var/cache/conftool/dbconfig/20260109-111254-marostegui.json
[11:13:33] <wikibugs>	 06SRE, 10MediaWiki-Action-API, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506781 (10Xqt)
[11:13:51] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[11:14:11] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[11:14:18] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[11:14:38] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[11:14:47] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[11:15:06] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:15:07] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[11:16:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:17:56] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06serviceops, 06Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11506803 (10elukey)
[11:19:09] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506805 (10taavi)
[11:19:10] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:20:06] <jinxer-wm>	 FIRING: [15x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:22:03] <wikibugs>	 (03PS1) 10Dzahn: point wikipedia25.org to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592)
[11:22:33] <wikibugs>	 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11506822 (10Jelto) We were able to solve the loadbalancer issues and the site is reachable and returns 200 and the correct content. We will d...
[11:22:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] point wikipedia25.org to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[11:23:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T413525)', diff saved to https://phabricator.wikimedia.org/P86944 and previous config saved to /var/cache/conftool/dbconfig/20260109-112302-marostegui.json
[11:23:06] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[11:24:10] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:25:06] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:27:02] <jinxer-wm>	 FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:29:10] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:30:48] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506858 (10elukey) I had a chat with Matthew about apus, and they confirmed that there is no explicit rate/bw limit in place for the docker-registry account. I obs...
[11:30:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179 (10MoritzMuehlenhoff) 03NEW
[11:31:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11506869 (10MoritzMuehlenhoff) p:05Triage→03Medium
[11:32:19] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506876 (10Tgr) What user agent are you using?
[11:33:45] <wikibugs>	 (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[11:33:59] <wikibugs>	 (03CR) 10Dzahn: "this is to be reverted next week" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[11:34:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:44:33] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506948 (10Joe) I'm not sure having your CI depend on external resources is a good policy; I encourage you to change that long-term, but anyways, we don't want to...
[11:49:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:54:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:57:54] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506993 (10Xqt) >>! In T414173#11506876, @Tgr wrote: > What user agent are you using?  `pwb.py version` for a sample test task gives ` Pywikibot: [https] wikimedi...
[12:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260109T0800)
[12:00:06] <jouncebot>	 jelto, arnoldokoth, mutante, and arnaudb: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260109T1200).
[12:01:55] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: Update image for revise-tone-task-generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224721 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou)
[12:02:42] <wikibugs>	 10SRE-Access-Requests: Grafana and Logstash access for trueg - https://phabricator.wikimedia.org/T414187 (10trueg) 03NEW
[12:03:14] <mutante>	 that gitlab version upgrade already happened. nothing now.
[12:03:41] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Update image for revise-tone-task-generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224721 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou)
[12:04:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:08:32] <wikibugs>	 10SRE-Access-Requests: Requesting access to  Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507065 (10trueg)
[12:08:48] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507067 (10Xqt) I also get that blocker for my normal bot during running redirect.py script:  ` >>> Talk:Licentiate (Pontifical Degree) <<<    Links to: [[en:Talk...
[12:11:00] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance
[12:13:25] <wikibugs>	 10SRE-Access-Requests: Requesting access to  Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507110 (10trueg)
[12:14:00] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[12:16:24] <marostegui>	 !log Deploy schema change on s7 primary master T414178
[12:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:27] <stashbot>	 T414178: Remove default value from gb_by_wiki in globalblocks table on WMF wikis - https://phabricator.wikimedia.org/T414178
[12:16:45] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[12:19:02] <marostegui>	 !log Deploy schema change on s6 primary master T414183
[12:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:05] <stashbot>	 T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183
[12:19:44] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507159 (10elukey) Since having nginx is not really needed for this test, I went back to testing with a direct push to registry1004.eqiad.wmnet:5002:  ` elukey@bui...
[12:21:18] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[12:21:38] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[12:21:42] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance
[12:21:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T413525)', diff saved to https://phabricator.wikimedia.org/P86946 and previous config saved to /var/cache/conftool/dbconfig/20260109-122145-marostegui.json
[12:21:49] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[12:21:58] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T413525)', diff saved to https://phabricator.wikimedia.org/P86947 and previous config saved to /var/cache/conftool/dbconfig/20260109-122157-marostegui.json
[12:22:06] <wikibugs>	 10SRE-Access-Requests: Requesting access to  Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507167 (10trueg)
[12:23:47] <marostegui>	 !log Deploy schema change on s2 primary master T414183
[12:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192 (10trueg) 03NEW
[12:27:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to  Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507189 (10trueg)
[12:30:49] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636)
[12:30:58] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] revert-risk: Deploy on prod and staging new model version for both language-agnosting and multingual. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224604 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis)
[12:32:32] <marostegui>	 !log Deploy schema change on s7 primary master T414183
[12:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:35] <stashbot>	 T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183
[12:34:13] <wikibugs>	 (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga)
[12:36:11] <wikibugs>	 (03CR) 10Daniel Kinzler: "is this for access from the outside or from within our network? We don't need access to redioscope from the outside..." [dns] - 10https://gerrit.wikimedia.org/r/1224652 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert)
[12:37:20] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin: upgrade tgritschacher to analytics-privatedata without shell [puppet] - 10https://gerrit.wikimedia.org/r/1224862 (https://phabricator.wikimedia.org/T414061) (owner: 10Dzahn)
[12:37:58] <marostegui>	 !log Deploy schema change on s5 primary master T414183
[12:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:01] <stashbot>	 T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183
[12:39:10] <jinxer-wm>	 FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:39:12] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11507254 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Merged the patch prepared by @Dzahn (thanks).
[12:39:43] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11507258 (10JMeybohm)
[12:40:10] <wikibugs>	 (03CR) 10Marostegui: "Yeah maybe, ok, let's leave that for now" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[12:40:59] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507270 (10Joe) This user agent is not compliant with our user-agent policy:  https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy...
[12:48:12] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to  Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507289 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Welcome! Grafana access is granted by having an LDAP account. Please request access to logstash via Wikimedia IDM at http...
[12:49:29] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11507293 (10JMeybohm)
[12:52:50] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T413525)', diff saved to https://phabricator.wikimedia.org/P86949 and previous config saved to /var/cache/conftool/dbconfig/20260109-125250-marostegui.json
[12:52:54] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[12:53:19] <wikibugs>	 (03Abandoned) 10Blake: service: add excluded_services helper function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1224041 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake)
[12:56:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T413525)', diff saved to https://phabricator.wikimedia.org/P86950 and previous config saved to /var/cache/conftool/dbconfig/20260109-125611-marostegui.json
[12:59:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:01:53] <wikibugs>	 (03PS2) 10Slyngshede: P:cache::haproxy: check existance of mmdb files [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111)
[13:02:13] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11507333 (10JMeybohm)
[13:02:48] <marostegui>	 !log Deploy schema change on s1 primary master T414183
[13:02:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:51] <stashbot>	 T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183
[13:02:59] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P86951 and previous config saved to /var/cache/conftool/dbconfig/20260109-130258-marostegui.json
[13:03:23] <marostegui>	 !log Deploy schema change on s8 primary master T414183
[13:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:07] <marostegui>	 !log Deploy schema change on s4 primary master T414183
[13:04:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:05:12] <wikibugs>	 (03CR) 10Tchanders: "This can be deployed any time, from my perspective. Do we need to wait for a puppet window?" [puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) (owner: 10Tchanders)
[13:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P86952 and previous config saved to /var/cache/conftool/dbconfig/20260109-130619-marostegui.json
[13:10:45] <marostegui>	 !log Deploy schema change on s3 primary master (this will take a few hours) T414183
[13:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:48] <stashbot>	 T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183
[13:13:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P86953 and previous config saved to /var/cache/conftool/dbconfig/20260109-131306-marostegui.json
[13:16:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P86954 and previous config saved to /var/cache/conftool/dbconfig/20260109-131628-marostegui.json
[13:17:04] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] revert-risk: Deploy on prod and staging new model version for both language-agnosting and multingual. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224604 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis)
[13:18:05] <wikibugs>	 (03PS1) 10Federico Ceratto: sre.mysql.clone: More uniform logging syntax [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052)
[13:18:22] <wikibugs>	 (03CR) 10Federico Ceratto: sre.mysql.clone: More uniform logging syntax (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto)
[13:18:59] <wikibugs>	 (03Merged) 10jenkins-bot: revert-risk: Deploy on prod and staging new model version for both language-agnosting and multingual. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224604 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis)
[13:19:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11507408 (10JMeybohm) @trueg could you please specify what access level you're requesting/what you need access to (see https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#What_access_...
[13:20:05] <wikibugs>	 (03CR) 10Marostegui: sre.mysql.clone: More uniform logging syntax (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto)
[13:21:19] <wikibugs>	 (03PS2) 10Federico Ceratto: sre.mysql.clone: More uniform logging syntax [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052)
[13:22:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] sre.mysql.clone: More uniform logging syntax [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto)
[13:23:16] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T413525)', diff saved to https://phabricator.wikimedia.org/P86955 and previous config saved to /var/cache/conftool/dbconfig/20260109-132316-marostegui.json
[13:23:20] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[13:23:32] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[13:23:41] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T413525)', diff saved to https://phabricator.wikimedia.org/P86956 and previous config saved to /var/cache/conftool/dbconfig/20260109-132340-marostegui.json
[13:24:10] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:26:03] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm as soon as CI is happy" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[13:26:37] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T413525)', diff saved to https://phabricator.wikimedia.org/P86957 and previous config saved to /var/cache/conftool/dbconfig/20260109-132636-marostegui.json
[13:26:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance
[13:26:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T413525)', diff saved to https://phabricator.wikimedia.org/P86958 and previous config saved to /var/cache/conftool/dbconfig/20260109-132651-marostegui.json
[13:32:06] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507457 (10Xqt) >>! In T414173#11507270, @Joe wrote: > This user agent is not compliat with our user-agent policy: >  > https://foundation.wikimedia.org/wiki/Poli...
[13:32:25] <wikibugs>	 (03PS1) 10Blake: sre.discovery.datacenter: use service registry for exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/1224945 (https://phabricator.wikimedia.org/T412211)
[13:34:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:35:15] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T413525)', diff saved to https://phabricator.wikimedia.org/P86959 and previous config saved to /var/cache/conftool/dbconfig/20260109-133514-marostegui.json
[13:35:18] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[13:39:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:42:41] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507482 (10Fabfur) >>! In T414173#11507457, @Xqt wrote: >>>! In T414173#11507270, @Joe wrote: >> This user agent is not compliat with our user-agent policy: >>  >...
[13:42:46] <wikibugs>	 (03PS2) 10Tchanders: Don't collect CheckUser-specific temp account patrolling metrics on labs [puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101)
[13:44:06] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) (owner: 10Tchanders)
[13:45:05] <wikibugs>	 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 05Goal, 06Release-Engineering-Team (Seen): Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759#11507488 (10WMDE-leszek) Bumping this ticket as I maybe might have an interest in seeing this progressing. I...
[13:45:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P86960 and previous config saved to /var/cache/conftool/dbconfig/20260109-134522-marostegui.json
[13:48:06] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507521 (10Xqt)
[13:52:13] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.clone: More uniform logging syntax [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto)
[13:53:56] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.clone: More uniform logging syntax (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto)
[13:53:58] <wikibugs>	 (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sre.mysql.clone: More uniform logging syntax [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto)
[13:55:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P86961 and previous config saved to /var/cache/conftool/dbconfig/20260109-135531-marostegui.json
[13:55:53] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507526 (10Xqt) @Fabfur: Currently we have this UA:  `version (wikipedia:de; User:Xqtest) Pywikibot/11.0.0.dev10 (g20136) requests/2.32.5 Python/3.13.0.final.0` W...
[13:57:18] <wikibugs>	 (03PS1) 10Dzahn: trafficserver: disable wikipedia25 [puppet] - 10https://gerrit.wikimedia.org/r/1224957 (https://phabricator.wikimedia.org/T408592)
[14:00:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T413525)', diff saved to https://phabricator.wikimedia.org/P86962 and previous config saved to /var/cache/conftool/dbconfig/20260109-140016-marostegui.json
[14:00:20] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[14:05:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T413525)', diff saved to https://phabricator.wikimedia.org/P86963 and previous config saved to /var/cache/conftool/dbconfig/20260109-140539-marostegui.json
[14:05:44] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[14:05:54] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1224957 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[14:05:56] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[14:06:05] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T413525)', diff saved to https://phabricator.wikimedia.org/P86964 and previous config saved to /var/cache/conftool/dbconfig/20260109-140604-marostegui.json
[14:08:05] <wikibugs>	 (03CR) 10Dzahn: "we are using this instead: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224957" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[14:08:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] trafficserver: disable wikipedia25 [puppet] - 10https://gerrit.wikimedia.org/r/1224957 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[14:08:57] <wikibugs>	 (03PS1) 10Dzahn: Revert "trafficserver: disable wikipedia25" [puppet] - 10https://gerrit.wikimedia.org/r/1224959
[14:09:10] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:09:28] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:10:25] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P86965 and previous config saved to /var/cache/conftool/dbconfig/20260109-141024-marostegui.json
[14:10:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86966 and previous config saved to /var/cache/conftool/dbconfig/20260109-141052-marostegui.json
[14:10:57] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[14:10:57] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[14:11:10] <wikibugs>	 (03Abandoned) 10Dzahn: point wikipedia25.org to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[14:11:16] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507562 (10Fabfur) >>! In T414173#11507526, @Xqt wrote: > @Fabfur: Currently we have this UA:  > `version (wikipedia:de; User:Xqtest) Pywikibot/11.0.0.dev10 (g201...
[14:16:15] <wikibugs>	 (03PS2) 10C. Scott Ananian: Increase PRV percentage on fawiki/kowiki/azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108)
[14:17:54] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507588 (10JAnD) >>! In T414173#11507562, @Fabfur wrote: > I think this is better, if you want to add an email as contact you can do it right after the URL, separ...
[14:19:10] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:20:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P86967 and previous config saved to /var/cache/conftool/dbconfig/20260109-142033-marostegui.json
[14:21:00] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P86968 and previous config saved to /var/cache/conftool/dbconfig/20260109-142100-marostegui.json
[14:21:41] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507597 (10Fabfur) >>! In T414173#11507588, @JAnD wrote: >>>! In T414173#11507562, @Fabfur wrote: >> I think this is better, if you want to add an email as contac...
[14:24:10] <jinxer-wm>	 FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:24:14] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:28:47] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1003.eqiad.wmnet with OS trixie
[14:29:10] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:29:58] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup1003']
[14:30:06] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:30:41] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T413525)', diff saved to https://phabricator.wikimedia.org/P86970 and previous config saved to /var/cache/conftool/dbconfig/20260109-143040-marostegui.json
[14:30:44] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[14:30:58] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2189.codfw.wmnet with reason: Maintenance
[14:31:03] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudbackup1003']
[14:31:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T413525)', diff saved to https://phabricator.wikimedia.org/P86971 and previous config saved to /var/cache/conftool/dbconfig/20260109-143105-marostegui.json
[14:33:28] <wikibugs>	 (03PS6) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642)
[14:33:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission pay-lvs1003.frack.eqiad.wmnet and pay-lvs1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T413986#11507633 (10Jclark-ctr)
[14:34:10] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:34:52] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup1003']
[14:34:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto)
[14:35:06] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:35:08] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507634 (10Xqt) @Joe, @Tgr: Could you please consider postponing the newly introduced restriction until the Pywikibot User-Agent has been updated? As far as I kno...
[14:35:21] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudbackup1003']
[14:35:50] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie
[14:39:10] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:39:30] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507647 (10elukey) Retried the same op that led to the HTTP 500 after lunch:  ` elukey@build2001:~$ sudo docker push registry1004.eqiad.wmnet:5002/test/cert-manage...
[14:39:41] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T413525)', diff saved to https://phabricator.wikimedia.org/P86972 and previous config saved to /var/cache/conftool/dbconfig/20260109-143940-marostegui.json
[14:39:44] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[14:41:16] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86973 and previous config saved to /var/cache/conftool/dbconfig/20260109-144116-marostegui.json
[14:41:21] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[14:41:22] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[14:41:33] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[14:41:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on 6 hosts with reason: Maintenance
[14:41:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86974 and previous config saved to /var/cache/conftool/dbconfig/20260109-144151-marostegui.json
[14:43:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission pay-lvs1003.frack.eqiad.wmnet and pay-lvs1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T413986#11507665 (10Jclark-ctr) 05Open→03Resolved
[14:44:57] <wikibugs>	 (03CR) 10Pmiazga: [C:03+1] "LGTM, couple nitpicks, temporarily not being able to test it locally." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler)
[14:45:53] <logmsgbot>	 !log gkyziridis@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[14:46:09] <logmsgbot>	 !log gkyziridis@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[14:47:16] <wikibugs>	 (03PS7) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642)
[14:48:55] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1224891 (owner: 10Muehlenhoff)
[14:49:37] <wikibugs>	 (03PS1) 10Btullis: Revert "Failover the hive server2 and metastore services to the standby" [dns] - 10https://gerrit.wikimedia.org/r/1224967
[14:49:44] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage
[14:49:49] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P86975 and previous config saved to /var/cache/conftool/dbconfig/20260109-144948-marostegui.json
[14:50:19] <wikibugs>	 (03CR) 10Marostegui: "let me know when pushed, so we can test it" [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto)
[14:50:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "Failover the hive server2 and metastore services to the standby" [dns] - 10https://gerrit.wikimedia.org/r/1224967 (owner: 10Btullis)
[14:53:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T413559#11507693 (10Jclark-ctr) 05Open→03Resolved
[14:54:57] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage
[14:55:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:56:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:56:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11507696 (10Jgreen) * NIC.Embedded.1-1-1 pxe disabled * NIC.Integrated.1-1-1 pxe enabled * boot method changed from UEFI to BIOS
[14:56:40] <wikibugs>	 (03PS1) 10Gkyziridis: revert-risk: Roll back mulkilingual model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224969 (https://phabricator.wikimedia.org/T411786)
[14:58:56] <wikibugs>	 (03CR) 10Btullis: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1224967 (owner: 10Btullis)
[14:59:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T413525)', diff saved to https://phabricator.wikimedia.org/P86976 and previous config saved to /var/cache/conftool/dbconfig/20260109-145910-marostegui.json
[14:59:14] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[14:59:38] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] revert-risk: Roll back mulkilingual model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224969 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis)
[14:59:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P86977 and previous config saved to /var/cache/conftool/dbconfig/20260109-145956-marostegui.json
[15:00:06] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:01:38] <wikibugs>	 (03Merged) 10jenkins-bot: revert-risk: Roll back mulkilingual model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224969 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis)
[15:02:02] <jinxer-wm>	 FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:02:44] <logmsgbot>	 !log gkyziridis@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[15:02:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:03:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:03:08] <logmsgbot>	 !log gkyziridis@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[15:05:49] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto)
[15:06:18] <wikibugs>	 (03CR) 10Federico Ceratto: [V:03+2 C:03+2] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto)
[15:07:24] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507718 (10revi) >>! In T414173#11507526, @Xqt wrote: > @Fabfur: Currently we have this UA:  > `version (wikipedia:de; User:Xqtest) Pywikibot/11.0.0.dev10 (g20136...
[15:09:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P86978 and previous config saved to /var/cache/conftool/dbconfig/20260109-150918-marostegui.json
[15:10:05] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T413525)', diff saved to https://phabricator.wikimedia.org/P86979 and previous config saved to /var/cache/conftool/dbconfig/20260109-151005-marostegui.json
[15:10:09] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[15:10:21] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[15:10:30] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T413525)', diff saved to https://phabricator.wikimedia.org/P86980 and previous config saved to /var/cache/conftool/dbconfig/20260109-151029-marostegui.json
[15:10:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:11:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:11:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1168 - https://phabricator.wikimedia.org/T413704#11507727 (10BTullis) 05Open→03Resolved I checked the physical disks. ` btullis@an-worker1168:~$ sudo perccli64 /c0 show all  <snip>  PD LIST : =======  --------------------------------------------...
[15:14:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205 (10MoritzMuehlenhoff) 03NEW
[15:14:18] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507752 (10Joe) >>! In T414173#11507634, @Xqt wrote: > @Joe, @Tgr: Could you please consider postponing the newly introduced restriction until the Pywikibot User-...
[15:14:20] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11507753 (10MoritzMuehlenhoff) p:05Triage→03Medium
[15:17:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:17:09] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1003.eqiad.wmnet with OS trixie
[15:18:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11507764 (10BTullis) 05Open→03Resolved I checked the physical disks with: ` sudo perccli64 /c0 show all <snip> PD LIST : =======  -----------------...
[15:18:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11507766 (10Papaul) ` Thanks for confirming. We recommend blocking those packets, if possible, using dst mac 01:80:c2:00:00:00 on the mgmt switch, whi...
[15:19:06] <icinga-wm>	 RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1198 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[15:19:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P86981 and previous config saved to /var/cache/conftool/dbconfig/20260109-151927-marostegui.json
[15:19:36] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507767 (10Joe) To clarify:     - Users on toolsforge or cloud VPS are exempt from the limit   - I only see about 5% of all requests with UA containing `User:XXX`...
[15:21:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T413525)', diff saved to https://phabricator.wikimedia.org/P86982 and previous config saved to /var/cache/conftool/dbconfig/20260109-152120-marostegui.json
[15:21:24] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[15:21:55] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs cinder backups: move all backups to 2003 so 2004 can be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/1224974 (https://phabricator.wikimedia.org/T375217)
[15:21:57] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudbackup: flip all backups from cloudbackup1004 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1224975 (https://phabricator.wikimedia.org/T375217)
[15:22:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:23:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] wmcs cinder backups: move all backups to 2003 so 2004 can be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/1224974 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott)
[15:25:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudbackup: flip all backups from cloudbackup1004 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1224975 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott)
[15:27:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:29:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T413525)', diff saved to https://phabricator.wikimedia.org/P86983 and previous config saved to /var/cache/conftool/dbconfig/20260109-152935-marostegui.json
[15:29:39] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[15:29:52] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance
[15:31:29] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P86984 and previous config saved to /var/cache/conftool/dbconfig/20260109-153128-marostegui.json
[15:31:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11507818 (10BTullis) 05Open→03Resolved Checked the current state of the disks. ` btullis@an-worker1191:~$ sudo perccli64 /c0 show all  <snip>  PD L...
[15:32:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:34:04] <icinga-wm>	 RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1191 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[15:34:10] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:22] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507843 (elukey) Tried another test, this time on build2002 (bookworm, with a more up-to-date version of dockerd). I tried to push the calico typha's image (less...
[15:37:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:41:37] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P86985 and previous config saved to /var/cache/conftool/dbconfig/20260109-154136-marostegui.json
[15:42:02] <jinxer-wm>	 RESOLVED: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:42:12] <wikibugs>	 (PS1) Fabfur: cache:haproxy: add new contact type [puppet] - https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173)
[15:42:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11507887 (BTullis) Open→Resolved Created the new VD. ` btullis@an-worker1148:~$ sudo megacli -CfgLdAdd -r0 [32:1] -a0                                        Adap...
[15:43:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86986 and previous config saved to /var/cache/conftool/dbconfig/20260109-154301-marostegui.json
[15:43:06] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[15:43:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11507890 (trueg) @gmodena , could you please help here. I assume that I need full access but I really do not know.
[15:43:07] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[15:45:37] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11507899 (Dzahn) Due to an unrelated temp issue with the DNS repo we changed the plan slightly and disabled the site at trafficserver level...
[15:48:00] <wikibugs>	 (PS1) Giuseppe Lavagetto: cache::text: add eqiad and codfw WMCS public addresses to extra_trust [puppet] - https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545)
[15:49:37] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) (owner: Giuseppe Lavagetto)
[15:50:44] <wikibugs>	 (CR) CDanis: [C:+1] cache::text: add eqiad and codfw WMCS public addresses to extra_trust [puppet] - https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) (owner: Giuseppe Lavagetto)
[15:51:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T413525)', diff saved to https://phabricator.wikimedia.org/P86987 and previous config saved to /var/cache/conftool/dbconfig/20260109-155143-marostegui.json
[15:51:48] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[15:52:00] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[15:52:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T413525)', diff saved to https://phabricator.wikimedia.org/P86988 and previous config saved to /var/cache/conftool/dbconfig/20260109-155207-marostegui.json
[15:53:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P86989 and previous config saved to /var/cache/conftool/dbconfig/20260109-155309-marostegui.json
[15:54:13] <wikibugs>	 (CR) Scott French: [C:+1] cache::text: add eqiad and codfw WMCS public addresses to extra_trust [puppet] - https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) (owner: Giuseppe Lavagetto)
[15:55:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:57:35] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2207.codfw.wmnet with reason: Maintenance
[15:57:39] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1 C:+2] cache::text: add eqiad and codfw WMCS public addresses to extra_trust [puppet] - https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) (owner: Giuseppe Lavagetto)
[15:57:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86990 and previous config saved to /var/cache/conftool/dbconfig/20260109-155743-marostegui.json
[15:57:50] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[15:58:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:00:05] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11507990 (Jgreen) Open→Resolved Done!
[16:01:49] <wikibugs>	 (Restored) Dzahn: point wikipedia25.org to ncredir [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:01:57] <wikibugs>	 (CR) Dzahn: "recheck" [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:02:20] <wikibugs>	 (CR) Dzahn: "testing CI after netbox was synced" [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:02:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T413525)', diff saved to https://phabricator.wikimedia.org/P86991 and previous config saved to /var/cache/conftool/dbconfig/20260109-160255-marostegui.json
[16:02:59] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[16:03:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P86992 and previous config saved to /var/cache/conftool/dbconfig/20260109-160318-marostegui.json
[16:04:30] <wikibugs>	 (CR) Thcipriani: [C:+1] "Should be safe to merge now. The last two train branches (1.46.0-wmf.7 and 1.46.0-wmf.10) both contain TestKitchen." [mediawiki-config] - https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[16:07:21] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[16:09:02] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1148 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:13:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P86993 and previous config saved to /var/cache/conftool/dbconfig/20260109-161304-marostegui.json
[16:13:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86994 and previous config saved to /var/cache/conftool/dbconfig/20260109-161326-marostegui.json
[16:13:31] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[16:13:32] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[16:13:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
[16:15:47] <wikibugs>	 (CR) Dzahn: [C:-2] "do not merge - just here for testing CI" [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:18:32] <wikibugs>	 (CR) Dillon: "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) (owner: Kgraessle)
[16:18:38] <wikibugs>	 (CR) Dillon: [C:+1] When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) (owner: Kgraessle)
[16:23:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P86995 and previous config saved to /var/cache/conftool/dbconfig/20260109-162312-marostegui.json
[16:25:30] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86996 and previous config saved to /var/cache/conftool/dbconfig/20260109-162529-marostegui.json
[16:25:34] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[16:27:35] <wikibugs>	 (PS2) DCausse: airflow-search: add enterprise extra_secrets [deployment-charts] - https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066)
[16:33:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T413525)', diff saved to https://phabricator.wikimedia.org/P86997 and previous config saved to /var/cache/conftool/dbconfig/20260109-163320-marostegui.json
[16:33:25] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[16:33:33] <wikibugs>	 (PS2) Fabfur: cache:haproxy: add new contact type [puppet] - https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173)
[16:33:37] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[16:34:35] <wikibugs>	 (PS1) Btullis: Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977)
[16:34:44] <wikibugs>	 (PS2) Btullis: Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977)
[16:34:47] <wikibugs>	 (CR) CI reject: [V:-1] Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) (owner: Btullis)
[16:35:02] <wikibugs>	 (CR) Btullis: [C:+2] Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977) (owner: Btullis)
[16:35:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P86998 and previous config saved to /var/cache/conftool/dbconfig/20260109-163538-marostegui.json
[16:36:49] <wikibugs>	 (Merged) jenkins-bot: Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977) (owner: Btullis)
[16:39:32] <wikibugs>	 (PS3) Btullis: Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977)
[16:40:24] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216 (RobH) NEW
[16:40:53] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216#11508121 (RobH) a:BTullis Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add...
[16:41:10] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216#11508129 (RobH)
[16:42:41] <wikibugs>	 (CR) Btullis: [C:+2] Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) (owner: Btullis)
[16:44:23] <wikibugs>	 (Merged) jenkins-bot: Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) (owner: Btullis)
[16:45:16] <wikibugs>	 (PS3) Fabfur: cache:haproxy: add new contact type [puppet] - https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173)
[16:45:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P86999 and previous config saved to /var/cache/conftool/dbconfig/20260109-164546-marostegui.json
[16:49:07] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Fabrizio!" [puppet] - https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173) (owner: Fabfur)
[16:49:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:52:47] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11508155 (trueg) I am sorry, I do not know what this means: "Grafana access is granted by having an LDAP account." Is the LDAP account not my dev account?
[16:54:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002"
[16:54:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002"
[16:54:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:55:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T413525)', diff saved to https://phabricator.wikimedia.org/P87000 and previous config saved to /var/cache/conftool/dbconfig/20260109-165554-marostegui.json
[16:55:58] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[16:56:12] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2225.codfw.wmnet with reason: Maintenance
[16:56:15] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[16:56:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2225 (T413525)', diff saved to https://phabricator.wikimedia.org/P87001 and previous config saved to /var/cache/conftool/dbconfig/20260109-165619-marostegui.json
[16:56:25] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[16:57:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:00:25] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[17:00:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T413525)', diff saved to https://phabricator.wikimedia.org/P87002 and previous config saved to /var/cache/conftool/dbconfig/20260109-170033-marostegui.json
[17:02:47] <wikibugs>	 (CR) Fabfur: [C:+2] cache:haproxy: add new contact type [puppet] - https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173) (owner: Fabfur)
[17:04:24] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[17:04:32] <logmsgbot>	 pt1979@cumin2002 netbox (PID 3302416) is awaiting input
[17:07:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002"
[17:07:12] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002"
[17:07:12] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:10:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:10:55] <wikibugs>	 SRE, Traffic, Patch-For-Review: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11508207 (Fabfur) We're now allowing this new type of contact information in User-Agent string, this change should be propagated shortly. P...
[17:13:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002"
[17:13:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002"
[17:13:53] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:16:29] <wikibugs>	 (CR) Dzahn: [C:-2] "recheck" [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[17:17:07] <wikibugs>	 (CR) Papaul: "recheck" [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[17:19:55] <wikibugs>	 (Abandoned) Dzahn: point wikipedia25.org to ncredir [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[17:23:00] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[17:24:10] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:25:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T413525)', diff saved to https://phabricator.wikimedia.org/P87003 and previous config saved to /var/cache/conftool/dbconfig/20260109-172519-marostegui.json
[17:25:23] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[17:29:37] <wikibugs>	 (PS1) Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1224998
[17:29:45] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[17:30:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T413525)', diff saved to https://phabricator.wikimedia.org/P87004 and previous config saved to /var/cache/conftool/dbconfig/20260109-173023-marostegui.json
[17:30:27] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[17:31:58] <wikibugs>	 (CR) CI reject: [V:-1] helmfile.d: add sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1224998 (owner: Jasmine)
[17:32:40] <wikibugs>	 (PS2) Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1224998
[17:33:23] <wikibugs>	 (PS1) Bking: opensearch-ipoid: move service to "production" status. [puppet] - https://gerrit.wikimedia.org/r/1224999 (https://phabricator.wikimedia.org/T412447)
[17:35:06] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Blake!" [cookbooks] - https://gerrit.wikimedia.org/r/1224945 (https://phabricator.wikimedia.org/T412211) (owner: Blake)
[17:35:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P87005 and previous config saved to /var/cache/conftool/dbconfig/20260109-173527-marostegui.json
[17:38:06] <wikibugs>	 (CR) Blake: [C:+2] sre.discovery.datacenter: use service registry for exclusions [cookbooks] - https://gerrit.wikimedia.org/r/1224945 (https://phabricator.wikimedia.org/T412211) (owner: Blake)
[17:38:51] <wikibugs>	 (PS2) Bking: opensearch-ipoid: move service to "production" status. [puppet] - https://gerrit.wikimedia.org/r/1224999 (https://phabricator.wikimedia.org/T412447)
[17:40:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P87006 and previous config saved to /var/cache/conftool/dbconfig/20260109-174031-marostegui.json
[17:41:58] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11508332 (elukey) The sequence of events before the blob unknown seems to be the following on the docker registry:  1) "PUT /v2/test/calico/node/blobs/uploads/......
[17:45:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P87007 and previous config saved to /var/cache/conftool/dbconfig/20260109-174536-marostegui.json
[17:46:59] <wikibugs>	 SRE: New SRE manager - Get emails sent to noc - https://phabricator.wikimedia.org/T414223 (MLechvien-WMF) NEW
[17:48:14] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie
[17:48:46] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11508364 (elukey) Looks like it worked!  ` elukey@build2002:~$ sudo docker push registry1004.eqiad.wmnet:5002/test/restricted/mediawiki-webserver:2025-03-04-10595...
[17:50:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P87008 and previous config saved to /var/cache/conftool/dbconfig/20260109-175039-marostegui.json
[17:55:45] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T413525)', diff saved to https://phabricator.wikimedia.org/P87009 and previous config saved to /var/cache/conftool/dbconfig/20260109-175544-marostegui.json
[17:55:48] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[17:56:01] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2226.codfw.wmnet with reason: Maintenance
[17:56:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T413525)', diff saved to https://phabricator.wikimedia.org/P87010 and previous config saved to /var/cache/conftool/dbconfig/20260109-175609-marostegui.json
[17:58:44] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:59:02] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[18:00:06] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:00:48] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T413525)', diff saved to https://phabricator.wikimedia.org/P87011 and previous config saved to /var/cache/conftool/dbconfig/20260109-180047-marostegui.json
[18:00:51] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11508410 (KFrancis) Hi all, the NDA is out for signatures.  I'll confirm when it's complete.
[18:00:51] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[18:01:04] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[18:01:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T413525)', diff saved to https://phabricator.wikimedia.org/P87012 and previous config saved to /var/cache/conftool/dbconfig/20260109-180112-marostegui.json
[18:04:10] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:04:41] <wikibugs>	 (PS5) Clare Ming: Deploy TestKitchen to Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806)
[18:05:54] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[18:09:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T413525)', diff saved to https://phabricator.wikimedia.org/P87013 and previous config saved to /var/cache/conftool/dbconfig/20260109-180900-marostegui.json
[18:09:04] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[18:09:13] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11508477 (KFrancis) Hi all, the NDA is out for signatures.  I'll confirm when it's complete.
[18:12:22] <wikibugs>	 (PS1) Clare Ming: Deploy TestKitchen to testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806)
[18:13:05] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[18:19:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P87014 and previous config saved to /var/cache/conftool/dbconfig/20260109-181908-marostegui.json
[18:24:10] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:24:51] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[18:29:18] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P87015 and previous config saved to /var/cache/conftool/dbconfig/20260109-182917-marostegui.json
[18:31:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T413525)', diff saved to https://phabricator.wikimedia.org/P87016 and previous config saved to /var/cache/conftool/dbconfig/20260109-183103-marostegui.json
[18:31:07] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[18:31:58] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[18:35:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11508591 (cmooney) If we configured the mgmt switches we'd just have spanning-tree portfast or something on access interfaces, so they wouldn't be s...
[18:39:08] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[18:39:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T413525)', diff saved to https://phabricator.wikimedia.org/P87017 and previous config saved to /var/cache/conftool/dbconfig/20260109-183926-marostegui.json
[18:39:30] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[18:39:31] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2238.codfw.wmnet with reason: Maintenance
[18:39:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87018 and previous config saved to /var/cache/conftool/dbconfig/20260109-183939-marostegui.json
[18:41:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P87019 and previous config saved to /var/cache/conftool/dbconfig/20260109-184111-marostegui.json
[18:41:18] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:42:44] <wikibugs>	 (03PS1) 10Cathal Mooney: Add include statement for netbox snippet for 2620:0:860:137::/64 [dns] - 10https://gerrit.wikimedia.org/r/1225007 (https://phabricator.wikimedia.org/T410717)
[18:43:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add include statement for netbox snippet for 2620:0:860:137::/64 [dns] - 10https://gerrit.wikimedia.org/r/1225007 (https://phabricator.wikimedia.org/T410717) (owner: 10Cathal Mooney)
[18:44:20] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add DNS for codfw lsw1-a4 to mr1-codfw IPv6 IPs - cmooney@cumin1003"
[18:44:25] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add DNS for codfw lsw1-a4 to mr1-codfw IPv6 IPs - cmooney@cumin1003"
[18:44:25] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:44:48] <wikibugs>	 (03PS2) 10Cathal Mooney: Add include statement for netbox snippet for 2620:0:860:137::/64 [dns] - 10https://gerrit.wikimedia.org/r/1225007 (https://phabricator.wikimedia.org/T410717)
[18:45:50] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add include statement for netbox snippet for 2620:0:860:137::/64 [dns] - 10https://gerrit.wikimedia.org/r/1225007 (https://phabricator.wikimedia.org/T410717) (owner: 10Cathal Mooney)
[18:46:02] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[18:46:51] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[18:47:06] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:49:51] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:51:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P87020 and previous config saved to /var/cache/conftool/dbconfig/20260109-185120-marostegui.json
[18:57:18] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie
[19:01:29] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T413525)', diff saved to https://phabricator.wikimedia.org/P87021 and previous config saved to /var/cache/conftool/dbconfig/20260109-190128-marostegui.json
[19:01:32] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:01:45] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[19:04:10] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:09:37] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87022 and previous config saved to /var/cache/conftool/dbconfig/20260109-190936-marostegui.json
[19:09:40] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:13:46] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] "That change as already merged so we can resolve this comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[19:14:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11508669 (10Papaul) I think the quick fix here is for us to go with your option (2) exclude any interface called "mgmt0"  for the time being and when...
[19:17:04] <wikibugs>	 (03PS2) 10Santiago Faci: Deploy TestKitchen to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[19:19:45] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P87023 and previous config saved to /var/cache/conftool/dbconfig/20260109-191944-marostegui.json
[19:25:36] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] Deploy TestKitchen to Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[19:25:50] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] Deploy TestKitchen to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[19:28:36] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1254.eqiad.wmnet with reason: Maintenance
[19:28:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T413525)', diff saved to https://phabricator.wikimedia.org/P87024 and previous config saved to /var/cache/conftool/dbconfig/20260109-192844-marostegui.json
[19:28:48] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:29:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P87025 and previous config saved to /var/cache/conftool/dbconfig/20260109-192953-marostegui.json
[19:36:03] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[19:40:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87026 and previous config saved to /var/cache/conftool/dbconfig/20260109-194001-marostegui.json
[19:40:05] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:43:10] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[19:54:53] <logmsgbot>	 !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[19:57:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T413525)', diff saved to https://phabricator.wikimedia.org/P87027 and previous config saved to /var/cache/conftool/dbconfig/20260109-195731-marostegui.json
[19:57:35] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:59:21] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage
[20:00:08] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[20:02:47] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage
[20:07:04] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11508763 (10SerDIDG) Maybe it's somehow related. I use [[https://github.com/siddharthvp/mwn|siddharthvp/mwn]] for deploying my gadget. But a couple of days ago my...
[20:07:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P87028 and previous config saved to /var/cache/conftool/dbconfig/20260109-200739-marostegui.json
[20:17:49] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P87029 and previous config saved to /var/cache/conftool/dbconfig/20260109-201748-marostegui.json
[20:19:06] <logmsgbot>	 !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie
[20:19:54] <wikibugs>	 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11508769 (10taavi) >>! In T414173#11508763, @SerDIDG wrote: > Maybe it's somehow related. I use [[https://github.com/siddharthvp/mwn|siddharthvp/mwn]] for deployin...
[20:21:04] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie
[20:21:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:23:41] <logmsgbot>	 !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on wdqs1029.eqiad.wmnet with reason: T412451
[20:23:44] <stashbot>	 T412451: 4 failed reimages on wdqs1029, 1030, 1031, 1032 - https://phabricator.wikimedia.org/T412451
[20:26:09] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:27:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:27:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T413525)', diff saved to https://phabricator.wikimedia.org/P87030 and previous config saved to /var/cache/conftool/dbconfig/20260109-202756-marostegui.json
[20:28:00] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[20:28:13] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1259.eqiad.wmnet with reason: Maintenance
[20:28:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T413525)', diff saved to https://phabricator.wikimedia.org/P87031 and previous config saved to /var/cache/conftool/dbconfig/20260109-202821-marostegui.json
[20:29:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:32:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:33:09] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:33:31] <wikibugs>	 (03PS1) 10JHathaway: debian installer: format EFI partions [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451)
[20:33:46] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) (owner: 10JHathaway)
[20:37:17] <wikibugs>	 (03PS2) 10JHathaway: debian installer: format EFI partions [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451)
[20:37:18] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) (owner: 10JHathaway)
[20:47:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:58:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T413525)', diff saved to https://phabricator.wikimedia.org/P87032 and previous config saved to /var/cache/conftool/dbconfig/20260109-205805-marostegui.json
[20:58:09] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[20:58:54] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging now so I'll have a couple of hours to make sure it doesn't set off alerts before we head out for the weekend." [puppet] - 10https://gerrit.wikimedia.org/r/1224999 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking)
[20:59:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:02:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:02:56] <wikibugs>	 (03PS1) 10BryanDavis: extension-list: add a bogus extension to test l10n-update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225023 (https://phabricator.wikimedia.org/T411516)
[21:05:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:07:31] <wikibugs>	 (03CR) 10BryanDavis: "Patch to merge as part of the 1.46-wmf.11 train process. See the commit message for an explanation of what is being tested and why." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225023 (https://phabricator.wikimedia.org/T411516) (owner: 10BryanDavis)
[21:08:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P87033 and previous config saved to /var/cache/conftool/dbconfig/20260109-210813-marostegui.json
[21:16:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:18:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P87034 and previous config saved to /var/cache/conftool/dbconfig/20260109-211822-marostegui.json
[21:19:10] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:24:10] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:27:15] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:27:15] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:28:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T413525)', diff saved to https://phabricator.wikimedia.org/P87035 and previous config saved to /var/cache/conftool/dbconfig/20260109-212830-marostegui.json
[21:28:34] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[21:28:47] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:33:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:11] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 6.740 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:11] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 6.884 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:53:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:59:58] <wikibugs>	 (03PS1) 10Bking: [DO NOT MERGE] opensearch-ipoid: Add a path and timeout to blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1225029 (https://phabricator.wikimedia.org/T414037)
[22:00:20] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225029 (https://phabricator.wikimedia.org/T414037) (owner: 10Bking)
[22:02:11] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:11:22] <wikibugs>	 (03Abandoned) 10Bking: [DO NOT MERGE] opensearch-ipoid: Add a path and timeout to blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1225029 (https://phabricator.wikimedia.org/T414037) (owner: 10Bking)
[22:11:47] <wikibugs>	 (03CR) 10Ryan Kemper: "Checking grafana explore for `probe_ssl_earliest_cert_expiry{module=~'.*ipoid.*'}`, this had the intended effect" [puppet] - 10https://gerrit.wikimedia.org/r/1224999 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking)
[22:12:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:24:10] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:27:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:29:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:33:13] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:51:19] <wikibugs>	 (03PS1) 10Zabe: Close kywikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225034 (https://phabricator.wikimedia.org/T413845)
[23:09:45] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: Add kai to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1225036 (https://phabricator.wikimedia.org/T414234)
[23:31:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:32:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:33:26] <wikibugs>	 (03PS1) 10Jasmine: charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042
[23:35:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042 (owner: 10Jasmine)
[23:35:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:36:55] <wikibugs>	 (03PS2) 10Jasmine: charts: add sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225042
[23:38:09] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:47:53] <wikibugs>	 (03Abandoned) 10Jasmine: charts: add Sophroid deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217570 (owner: 10Jasmine)
[23:51:03] <wikibugs>	 (03PS3) 10Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998