[00:14:09] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:40:24] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224244
[00:40:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224244 (owner: TrainBranchBot)
[00:51:46] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224244 (owner: TrainBranchBot)
[01:01:01] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:03:25] <wikibugs>	 (PS1) Aaron Schulz: Use meta.wikimedia.org for "wmf-restbase-global" sandbox specs [mediawiki-config] - https://gerrit.wikimedia.org/r/1224253
[01:10:29] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224254
[01:10:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224254 (owner: TrainBranchBot)
[01:14:00] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 59s)
[01:24:09] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:33:51] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224254 (owner: TrainBranchBot)
[02:12:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:47:57] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:49:09] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:49:12] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:49:13] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:50:06] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:51:19] <rzl>	 !ack
[02:51:20] <sirenbot>	 7259 (RESOLVED)  frban1002/check_ipsec
[02:51:20] <sirenbot>	 7259 (RESOLVED)  frban1002/check_ipsec
[02:51:36] <rzl>	 hmm, I'll look at that later
[02:51:37] <rzl>	 !incidents
[02:51:37] <sirenbot>	 7287 (ACKED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[02:51:37] <sirenbot>	 7288 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[02:51:38] <sirenbot>	 7289 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[02:51:38] <sirenbot>	 7284 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[02:51:38] <sirenbot>	 7283 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[02:51:38] <sirenbot>	 7282 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[02:51:38] <sirenbot>	 7281 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[02:51:39] <sirenbot>	 7280 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[02:52:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:01] <urandom>	 \o
[02:53:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[02:54:05] <rzl>	 try it again...
[02:54:05] <rzl>	 !ack
[02:54:06] <sirenbot>	 7259 (RESOLVED)  frban1002/check_ipsec
[02:54:09] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:54:12] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:54:13] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:54:29] <rzl>	 okay it's acking correctly, just replying with the wrong incident
[02:56:12] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:56:13] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:56:22] <rzl>	 !ack
[02:56:23] <sirenbot>	 7259 (RESOLVED)  frban1002/check_ipsec
[02:56:23] <sirenbot>	 7259 (RESOLVED)  frban1002/check_ipsec
[02:58:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[02:59:26] <rzl>	 !ack
[02:59:26] <sirenbot>	 no value provided for parameter incident and no default available
[02:59:27] <sirenbot>	 Incident id must be an integer
[03:00:57] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:03:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:05:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:07:06] <urandom>	 your turnilo-fu is strong swfrench-wmf
[03:08:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:09:09] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:11:12] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[03:11:13] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[03:12:29] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet CI, Puppet-Infrastructure, Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11502051 (hashar) Something I forgot, the `operations-puppet-catalog-...
[03:13:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:35:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:38:56] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:43:51] <jinxer-wm>	 RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:48:56] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:50:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:14:09] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:09:09] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:09] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:34:09] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:12] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[05:34:41] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[05:34:50] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86778 and previous config saved to /var/cache/conftool/dbconfig/20260108-053449-marostegui.json
[05:34:54] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:34:54] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:43:47] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2249 [puppet] - https://gerrit.wikimedia.org/r/1224527 (https://phabricator.wikimedia.org/T407941)
[05:45:35] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize db2249 [puppet] - https://gerrit.wikimedia.org/r/1224527 (https://phabricator.wikimedia.org/T407941) (owner: Marostegui)
[05:51:16] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2231.codfw.wmnet onto db2249.codfw.wmnet
[05:51:21] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db2231 - Depool db2231.codfw.wmnet to then clone it to db2249.codfw.wmnet - marostegui@cumin1003
[05:51:41] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2231 - Depool db2231.codfw.wmnet to then clone it to db2249.codfw.wmnet - marostegui@cumin1003
[05:52:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11502131 (Marostegui) Stalled→Open
[05:52:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11502132 (Marostegui)
[05:53:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11502134 (Marostegui) @KOfori you are the approver for `cassandra-staging-devs` can you take a look at this? thanks
[05:58:34] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[05:58:53] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:59:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T413525)', diff saved to https://phabricator.wikimedia.org/P86780 and previous config saved to /var/cache/conftool/dbconfig/20260108-055901-marostegui.json
[05:59:05] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[06:02:44] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool pc1013: test
[06:02:44] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:02:52] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:02:53] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool pc1013: test
[06:03:47] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool pc1013: test
[06:03:47] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:04:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:04:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool pc1013: test
[06:05:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool db2144: test
[06:05:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:05:14] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:05:14] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool db2144: test
[06:05:26] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2144: test
[06:05:26] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:05:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:05:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2144: test
[06:05:52] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T413525)', diff saved to https://phabricator.wikimedia.org/P86785 and previous config saved to /var/cache/conftool/dbconfig/20260108-060551-marostegui.json
[06:05:55] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[06:06:19] <wikibugs>	 (CR) Marostegui: "Hi, thanks for this." [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[06:12:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:16:00] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P86786 and previous config saved to /var/cache/conftool/dbconfig/20260108-061600-marostegui.json
[06:26:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P86787 and previous config saved to /var/cache/conftool/dbconfig/20260108-062608-marostegui.json
[06:35:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:36:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T413525)', diff saved to https://phabricator.wikimedia.org/P86788 and previous config saved to /var/cache/conftool/dbconfig/20260108-063616-marostegui.json
[06:36:20] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[06:36:34] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[06:36:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T413525)', diff saved to https://phabricator.wikimedia.org/P86789 and previous config saved to /var/cache/conftool/dbconfig/20260108-063642-marostegui.json
[06:40:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:42:09] <wikibugs>	 (PS1) Ayounsi: Revert^2 "Line-wrap Homer diffs" [puppet] - https://gerrit.wikimedia.org/r/1224528
[06:43:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T413525)', diff saved to https://phabricator.wikimedia.org/P86790 and previous config saved to /var/cache/conftool/dbconfig/20260108-064333-marostegui.json
[06:43:37] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[06:44:02] <wikibugs>	 (CR) CI reject: [V:-1] Revert^2 "Line-wrap Homer diffs" [puppet] - https://gerrit.wikimedia.org/r/1224528 (owner: Ayounsi)
[06:53:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P86791 and previous config saved to /var/cache/conftool/dbconfig/20260108-065342-marostegui.json
[06:59:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:59:46] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T0700).
[07:00:06] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:01:40] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:01:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:03:52] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P86792 and previous config saved to /var/cache/conftool/dbconfig/20260108-070351-marostegui.json
[07:04:09] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:08:07] <wikibugs>	 (PS1) Marostegui: filtered_tables.txt: Add new columns [puppet] - https://gerrit.wikimedia.org/r/1224533 (https://phabricator.wikimedia.org/T413688)
[07:08:22] <wikibugs>	 (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1224533 (https://phabricator.wikimedia.org/T413688) (owner: Marostegui)
[07:08:38] <wikibugs>	 (CR) Marostegui: [C:+2] filtered_tables.txt: Add new columns [puppet] - https://gerrit.wikimedia.org/r/1224533 (https://phabricator.wikimedia.org/T413688) (owner: Marostegui)
[07:13:46] <wikibugs>	 (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.5 [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053)
[07:14:00] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T413525)', diff saved to https://phabricator.wikimedia.org/P86793 and previous config saved to /var/cache/conftool/dbconfig/20260108-071359-marostegui.json
[07:14:03] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[07:14:05] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[07:14:13] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T413525)', diff saved to https://phabricator.wikimedia.org/P86794 and previous config saved to /var/cache/conftool/dbconfig/20260108-071413-marostegui.json
[07:21:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T413525)', diff saved to https://phabricator.wikimedia.org/P86795 and previous config saved to /var/cache/conftool/dbconfig/20260108-072130-marostegui.json
[07:21:34] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[07:22:07] <wikibugs>	 (CR) Dzahn: [C:+1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.5 [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) (owner: Jelto)
[07:22:08] <wikibugs>	 (CR) Arnaudb: "looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) (owner: Jelto)
[07:22:11] <wikibugs>	 (CR) Arnaudb: [C:+1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.5 [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) (owner: Jelto)
[07:23:41] <wikibugs>	 (CR) Jelto: [C:+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.5 [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) (owner: Jelto)
[07:27:47] <wikibugs>	 (PS2) Ayounsi: Revert^2 "Line-wrap Homer diffs" [puppet] - https://gerrit.wikimedia.org/r/1224528
[07:29:51] <wikibugs>	 (CR) Muehlenhoff: [C:-1] "I don't think the patch is correct. Since my patch we're actually receiving a lot less of these and the reason we're still seeing some is " [puppet] - https://gerrit.wikimedia.org/r/1224528 (owner: Ayounsi)
[07:31:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P86796 and previous config saved to /var/cache/conftool/dbconfig/20260108-073139-marostegui.json
[07:31:56] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1224205 (https://phabricator.wikimedia.org/T414032) (owner: Ahmon Dancy)
[07:33:27] <wikibugs>	 (CR) Dzahn: [C:+1] Add Cumin alias for tcpproxy hosts [puppet] - https://gerrit.wikimedia.org/r/1224057 (https://phabricator.wikimedia.org/T408532) (owner: Muehlenhoff)
[07:36:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11502255 (Jclark-ctr) @Clement_Goubert Before I start racking these, do you want to verify that they’re correct by row, since we had so many orders for these?
[07:37:09] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Yubikey-SSH-FIDO: add new key for dancy [puppet] - https://gerrit.wikimedia.org/r/1224205 (https://phabricator.wikimedia.org/T414032) (owner: Ahmon Dancy)
[07:37:54] <wikibugs>	 SRE, SRE-Access-Requests, Release-Engineering-Team, Patch-For-Review: Add yubikey ssh key for dancy - https://phabricator.wikimedia.org/T414032#11502257 (MoritzMuehlenhoff)
[07:41:02] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Confusion cleared up on IRC, this actually makes sense now!" [puppet] - https://gerrit.wikimedia.org/r/1224528 (owner: Ayounsi)
[07:41:19] <wikibugs>	 (CR) Ayounsi: [C:+2] Revert^2 "Line-wrap Homer diffs" [puppet] - https://gerrit.wikimedia.org/r/1224528 (owner: Ayounsi)
[07:41:38] <wikibugs>	 (CR) Dzahn: [C:+2] cache-text: add wikipedia25 to enabled_certificates [puppet] - https://gerrit.wikimedia.org/r/1224096 (https://phabricator.wikimedia.org/T408592) (owner: Jelto)
[07:41:48] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P86797 and previous config saved to /var/cache/conftool/dbconfig/20260108-074147-marostegui.json
[07:43:35] <wikibugs>	 (CR) Joal: [C:+1] "I'm not yet very familiar with charts, but looked ok to me :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1224145 (https://phabricator.wikimedia.org/T413977) (owner: Btullis)
[07:44:53] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add Cumin alias for tcpproxy hosts [puppet] - https://gerrit.wikimedia.org/r/1224057 (https://phabricator.wikimedia.org/T408532) (owner: Muehlenhoff)
[07:46:03] <wikibugs>	 (PS4) Dzahn: cache-text: add wikipedia25 to enabled_certificates [puppet] - https://gerrit.wikimedia.org/r/1224096 (https://phabricator.wikimedia.org/T408592) (owner: Jelto)
[07:47:39] <wikibugs>	 (CR) Dzahn: [C:+2] cache-text: add wikipedia25 to enabled_certificates [puppet] - https://gerrit.wikimedia.org/r/1224096 (https://phabricator.wikimedia.org/T408592) (owner: Jelto)
[07:51:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T413525)', diff saved to https://phabricator.wikimedia.org/P86798 and previous config saved to /var/cache/conftool/dbconfig/20260108-075155-marostegui.json
[07:51:59] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[07:52:12] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[07:52:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86799 and previous config saved to /var/cache/conftool/dbconfig/20260108-075220-marostegui.json
[07:53:31] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11502270 (JAllemandou) a:JAllemandou
[07:55:39] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[07:56:04] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11502277 (JAllemandou) Current hadoop topology:  ` joal@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -p...
[07:59:37] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86800 and previous config saved to /var/cache/conftool/dbconfig/20260108-075937-marostegui.json
[07:59:41] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T0800). nyaa~
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:05:01] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica
[08:06:37] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4904.26 ms
[08:07:05] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica
[08:07:29] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[08:08:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:08:45] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:08:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[08:08:57] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:09:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P86801 and previous config saved to /var/cache/conftool/dbconfig/20260108-080945-marostegui.json
[08:10:06] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:10:12] <effie>	 !!incidents
[08:10:14] <effie>	 !incidents
[08:10:15] <sirenbot>	 7294 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[08:10:15] <sirenbot>	 7295 (ACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[08:10:15] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[08:10:15] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[08:10:16] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[08:10:16] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[08:10:16] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[08:10:17] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[08:10:17] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[08:10:18] <sirenbot>	 7284 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[08:10:18] <sirenbot>	 7283 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[08:10:19] <sirenbot>	 7282 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[08:10:19] <sirenbot>	 7281 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[08:10:20] <sirenbot>	 7280 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[08:10:47] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:53] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 0%, RTA = 1515.83 ms
[08:13:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:13:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:14:09] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:14:09] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:14:19] <effie>	 !incidents
[08:14:19] <sirenbot>	 7294 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[08:14:19] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[08:14:20] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[08:14:20] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[08:14:20] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[08:14:20] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[08:14:20] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[08:14:21] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[08:14:21] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[08:14:22] <sirenbot>	 7284 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[08:14:22] <sirenbot>	 7283 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[08:14:23] <sirenbot>	 7282 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[08:14:23] <sirenbot>	 7281 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[08:14:24] <sirenbot>	 7280 (RESOLVED)  ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw)
[08:14:45] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:14:49] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[08:15:56] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:16:34] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica
[08:19:09] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:19:54] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P86802 and previous config saved to /var/cache/conftool/dbconfig/20260108-081953-marostegui.json
[08:20:04] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[08:25:56] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:26:32] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:09] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:29:49] <wikibugs>	 (PS1) Dzahn: microsites: monitor wikipedia25.org (WIP) [puppet] - https://gerrit.wikimedia.org/r/1224575
[08:30:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86803 and previous config saved to /var/cache/conftool/dbconfig/20260108-083001-marostegui.json
[08:30:05] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:30:06] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:30:18] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[08:30:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T413525)', diff saved to https://phabricator.wikimedia.org/P86804 and previous config saved to /var/cache/conftool/dbconfig/20260108-083026-marostegui.json
[08:31:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[08:34:09] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:36:01] <wikibugs>	 (PS1) Dzahn: add wikipedia25.org to list of wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592)
[08:37:35] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T413525)', diff saved to https://phabricator.wikimedia.org/P86805 and previous config saved to /var/cache/conftool/dbconfig/20260108-083734-marostegui.json
[08:37:38] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[08:41:50] <wikibugs>	 (CR) Dzahn: "compiler says it only changes profile::environment for proxy settings: https://puppet-compiler.wmflabs.org/output/1224576/7857/cp3070.esam" [puppet] - https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[08:45:02] <wikibugs>	 (PS1) Dzahn: add wikipedia25.org to profile::cache::base::wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592)
[08:47:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P86806 and previous config saved to /var/cache/conftool/dbconfig/20260108-084742-marostegui.json
[08:48:09] <wikibugs>	 (PS2) Dzahn: add wikipedia25.org to profile::cache::base::wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592)
[08:48:34] <wikibugs>	 (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1224580/7858/cp3070.esams.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[08:50:11] <wikibugs>	 (CR) Vgutierrez: [C:-2] "varnish uses `$wikimedia_domains = $profile::cache::base::wikimedia_domains` so I think this could be abandoned" [puppet] - https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[08:50:41] <wikibugs>	 (CR) Dzahn: "just found that - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224580" [puppet] - https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[08:50:52] <wikibugs>	 (CR) Vgutierrez: [C:+1] add wikipedia25.org to profile::cache::base::wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[08:51:48] <wikibugs>	 (CR) JMeybohm: docker registry: add ml build user password (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: Dpogorzelski)
[08:57:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P86807 and previous config saved to /var/cache/conftool/dbconfig/20260108-085751-marostegui.json
[08:58:16] <wikibugs>	 (PS1) Dzahn: Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - https://gerrit.wikimedia.org/r/1224581
[08:58:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061 (Tobi_WMDE_SW) NEW
[08:58:46] <wikibugs>	 (CR) CI reject: [V:-1] Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - https://gerrit.wikimedia.org/r/1224581 (owner: Dzahn)
[08:59:06] <wikibugs>	 (Abandoned) Dzahn: add wikipedia25.org to list of wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[09:03:28] <wikibugs>	 (PS3) Muehlenhoff: Add javiermonton to kafka-jumbo-access group [puppet] - https://gerrit.wikimedia.org/r/1218337 (https://phabricator.wikimedia.org/T411774)
[09:04:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11502464 (Tobi_WMDE_SW)
[09:08:00] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T413525)', diff saved to https://phabricator.wikimedia.org/P86808 and previous config saved to /var/cache/conftool/dbconfig/20260108-090759-marostegui.json
[09:08:03] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[09:08:16] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[09:11:20] <icinga-wm>	 PROBLEM - jenkins_service_running on releases2003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[09:11:45] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add javiermonton to kafka-jumbo-access group [puppet] - https://gerrit.wikimedia.org/r/1218337 (https://phabricator.wikimedia.org/T411774) (owner: Muehlenhoff)
[09:12:20] <icinga-wm>	 RECOVERY - jenkins_service_running on releases2003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[09:12:52] <wikibugs>	 (CR) Jelto: [C:+1] "looks good to me, thank you" [puppet] - https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[09:13:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:14:38] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] add wikipedia25.org to profile::cache::base::wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[09:16:44] <logmsgbot>	 jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[09:19:04] <wikibugs>	 (PS2) Slyngshede: Notification for users to link their Phabricator account [software/bitu] - https://gerrit.wikimedia.org/r/1224076
[09:19:25] <wikibugs>	 (CR) Dzahn: "Am I right that a domain is either a "wikimedia_domain" or an "alternate_domain" but not both?" [puppet] - https://gerrit.wikimedia.org/r/1224581 (owner: Dzahn)
[09:19:44] <logmsgbot>	 jelto@cumin1003 upgrade (PID 3866799) is awaiting input
[09:20:20] <wikibugs>	 (PS1) Aklapper: admin: remove old ssh key of aklapper [puppet] - https://gerrit.wikimedia.org/r/1224584 (https://phabricator.wikimedia.org/T413009)
[09:21:50] <wikibugs>	 (PS1) Filippo Giunchedi: Remove spurious 'diff' file [alerts] - https://gerrit.wikimedia.org/r/1224585
[09:23:55] <wikibugs>	 (PS4) Vgutierrez: cache::haproxy: Get rid of http-request after use_backend warning [puppet] - https://gerrit.wikimedia.org/r/1215119
[09:24:09] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:26:36] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove cookbooks to migrate roles/hosts to Puppet [cookbooks] - https://gerrit.wikimedia.org/r/1219861 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:28:21] <wikibugs>	 (CR) Vgutierrez: "alternate_domains lists the domains that need to be handled by the `misc` VCL rather than the `text` VCL on the varnish text cluster, so r" [puppet] - https://gerrit.wikimedia.org/r/1224581 (owner: Dzahn)
[09:28:54] <wikibugs>	 (PS1) Jon Harald Søby: planet: Update Wikimedia Norge's feed URL [puppet] - https://gerrit.wikimedia.org/r/1224587
[09:30:57] <wikibugs>	 (CR) Dzahn: "Tbh, I thought misc cluster did not exist anymore and had been merged into "text". The intent is to treat this like any other micro site h" [puppet] - https://gerrit.wikimedia.org/r/1224581 (owner: Dzahn)
[09:31:52] <XioNoX>	 !log remove pybal BGP group on pfw1-codfw (replaced with Bird) - T414015
[09:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:55] <stashbot>	 T414015: Remove pfw configuration related to former pybal/LVS service - https://phabricator.wikimedia.org/T414015
[09:33:49] <wikibugs>	 (CR) Vgutierrez: [C:+1] "in terms of hardware that's true, but we still have two VCLs. I'd recommend keeping this as similar to wikiworkshop.org as possible" [puppet] - https://gerrit.wikimedia.org/r/1224581 (owner: Dzahn)
[09:34:55] <wikibugs>	 (CR) Dzahn: "gotcha! Yea, that was my approach as well. Comparing to wikiworkshop.org - so in that case we should remove it here and merge this revert." [puppet] - https://gerrit.wikimedia.org/r/1224581 (owner: Dzahn)
[09:35:14] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[09:36:14] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 120153 bytes in 0.412 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[09:36:44] <wikibugs>	 (CR) Muehlenhoff: "Personally I think the global default from WMFConfig.test_on is an antipattern we should get rid of. The current default is still" [puppet] - https://gerrit.wikimedia.org/r/1219149 (owner: Majavah)
[09:37:12] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11502561 (JAllemandou) I have verified in puppet: all hosts in the `default` rack have already been added to the net-topology. N...
[09:37:28] <wikibugs>	 (PS1) Joal: Hieradata/common.yaml: Update hadoop net topology [puppet] - https://gerrit.wikimedia.org/r/1224590 (https://phabricator.wikimedia.org/T413742)
[09:38:46] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[09:40:37] <wikibugs>	 (CR) Btullis: [C:+2] Hieradata/common.yaml: Update hadoop net topology [puppet] - https://gerrit.wikimedia.org/r/1224590 (https://phabricator.wikimedia.org/T413742) (owner: Joal)
[09:40:58] <wikibugs>	 (CR) Slyngshede: Notification for users to link their Phabricator account (5 comments) [software/bitu] - https://gerrit.wikimedia.org/r/1224076 (owner: Slyngshede)
[09:44:02] <wikibugs>	 (CR) Clément Goubert: [V:+2 C:+2] ratelimit: Update to main branch e9ce92c [docker-images/production-images] - https://gerrit.wikimedia.org/r/1224124 (owner: Clément Goubert)
[09:44:46] <wikibugs>	 (CR) Dzahn: [C:+2] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1224587 (owner: Jon Harald Søby)
[09:46:22] <wikibugs>	 (PS2) Dzahn: Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - https://gerrit.wikimedia.org/r/1224581
[09:46:29] <wikibugs>	 (PS7) Dpogorzelski: docker registry: add ml build user password [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[09:46:34] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:46:36] <wikibugs>	 (CR) Dpogorzelski: docker registry: add ml build user password (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: Dpogorzelski)
[09:46:48] <claime>	 !log Rebuilding ratelimit image - T414002
[09:46:49] <wikibugs>	 (PS2) Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642)
[09:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:51] <stashbot>	 T414002: Upgrade ratelimit service to latest main - https://phabricator.wikimedia.org/T414002
[09:46:52] <wikibugs>	 (CR) CI reject: [V:-1] Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - https://gerrit.wikimedia.org/r/1224581 (owner: Dzahn)
[09:48:46] <wikibugs>	 (CR) CI reject: [V:-1] mariadb: monitor GTID usage in replication [alerts] - https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: Federico Ceratto)
[09:49:09] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:50:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[09:50:56] <wikibugs>	 (CR) Jelto: [C:+1] "+1 for keeping wikipedia25 config similar to wikiworkshop" [puppet] - https://gerrit.wikimedia.org/r/1224581 (owner: Dzahn)
[09:54:09] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:54:10] <wikibugs>	 (PS1) Clément Goubert: ratelimit: Fix golang base image version [docker-images/production-images] - https://gerrit.wikimedia.org/r/1224591
[09:56:41] <wikibugs>	 (PS1) Clément Goubert: ratelimit: Fix golang base image version [docker-images/production-images] - https://gerrit.wikimedia.org/r/1224592
[09:57:08] <wikibugs>	 (CR) Elukey: [C:+2] profile::docker_registry: add the ML instance [puppet] - https://gerrit.wikimedia.org/r/1224091 (https://phabricator.wikimedia.org/T412951) (owner: Elukey)
[09:59:31] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[10:02:02] <wikibugs>	 (PS1) Clément Goubert: go mod tidy [software/envoyproxy/ratelimiter] (git20260107.e9ce92c-vendor) - https://gerrit.wikimedia.org/r/1224593
[10:02:36] <wikibugs>	 (PS1) Clément Goubert: Revert "go mod tidy" [software/envoyproxy/ratelimiter] (git20260107.e9ce92c-vendor) - https://gerrit.wikimedia.org/r/1224594
[10:03:38] <wikibugs>	 (CR) Btullis: [C:+2] Add a kyuubi service to the spark-support chart [deployment-charts] - https://gerrit.wikimedia.org/r/1224145 (https://phabricator.wikimedia.org/T413977) (owner: Btullis)
[10:04:38] <wikibugs>	 (PS4) Muehlenhoff: Only select Puppet version based on the Debian release [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798)
[10:05:29] <wikibugs>	 (Merged) jenkins-bot: Add a kyuubi service to the spark-support chart [deployment-charts] - https://gerrit.wikimedia.org/r/1224145 (https://phabricator.wikimedia.org/T413977) (owner: Btullis)
[10:07:49] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance
[10:07:58] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T413525)', diff saved to https://phabricator.wikimedia.org/P86809 and previous config saved to /var/cache/conftool/dbconfig/20260108-100757-marostegui.json
[10:08:01] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[10:09:49] <wikibugs>	 (PS3) Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642)
[10:10:37] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[10:10:58] <wikibugs>	 (CR) CI reject: [V:-1] mariadb: monitor GTID usage in replication [alerts] - https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: Federico Ceratto)
[10:12:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:12] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[10:13:36] <wikibugs>	 (PS1) Clément Goubert: ratelimit: fix go build command [docker-images/production-images] - https://gerrit.wikimedia.org/r/1224597
[10:19:18] <wikibugs>	 (PS4) Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642)
[10:20:42] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:20:46] <wikibugs>	 (CR) CI reject: [V:-1] mariadb: monitor GTID usage in replication [alerts] - https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: Federico Ceratto)
[10:20:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:02] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:21:15] <wikibugs>	 (PS1) Clément Goubert: rest-gateway: Use fixed version for ratelimiter [deployment-charts] - https://gerrit.wikimedia.org/r/1224598 (https://phabricator.wikimedia.org/T414002)
[10:21:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[10:21:42] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:22:20] <vgutierrez>	 uh... :)
[10:22:31] <vgutierrez>	 ^^ expected?
[10:23:25] <wikibugs>	 (CR) Clément Goubert: [C:+2] rest-gateway: Use fixed version for ratelimiter [deployment-charts] - https://gerrit.wikimedia.org/r/1224598 (https://phabricator.wikimedia.org/T414002) (owner: Clément Goubert)
[10:23:28] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[10:23:37] <moritzm>	 it flapped like that earlier as well
[10:23:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:23:58] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:24:09] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:24:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[10:24:49] <vgutierrez>	 tappof: are you around?
[10:25:06] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:25:18] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: Use fixed version for ratelimiter [deployment-charts] - https://gerrit.wikimedia.org/r/1224598 (https://phabricator.wikimedia.org/T414002) (owner: Clément Goubert)
[10:26:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T413525)', diff saved to https://phabricator.wikimedia.org/P86810 and previous config saved to /var/cache/conftool/dbconfig/20260108-102613-marostegui.json
[10:26:17] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[10:27:15] <wikibugs>	 (PS1) Clément Goubert: rest-gateway: Remove ratelimiter staging version override [deployment-charts] - https://gerrit.wikimedia.org/r/1224600
[10:27:20] <wikibugs>	 (PS5) Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642)
[10:28:49] <wikibugs>	 (CR) CI reject: [V:-1] mariadb: monitor GTID usage in replication [alerts] - https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: Federico Ceratto)
[10:29:09] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:22] <wikibugs>	 (03PS3) 10Dzahn: Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/1224581
[10:33:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Only select Puppet version based on the Debian release [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:35:21] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11502687 (10WMDE-leszek) I approve this request on WMDE end.
[10:35:32] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Remove ratelimiter staging version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224600 (owner: 10Clément Goubert)
[10:35:57] <mutante>	 !log wikitech wiki - made lsobanski an admin - T414065
[10:35:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:00] <stashbot>	 T414065: Requesting Wikitech admin access for @lsobanski - https://phabricator.wikimedia.org/T414065
[10:36:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P86811 and previous config saved to /var/cache/conftool/dbconfig/20260108-103622-marostegui.json
[10:36:32] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn)
[10:37:21] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Remove ratelimiter staging version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224600 (owner: 10Clément Goubert)
[10:39:27] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:39:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:43:14] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: Update image and config for revise-tone-task-generator on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224117 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou)
[10:44:52] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Update image and config for revise-tone-task-generator on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224117 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou)
[10:45:49] <wikibugs>	 (03PS1) 10Gkyziridis: revert-risk: Deploy on prod and staging new model version for both language-agnostic and multilingual. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224604 (https://phabricator.wikimedia.org/T411786)
[10:46:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P86812 and previous config saved to /var/cache/conftool/dbconfig/20260108-104630-marostegui.json
[10:50:17] <wikibugs>	 (03PS1) 10Muehlenhoff: puppet: Remove the force_puppet7 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798)
[10:52:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[10:53:14] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:56:02] <wikibugs>	 (03CR) 10Santiago Faci: "I would say we can merge it. My understanding is that this is not a blocker anyway (at least for the renaming on our side) because there i" [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol)
[10:56:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T413525)', diff saved to https://phabricator.wikimedia.org/P86814 and previous config saved to /var/cache/conftool/dbconfig/20260108-105639-marostegui.json
[10:56:43] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[10:56:55] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance
[10:57:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T413525)', diff saved to https://phabricator.wikimedia.org/P86815 and previous config saved to /var/cache/conftool/dbconfig/20260108-105703-marostegui.json
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1100)
[11:01:27] <wikibugs>	 (03PS4) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219)
[11:03:18] <wikibugs>	 (03PS1) 10Ozge: feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607
[11:04:41] <wikibugs>	 (03PS2) 10Ozge: feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607
[11:05:45] <wikibugs>	 (03PS1) 10Fabfur: cache::text enable auth, known, bot ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545)
[11:07:06] <wikibugs>	 (03PS2) 10Fabfur: cache::text enable auth, known, bot ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545)
[11:09:37] <wikibugs>	 (03CR) 10Elukey: docker registry: add ml build user password (4 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[11:09:58] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545)
[11:10:02] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:10:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:11:06] <wikibugs>	 (03PS1) 10Mszwarc: Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929)
[11:11:56] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 (owner: 10Ozge)
[11:14:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc)
[11:15:18] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224616 (https://phabricator.wikimedia.org/T406545)
[11:15:20] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545)
[11:15:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T413525)', diff saved to https://phabricator.wikimedia.org/P86816 and previous config saved to /var/cache/conftool/dbconfig/20260108-111537-marostegui.json
[11:15:47] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[11:16:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:16:58] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:17:19] <wikibugs>	 (03CR) 10Ozge: [C:03+2] feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 (owner: 10Ozge)
[11:17:43] <wikibugs>	 (03CR) 10Ozge: [V:03+2 C:03+2] feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 (owner: 10Ozge)
[11:18:43] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545)
[11:18:45] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545)
[11:19:43] <wikibugs>	 (03Merged) 10jenkins-bot: feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 (owner: 10Ozge)
[11:21:13] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545)
[11:21:15] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545)
[11:22:35] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache::text enable auth, known, bot ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[11:22:36] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11502848 (10elukey) Today I created a new docker distribution instance only for ML, backed by S3/apus and I created a new bucket for it (same account): registry-ml
[11:23:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::scripts [puppet] - 10https://gerrit.wikimedia.org/r/1224089 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[11:25:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P86817 and previous config saved to /var/cache/conftool/dbconfig/20260108-112546-marostegui.json
[11:27:57] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545)
[11:28:40] <wikibugs>	 (03PS4) 10Silvan Heintze: Report progress of Wikibase entity dumps [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423)
[11:28:40] <wikibugs>	 (03PS2) 10Silvan Heintze: Report # of skipped entities by type [dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869)
[11:29:15] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224626 (https://phabricator.wikimedia.org/T406545)
[11:29:37] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[11:30:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1224076 (owner: 10Slyngshede)
[11:30:50] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff)
[11:31:41] <wikibugs>	 (03PS3) 10Muehlenhoff: Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465)
[11:33:16] <wikibugs>	 (03CR) 10Silvan Heintze: "Thanks for the review" [dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) (owner: 10Silvan Heintze)
[11:34:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff)
[11:35:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P86818 and previous config saved to /var/cache/conftool/dbconfig/20260108-113554-marostegui.json
[11:35:59] <wikibugs>	 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11502915 (10MoritzMuehlenhoff)
[11:42:36] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:45:21] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:04-1] "Ack" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: 10Sergio Gimeno)
[11:45:32] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+1] "looking good: https://puppet-compiler.wmflabs.org/output/1224609/7859/" [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[11:46:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T413525)', diff saved to https://phabricator.wikimedia.org/P86819 and previous config saved to /var/cache/conftool/dbconfig/20260108-114602-marostegui.json
[11:46:06] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[11:46:19] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[11:46:20] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text enable auth, known, bot ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[11:46:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T413525)', diff saved to https://phabricator.wikimedia.org/P86820 and previous config saved to /var/cache/conftool/dbconfig/20260108-114627-marostegui.json
[11:50:10] <logmsgbot>	 !log ozge@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:52:40] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[11:53:32] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[11:54:01] <wikibugs>	 (03PS2) 10Fabfur: cache::text: enable unid, browser ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545)
[11:54:17] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Notification for users to link their Phabricator account [software/bitu] - 10https://gerrit.wikimedia.org/r/1224076 (owner: 10Slyngshede)
[11:55:29] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[11:55:39] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+1] cache::text: enable unid, browser ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[11:57:13] <wikibugs>	 (03Merged) 10jenkins-bot: Notification for users to link their Phabricator account [software/bitu] - 10https://gerrit.wikimedia.org/r/1224076 (owner: 10Slyngshede)
[12:00:13] <wikibugs>	 (03PS1) 10Btullis: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977)
[12:00:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[12:00:22] <wikibugs>	 (03PS2) 10Btullis: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977)
[12:02:01] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[12:02:17] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc)
[12:03:41] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Copy rest_v1-wikimedia.json to standard-docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224228 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz)
[12:03:42] <wikibugs>	 (03PS3) 10Btullis: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977)
[12:03:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[12:03:51] <wikibugs>	 (03PS4) 10Btullis: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977)
[12:04:37] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T413525)', diff saved to https://phabricator.wikimedia.org/P86821 and previous config saved to /var/cache/conftool/dbconfig/20260108-120436-marostegui.json
[12:04:40] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[12:06:34] <wikibugs>	 (03PS1) 10Clément Goubert: Revert^2 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224638
[12:07:21] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Revert^2 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224638 (owner: 10Clément Goubert)
[12:11:04] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Revert^2 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224638 (owner: 10Clément Goubert)
[12:13:52] <wikibugs>	 (03PS1) 10Clément Goubert: deploy: Add redioscope kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224643 (https://phabricator.wikimedia.org/T413999)
[12:14:09] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:14:45] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P86822 and previous config saved to /var/cache/conftool/dbconfig/20260108-121445-marostegui.json
[12:16:15] <wikibugs>	 (03PS1) 10Clément Goubert: aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999)
[12:19:02] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11503111 (10JAllemandou) Namenodes restarted on `an-master1003` and `an-master1004`. The alert is solved.
[12:19:43] <wikibugs>	 (03PS1) 10Hashar: admin: hashar: sync .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1224648
[12:24:54] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P86823 and previous config saved to /var/cache/conftool/dbconfig/20260108-122453-marostegui.json
[12:24:59] <wikibugs>	 (03PS8) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[12:25:21] <wikibugs>	 (03CR) 10Dpogorzelski: docker registry: add ml build user password (4 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[12:25:58] <wikibugs>	 (03CR) 10Jelto: [C:03+2] admin: hashar: sync .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1224648 (owner: 10Hashar)
[12:32:12] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[12:33:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[12:34:09] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:34:12] <wikibugs>	 (03Merged) 10jenkins-bot: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[12:34:40] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:35:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T413525)', diff saved to https://phabricator.wikimedia.org/P86824 and previous config saved to /var/cache/conftool/dbconfig/20260108-123501-marostegui.json
[12:35:05] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[12:35:18] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
[12:35:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86825 and previous config saved to /var/cache/conftool/dbconfig/20260108-123526-marostegui.json
[12:35:46] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[12:37:30] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[12:39:00] <logmsgbot>	 !log ozge@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[12:39:09] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:40:13] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[12:40:50] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "what about https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common.yaml#2826 ?" [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[12:41:27] <logmsgbot>	 jmm@cumin2002 reimage (PID 2452309) is awaiting input
[12:42:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86826 and previous config saved to /var/cache/conftool/dbconfig/20260108-124238-marostegui.json
[12:42:42] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[12:44:53] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1224616/7861/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1224616 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[12:45:34] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] extension-list: Add Test Kitchen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming)
[12:47:38] <icinga-wm>	 PROBLEM - LDAP -writable server- on serpens is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[12:49:28] <wikibugs>	 (03Abandoned) 10Tchanders: WIP Check if adding - prevents "no change" CI failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224045 (owner: 10Tchanders)
[12:49:40] <wikibugs>	 (03CR) 10Tchanders: "Same failure" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224045 (owner: 10Tchanders)
[12:52:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P86827 and previous config saved to /var/cache/conftool/dbconfig/20260108-125247-marostegui.json
[12:53:16] <moritzm>	 !log installing libsodium security updates on bullseye
[12:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:26] <wikibugs>	 (03PS1) 10Btullis: Failover the hive server2 and metastore services to the standby [dns] - 10https://gerrit.wikimedia.org/r/1224657 (https://phabricator.wikimedia.org/T303168)
[12:59:40] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "That was https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224576 but see the comment there." [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1300)
[13:02:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P86828 and previous config saved to /var/cache/conftool/dbconfig/20260108-130255-marostegui.json
[13:04:38] <icinga-wm>	 RECOVERY - LDAP -writable server- on serpens is OK: LDAP OK - 0.101 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[13:09:42] <wikibugs>	 (03PS1) 10Slyngshede: Account linking: hide message box when linked [software/bitu] - 10https://gerrit.wikimedia.org/r/1224660
[13:10:43] <wikibugs>	 (03PS1) 10Btullis: Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977)
[13:10:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[13:10:52] <wikibugs>	 (03PS2) 10Btullis: Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977)
[13:12:51] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, known, bot ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224616 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[13:12:59] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Report progress of Wikibase entity dumps [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423) (owner: 10Silvan Heintze)
[13:13:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86831 and previous config saved to /var/cache/conftool/dbconfig/20260108-131303-marostegui.json
[13:13:07] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[13:13:08] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11503308 (10Gehel)
[13:13:20] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance
[13:13:27] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: PuppetPendingCertificateRequest (instance puppetserver1001:9100) - https://phabricator.wikimedia.org/T412789#11503322 (10Gehel)
[13:13:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2193 (T413525)', diff saved to https://phabricator.wikimedia.org/P86832 and previous config saved to /var/cache/conftool/dbconfig/20260108-131327-marostegui.json
[13:13:31] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[13:14:32] <wikibugs>	 10SRE-SLO, 10observability, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11503350 (10Gehel)
[13:14:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11503357 (10Gehel)
[13:15:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11503378 (10Gehel)
[13:15:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11503386 (10Gehel)
[13:16:06] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[13:16:40] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11503413 (10Gehel)
[13:17:22] <wikibugs>	 (03Merged) 10jenkins-bot: Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[13:17:45] <wikibugs>	 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 13Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11503431 (10Gehel)
[13:17:57] <moritzm>	 !log installing imagemagick security updates
[13:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:53] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[13:19:03] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[13:19:31] <Lucas_WMDE>	 FTR, I won’t be around during the first half of today’s UTC afternoon backport window
[13:19:37] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), and 3 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11503473 (10Gehel)
[13:19:48] <Lucas_WMDE>	 maybe someone else from the Wikidata team will be around to deploy the config change I scheduled, otherwise I’ll do it after ca. 14:30 UTC
[13:20:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T413525)', diff saved to https://phabricator.wikimedia.org/P86833 and previous config saved to /var/cache/conftool/dbconfig/20260108-132003-marostegui.json
[13:20:05] <Lucas_WMDE>	 (and hopefully someone else can deploy for Superpes3 and Msz2001)
[13:20:07] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[13:20:31] <Msz2001>	 I'm a deployer myself, I can do it
[13:21:47] <Lucas_WMDE>	 👍
[13:23:18] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance backend on cp6004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:23:28] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2231 gradually with 4 steps - Pool db2231.codfw.wmnet in after cloning
[13:23:40] <wikibugs>	 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11503552 (10Gehel) 05Open→03Resolved
[13:24:09] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:24:18] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance backend on cp6004 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:26:56] <icinga-wm>	 PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3)
[13:26:56] <icinga-wm>	 PROBLEM - Host cp7014 is DOWN: CRITICAL - Time to live exceeded (10.140.1.10)
[13:26:56] <icinga-wm>	 PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8)
[13:27:14] <icinga-wm>	 RECOVERY - Host cp7014 is UP: PING OK - Packet loss = 0%, RTA = 137.56 ms
[13:27:18] <icinga-wm>	 RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 138.13 ms
[13:27:18] <icinga-wm>	 RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 138.16 ms
[13:30:12] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P86835 and previous config saved to /var/cache/conftool/dbconfig/20260108-133011-marostegui.json
[13:31:12] <wikibugs>	 (03PS2) 10Fabfur: cache::text: enable unid, browser ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545)
[13:32:24] <wikibugs>	 (03CR) 10Joal: [C:03+1] Failover the hive server2 and metastore services to the standby [dns] - 10https://gerrit.wikimedia.org/r/1224657 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis)
[13:32:45] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[13:32:54] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+1] cache::text: enable unid, browser ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[13:33:17] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[13:37:04] <wikibugs>	 (03PS1) 10Jgreen: Remove deprecated pay-lvs records and related transitional records [dns] - 10https://gerrit.wikimedia.org/r/1224672 (https://phabricator.wikimedia.org/T398321)
[13:38:25] <wikibugs>	 (03CR) 10Jgreen: [C:03+2] Remove deprecated pay-lvs records and related transitional records [dns] - 10https://gerrit.wikimedia.org/r/1224672 (https://phabricator.wikimedia.org/T398321) (owner: 10Jgreen)
[13:39:33] <wikibugs>	 (03PS1) 10Ladsgroup: Update API call in edit.js with rvslots [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224673 (https://phabricator.wikimedia.org/T412762)
[13:39:53] <wikibugs>	 (03PS2) 10Clément Goubert: aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999)
[13:40:20] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P86837 and previous config saved to /var/cache/conftool/dbconfig/20260108-134020-marostegui.json
[13:42:01] <wikibugs>	 (03CR) 10Gehel: [C:03+1] Failover the hive server2 and metastore services to the standby [dns] - 10https://gerrit.wikimedia.org/r/1224657 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis)
[13:45:40] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[13:45:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:46:30] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1006 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Thanos
[13:46:48] <logmsgbot>	 !log jgreen@dns1004 START - running authdns-update
[13:47:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:47:55] <logmsgbot>	 !log jgreen@dns1004 END - running authdns-update
[13:50:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:29] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T413525)', diff saved to https://phabricator.wikimedia.org/P86838 and previous config saved to /var/cache/conftool/dbconfig/20260108-135028-marostegui.json
[13:50:32] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[13:50:32] <wikibugs>	 (03PS9) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[13:50:45] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance
[13:51:03] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance
[13:51:07] <wikibugs>	 (03PS2) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545)
[13:51:10] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[13:51:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2217 (T413525)', diff saved to https://phabricator.wikimedia.org/P86839 and previous config saved to /var/cache/conftool/dbconfig/20260108-135111-marostegui.json
[13:53:42] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[13:53:53] <wikibugs>	 (03CR) 10CDanis: [C:03+2] benthos: webrequest: add res_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1224178 (owner: 10CDanis)
[13:55:07] <wikibugs>	 (03PS1) 10Zabe: MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224679 (https://phabricator.wikimedia.org/T414077)
[13:55:19] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+1] cache::text: enable auth, known, bot ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[13:57:05] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' .
[13:57:14] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, known, bot ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[13:58:16] <logmsgbot>	 !log jgreen@cumin1003 START - Cookbook sre.dns.netbox
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1400).
[14:00:05] <jouncebot>	 Lucas_WMDE, Superpes, and Msz2001: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:14] <Msz2001>	 I'm ready to deploy
[14:00:23] <Msz2001>	 Superpes3: Are you around?
[14:00:47] <wikibugs>	 (03PS2) 10Fabfur: cache::text: enable unid, browser ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545)
[14:00:57] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[14:01:07] <Msz2001>	 I guess, I'll start with my patch, then
[14:01:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc)
[14:01:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:02:02] <wikibugs>	 (03PS1) 10Clément Goubert: Revert^3 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224681
[14:02:58] <wikibugs>	 (03Merged) 10jenkins-bot: Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc)
[14:03:23] <logmsgbot>	 !log jgreen@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:04:06] <logmsgbot>	 !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1224615|Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler (T413929)]]
[14:04:09] <stashbot>	 T413929: CheckUser TransactionProfiler warnings when using Special:CentralAutoLogin creates local accounts - https://phabricator.wikimedia.org/T413929
[14:05:43] <wikibugs>	 (03PS11) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[14:05:50] <wikibugs>	 (03PS1) 10Btullis: Fix the kyuubi-headless service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224682 (https://phabricator.wikimedia.org/T413977)
[14:05:59] <wikibugs>	 (03PS2) 10Btullis: Fix the kyuubi-headless service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224682 (https://phabricator.wikimedia.org/T413977)
[14:06:36] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1224615|Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler (T413929)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:06:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:07:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T413525)', diff saved to https://phabricator.wikimedia.org/P86842 and previous config saved to /var/cache/conftool/dbconfig/20260108-140705-marostegui.json
[14:07:09] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[14:07:12] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm
[14:07:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1224660 (owner: 10Slyngshede)
[14:08:37] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Fix the kyuubi-headless service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224682 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[14:08:51] <logmsgbot>	 !log mszwarc@deploy2002 Sync cancelled.
[14:08:57] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2231 gradually with 4 steps - Pool db2231.codfw.wmnet in after cloning
[14:09:11] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Revert^3 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224681 (owner: 10Clément Goubert)
[14:09:24] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224685
[14:09:25] <wikibugs>	 (03CR) 10TrainBranchBot: "mszwarc@deploy2002 created a revert of this change as I9a27e1e066cfee6b26fdca6c05408017b8810b92" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc)
[14:10:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224685 (owner: 10TrainBranchBot)
[14:10:32] <wikibugs>	 (03Merged) 10jenkins-bot: Fix the kyuubi-headless service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224682 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[14:11:15] <Superpes>	 o/
[14:13:37] <Msz2001>	 Hi, Superpes! I'm processing my patch, I can then go with yours
[14:15:11] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[14:15:20] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[14:16:07] <Superpes>	 Msz2001 Many thanks :) I just added another one, I'm on a train so I might have some internet issues, in any case I should be able to test my patches without problems!
[14:16:48] <Msz2001>	 Ack
[14:17:14] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P86844 and previous config saved to /var/cache/conftool/dbconfig/20260108-141714-marostegui.json
[14:17:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] pcc: Drop obsolete OS conditional [puppet] - 10https://gerrit.wikimedia.org/r/1126104 (owner: 10Muehlenhoff)
[14:18:18] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Failover the hive server2 and metastore services to the standby [dns] - 10https://gerrit.wikimedia.org/r/1224657 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis)
[14:18:36] <logmsgbot>	 !log btullis@dns1004 START - running authdns-update
[14:19:37] <logmsgbot>	 !log btullis@dns1004 END - running authdns-update
[14:23:25] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224685 (owner: 10TrainBranchBot)
[14:23:55] <logmsgbot>	 !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1224685|Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]]
[14:24:09] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:25:42] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2231.codfw.wmnet onto db2249.codfw.wmnet
[14:26:08] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc, trainbranchbot: Backport for [[gerrit:1224685|Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:26:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798)
[14:26:46] <logmsgbot>	 !log jgreen@cumin1003 START - Cookbook sre.dns.netbox
[14:27:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P86845 and previous config saved to /var/cache/conftool/dbconfig/20260108-142722-marostegui.json
[14:27:23] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc, trainbranchbot: Continuing with sync
[14:28:45] <wikibugs>	 (03PS1) 10Mszwarc: Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224688
[14:28:47] <Amir1>	 Msz2001: would you mind pinging me once you're done? Thanks! :D
[14:28:57] <Msz2001>	 Sure!
[14:31:12] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache::text: enable unid, browser ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[14:31:30] <logmsgbot>	 !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224685|Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]] (duration: 07m 34s)
[14:31:52] <Msz2001>	 Superpes: Ready to deploy yours
[14:32:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Rename puppetmaster::gitsync and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798)
[14:32:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223155 (https://phabricator.wikimedia.org/T413530) (owner: 10Superpes15)
[14:32:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223159 (https://phabricator.wikimedia.org/T413737) (owner: 10Superpes15)
[14:32:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224232 (https://phabricator.wikimedia.org/T413848) (owner: 10Superpes15)
[14:33:16] <moritzm>	 !log installing systemd bugfix updates from Bookworm point release
[14:33:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:44] <wikibugs>	 (03Merged) 10jenkins-bot: [enwikiquote] Enable block feature for AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223155 (https://phabricator.wikimedia.org/T413530) (owner: 10Superpes15)
[14:33:46] <wikibugs>	 (03Merged) 10jenkins-bot: [ruwiki] Disable setting a cookie for blocked anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223159 (https://phabricator.wikimedia.org/T413737) (owner: 10Superpes15)
[14:33:57] <wikibugs>	 (03PS2) 10Zabe: MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224679 (https://phabricator.wikimedia.org/T414077)
[14:34:02] <wikibugs>	 (03Merged) 10jenkins-bot: [enwikiquote] Add new autopatrolled and patroller usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224232 (https://phabricator.wikimedia.org/T413848) (owner: 10Superpes15)
[14:34:36] <logmsgbot>	 !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1223155|[enwikiquote] Enable block feature for AbuseFilter (T413530)]], [[gerrit:1223159|[ruwiki] Disable setting a cookie for blocked anonymous users (T413737)]], [[gerrit:1224232|[enwikiquote] Add new autopatrolled and patroller usergroups (T413848)]]
[14:34:41] <Superpes>	 Thanks Msz2001 :)
[14:34:43] <stashbot>	 T413530: Enable the AbuseFilter block action on the English Wikiquote - https://phabricator.wikimedia.org/T413530
[14:34:43] <stashbot>	 T413737: Disable installing a "block cookie" to a proxy-blocked anons in ruwiki - https://phabricator.wikimedia.org/T413737
[14:34:43] <stashbot>	 T413848: [enwikiquote] Create the autopatroller and patroller user groups - https://phabricator.wikimedia.org/T413848
[14:34:54] <logmsgbot>	 !log jgreen@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:34:58] <Msz2001>	 If we have time later, I'll try to redeploy my original patch. I first deployed it at 14:00 UTC, but in testing it appeared not to work, so I didn't proceed to prod and created a revert patch instead. On further analysis it turned out I was testing on the wrong version of the wiki, so the problem was on my side and not in the patch :D
[14:36:36] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[14:36:48] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc, superpes: Backport for [[gerrit:1223155|[enwikiquote] Enable block feature for AbuseFilter (T413530)]], [[gerrit:1223159|[ruwiki] Disable setting a cookie for blocked anonymous users (T413737)]], [[gerrit:1224232|[enwikiquote] Add new autopatrolled and patroller usergroups (T413848)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:37:10] <Superpes>	 Testing
[14:37:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T413525)', diff saved to https://phabricator.wikimedia.org/P86846 and previous config saved to /var/cache/conftool/dbconfig/20260108-143730-marostegui.json
[14:37:34] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[14:37:47] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance
[14:37:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2224 (T413525)', diff saved to https://phabricator.wikimedia.org/P86847 and previous config saved to /var/cache/conftool/dbconfig/20260108-143755-marostegui.json
[14:39:10] <Superpes>	 Msz2001 They all look fine :D
[14:39:15] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc, superpes: Continuing with sync
[14:39:32] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[14:40:18] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[14:41:03] <wikibugs>	 (03PS2) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545)
[14:41:45] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[14:41:52] <Lucas_WMDE>	 o/
[14:42:18] <Lucas_WMDE>	 Msz2001: you’re still deploying, right?
[14:42:29] <Msz2001>	 Yes, finishing Superpes' patches
[14:42:33] <Lucas_WMDE>	 ok thanks
[14:42:35] <Amir1>	 I called dibs
[14:42:56] <Lucas_WMDE>	 well, I scheduled my change…
[14:43:05] <Lucas_WMDE>	 Amir1: what do you want to deploy?
[14:43:13] <logmsgbot>	 !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223155|[enwikiquote] Enable block feature for AbuseFilter (T413530)]], [[gerrit:1223159|[ruwiki] Disable setting a cookie for blocked anonymous users (T413737)]], [[gerrit:1224232|[enwikiquote] Add new autopatrolled and patroller usergroups (T413848)]] (duration: 08m 36s)
[14:43:15] <Amir1>	 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1224673?usp=email
[14:43:19] <stashbot>	 T413530: Enable the AbuseFilter block action on the English Wikiquote - https://phabricator.wikimedia.org/T413530
[14:43:19] <stashbot>	 T413737: Disable installing a "block cookie" to a proxy-blocked anons in ruwiki - https://phabricator.wikimedia.org/T413737
[14:43:19] <stashbot>	 T413848: [enwikiquote] Create the autopatroller and patroller user groups - https://phabricator.wikimedia.org/T413848
[14:43:25] <Superpes>	 Msz2001 Thanks for your assistance :3
[14:43:30] <Msz2001>	 You're welcome
[14:44:16] <Msz2001>	 Lucas_WMDE: You can go
[14:44:39] <Lucas_WMDE>	 thanks
[14:44:47] <Lucas_WMDE>	 and then you can both start gate-and-submit for your backports
[14:45:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214986 (https://phabricator.wikimedia.org/T403015) (owner: 10Arthur taylor)
[14:45:19] <logmsgbot>	 !log jgreen@cumin1003 START - Cookbook sre.dns.netbox
[14:45:23] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Update API call in edit.js with rvslots [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224673 (https://phabricator.wikimedia.org/T412762) (owner: 10Ladsgroup)
[14:46:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the MEX / wbui2025 beta feature on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214986 (https://phabricator.wikimedia.org/T403015) (owner: 10Arthur taylor)
[14:46:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1214986|Enable the MEX / wbui2025 beta feature on wikidata (T403015)]]
[14:46:39] <stashbot>	 T403015: [MEX]  M3 - Release onto wikidata.org under feature flag - https://phabricator.wikimedia.org/T403015
[14:47:01] <Lucas_WMDE>	 Msz2001: want to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1224688 together with Amir1? (and +2 it now?)
[14:47:19] <Msz2001>	 I don't have +2 rights in that branch
[14:47:20] <wikibugs>	 (03PS2) 10Muehlenhoff: Rename puppetmaster::gitsync and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798)
[14:47:27] <Msz2001>	 But otherwise can deploy it together
[14:47:30] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224688 (owner: 10Mszwarc)
[14:47:32] <Lucas_WMDE>	 o_O
[14:47:48] <Lucas_WMDE>	 that sounds like a permissions mistake to me, if you can deploy then I assume you should have +2 rights
[14:47:55] <logmsgbot>	 !log jgreen@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:48:17] <Msz2001>	 I'll dig through the documentation to see what to do about it, then
[14:48:36] <Msz2001>	 But thanks for the +2 :)
[14:48:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, arthurtaylor: Backport for [[gerrit:1214986|Enable the MEX / wbui2025 beta feature on wikidata (T403015)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:49:28] <urbanecm>	 Msz2001: i can fix that...
[14:49:28] <wikibugs>	 (03PS1) 10Gehel: chore(elasticsearch): cleanup unused roles / profiles after migration to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1224691 (https://phabricator.wikimedia.org/T388607)
[14:49:36] <Lucas_WMDE>	 testing
[14:50:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] chore(elasticsearch): cleanup unused roles / profiles after migration to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1224691 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel)
[14:50:01] <urbanecm>	 Msz2001: but in theory you shouldn't need +2, as scap will apply it on your behalf if needed
[14:50:23] <Msz2001>	 Yes, and that's how I proceeded with deployments normally
[14:50:42] <urbanecm>	 Msz2001: you're `mszwarc` in the shell world, right?
[14:50:48] <Msz2001>	 Right
[14:51:20] <urbanecm>	 !log Add `mszwarc` to `wmf-deployment` on Gerrit (existing deployer, T404697)
[14:51:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:23] <stashbot>	 T404697: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697
[14:51:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, arthurtaylor: Continuing with sync
[14:51:31] <urbanecm>	 Msz2001: Lucas_WMDE: should work now!
[14:51:45] <Lucas_WMDE>	 thanks!
[14:51:49] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache::text: enable auth, known, bot ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[14:51:52] <Msz2001>	 Thanks, it works indeed
[14:52:18] <Lucas_WMDE>	 (re “but in theory” – yes, but I don’t think we want deployers who can’t “clean up” the git situation outside of scap)
[14:52:18] <urbanecm>	 Msz2001: curiously, `spiderpig-access` should have access by default, and it seems you're not there either (https://ldap.toolforge.org/group/spiderpig-access)...
[14:52:27] <urbanecm>	 ...does spiderpig work for you somehow anyway?
[14:52:46] <Msz2001>	 No, it doesn't
[14:53:16] <urbanecm>	 okay. then you probably want to request access to that LDAP group in IDM (https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access#Using_the_Wikimedia_Identity_Management_System)
[14:53:26] <Msz2001>	 ok, will do it
[14:53:28] <Msz2001>	 Thanks
[14:53:31] <urbanecm>	 np
[14:53:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T413525)', diff saved to https://phabricator.wikimedia.org/P86848 and previous config saved to /var/cache/conftool/dbconfig/20260108-145343-marostegui.json
[14:53:48] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[14:54:16] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[14:55:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214986|Enable the MEX / wbui2025 beta feature on wikidata (T403015)]] (duration: 08m 49s)
[14:55:28] <stashbot>	 T403015: [MEX]  M3 - Release onto wikidata.org under feature flag - https://phabricator.wikimedia.org/T403015
[14:55:34] <Lucas_WMDE>	 Amir1, Msz2001: over to you
[14:55:48] <Msz2001>	 Amir1: do you want to deploy or should I do it?
[14:55:49] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission pay-lvs1003.frack.eqiad.wmnet and pay-lvs1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T413986#11503958 (10Jgreen) a:05Jgreen→03None
[14:55:59] <Amir1>	 Msz2001: go for it
[14:56:02] <Msz2001>	 ok
[14:56:14] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224693
[14:56:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224688 (owner: 10Mszwarc)
[14:56:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224673 (https://phabricator.wikimedia.org/T412762) (owner: 10Ladsgroup)
[14:57:28] <wikibugs>	 (03PS1) 10Btullis: Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977)
[14:57:37] <wikibugs>	 (03PS2) 10Btullis: Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977)
[14:57:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis)
[14:58:09] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie
[14:58:46] <wikibugs>	 (03Merged) 10jenkins-bot: Update API call in edit.js with rvslots [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224673 (https://phabricator.wikimedia.org/T412762) (owner: 10Ladsgroup)
[14:59:46] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224688 (owner: 10Mszwarc)
[15:00:44] <logmsgbot>	 !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1224688|Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]], [[gerrit:1224673|Update API call in edit.js with rvslots (T412762)]]
[15:00:47] <stashbot>	 T412762: Fix edit.js to set rvslots in API calls - https://phabricator.wikimedia.org/T412762
[15:01:05] <wikibugs>	 (03PS1) 10Muehlenhoff: pontoon: Cleanup dead projects [puppet] - 10https://gerrit.wikimedia.org/r/1224696 (https://phabricator.wikimedia.org/T365798)
[15:01:23] <wikibugs>	 (03PS1) 10Gehel: chore(elasticsearch): cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607)
[15:02:51] <logmsgbot>	 !log mszwarc@deploy2002 ladsgroup, mszwarc: Backport for [[gerrit:1224688|Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]], [[gerrit:1224673|Update API call in edit.js with rvslots (T412762)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:03:20] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, known, bot ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[15:03:32] <Msz2001>	 Amir1: test please
[15:03:43] <Msz2001>	 (my patch works fine)
[15:03:45] <Amir1>	 on it
[15:03:52] <jclark-ctr>	 On site at eqiad, just noticed a lot of orange warning lights in Rack C3. Looks like a tripped breaker L3-L1, investigating right now
[15:03:52] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P86849 and previous config saved to /var/cache/conftool/dbconfig/20260108-150351-marostegui.json
[15:04:59] <wikibugs>	 (03CR) 10Elukey: "Tried with test-cookbook for wdqs1029 and got:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[15:05:36] <Amir1>	 works fine Msz2001 let's goooo
[15:05:49] <logmsgbot>	 !log mszwarc@deploy2002 ladsgroup, mszwarc: Continuing with sync
[15:08:44] <wikibugs>	 (03CR) 10CDanis: [C:03+2] turnilo: webrequest: add res_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1224181 (owner: 10CDanis)
[15:09:09] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:55] <logmsgbot>	 !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224688|Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]], [[gerrit:1224673|Update API call in edit.js with rvslots (T412762)]] (duration: 09m 11s)
[15:09:58] <stashbot>	 T412762: Fix edit.js to set rvslots in API calls - https://phabricator.wikimedia.org/T412762
[15:10:00] <Msz2001>	 Done
[15:10:44] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[15:10:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:01] <wikibugs>	 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#11504015 (10Andrew)
[15:11:13] <jclark-ctr>	 dual power is restored to all devices except kafka-main1008
[15:13:29] <wikibugs>	 (03CR) 10Federico Ceratto: "I added a more explicit log line e.g. "INFO The whole 'pc5' section will be depooled" but maybe you meant to also change how the parsercac" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[15:13:30] <jinxer-wm>	 FIRING: LibericaUnhealthyRealserverPooled: Liberica service text-httpslb_443 has 5 unhealthy realservers pooled on lvs5006:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://grafana.wikimedia.org/d/d70d14db-4a71-414d-8425-7a30d7127ca6/liberica-services?orgId=1&var-site=eqsin&var-service=text-httpslb_443&var-instance=lvs5006 - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserv
[15:13:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:14:00] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P86850 and previous config saved to /var/cache/conftool/dbconfig/20260108-151400-marostegui.json
[15:14:28] <_joe_>	 uh what's going on?
[15:15:37] <_joe_>	 !incidents
[15:15:37] <sirenbot>	 7296 (UNACKED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[15:15:37] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[15:15:37] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[15:15:38] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[15:15:38] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[15:15:38] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[15:15:38] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[15:15:39] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[15:15:39] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[15:15:40] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[15:15:51] <fabfur>	 in eqsin
[15:15:55] <fabfur>	 !ack 7296
[15:15:56] <sirenbot>	 7260 (RESOLVED)  payments2006/check_mysql
[15:16:13] <_joe_>	 yeah looks like traffic-level issues
[15:18:30] <jinxer-wm>	 RESOLVED: [2x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb_443 has 2 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[15:18:37] <wikibugs>	 (03CR) 10Elukey: "Tried to run install-console, and ran `puppet agent --test --color=false --debug`:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[15:18:39] <_joe_>	 fabfur: do you see anything in the graphs?
[15:18:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:19:15] <fabfur>	 _joe_: looking, we had a big dip in both text and upload
[15:19:47] <_joe_>	 a big dip of what?
[15:20:00] <fabfur>	 traffic to haproxy, looking at the network now
[15:20:11] <vgutierrez>	 what I'm seeing is a huge spike of new connections in text@eqsin
[15:20:16] <_joe_>	 a lot of NELs
[15:20:26] <vgutierrez>	 https://grafana.wikimedia.org/goto/YCrFXJVDR?orgId=1
[15:20:55] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[15:21:04] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86851 and previous config saved to /var/cache/conftool/dbconfig/20260108-152103-marostegui.json
[15:21:09] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[15:21:10] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[15:24:08] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T413525)', diff saved to https://phabricator.wikimedia.org/P86852 and previous config saved to /var/cache/conftool/dbconfig/20260108-152407-marostegui.json
[15:24:11] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[15:24:24] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2229.codfw.wmnet with reason: Maintenance
[15:24:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T413525)', diff saved to https://phabricator.wikimedia.org/P86853 and previous config saved to /var/cache/conftool/dbconfig/20260108-152432-marostegui.json
[15:25:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on   kafka-main1008 - https://phabricator.wikimedia.org/T414101 (10Jclark-ctr) 03NEW
[15:27:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on   kafka-main1008 - https://phabricator.wikimedia.org/T414101#11504114 (10Jclark-ctr) Removed all power cords for the affected breaker, reset it, and added the cords back individually until locating a fried PSU on kafka-main1008. dual power is restored to a...
[15:29:01] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for aghirelli - https://phabricator.wikimedia.org/T414102 (10AGhirelli-WMF) 03NEW
[15:29:17] <wikibugs>	 06SRE, 10Data Pipelines, 06Data-Engineering: Unrecognised file under /srv/deployment-charts - https://phabricator.wikimedia.org/T413433#11504126 (10Dzahn)
[15:29:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on   kafka-main1008 - https://phabricator.wikimedia.org/T414101#11504129 (10Jclark-ctr) Opened Service request 221019443
[15:29:58] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[15:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1530)
[15:30:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on kafka-main1008 - https://phabricator.wikimedia.org/T414101#11504133 (10Reedy)
[15:31:39] <wikibugs>	 (03PS2) 10Fabfur: cache::text: enable unid, browser ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545)
[15:31:41] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[15:33:19] <wikibugs>	 (03CR) 10Marostegui: "Ideally we should change both. In any case, if this cookbook will be de facto cookbook, then we probably should just make changes here and" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto)
[15:34:09] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:28] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache::text: enable unid, browser ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[15:37:32] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[15:40:13] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T413525)', diff saved to https://phabricator.wikimedia.org/P86854 and previous config saved to /var/cache/conftool/dbconfig/20260108-154013-marostegui.json
[15:40:17] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[15:41:23] <wikibugs>	 (03PS2) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545)
[15:41:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Rename puppetmaster::gitsync and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[15:41:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, we can restore this at any time" [puppet] - 10https://gerrit.wikimedia.org/r/1224696 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[15:42:34] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[15:43:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] pontoon: Cleanup dead projects [puppet] - 10https://gerrit.wikimedia.org/r/1224696 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[15:43:29] <wikibugs>	 (03CR) 10Elukey: [C:03+1] deploy: Add redioscope kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224643 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert)
[15:43:47] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert)
[15:44:22] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie
[15:44:23] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie
[15:48:04] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[15:49:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Release-Engineering-Team, 13Patch-For-Review: Add yubikey ssh key for dancy - https://phabricator.wikimedia.org/T414032#11504220 (10dancy)
[15:50:22] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P86855 and previous config saved to /var/cache/conftool/dbconfig/20260108-155021-marostegui.json
[15:50:51] <wikibugs>	 (03PS1) 10Muehlenhoff: durum: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224704 (https://phabricator.wikimedia.org/T413740)
[15:51:20] <wikibugs>	 (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[15:54:05] <wikibugs>	 (03CR) 10Jakob: [C:03+1] "LGTM, thanks!" [dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) (owner: 10Silvan Heintze)
[15:54:22] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[15:55:10] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert)
[15:55:25] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] deploy: Add redioscope kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224643 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert)
[15:55:59] <wikibugs>	 (03PS10) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[15:57:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[16:00:05] <jouncebot>	 dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1600).
[16:00:30] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P86856 and previous config saved to /var/cache/conftool/dbconfig/20260108-160029-marostegui.json
[16:00:44] <wikibugs>	 (03PS11) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[16:02:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[16:03:30] <wikibugs>	 (03PS1) 10Muehlenhoff: wikidough: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224708 (https://phabricator.wikimedia.org/T413740)
[16:03:51] <wikibugs>	 (03PS12) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[16:05:47] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "you need to split this... first set enable it in esams and in a following commit unify it" [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[16:06:21] <wikibugs>	 (03PS1) 10Muehlenhoff: hcaptcha proxy: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224709 (https://phabricator.wikimedia.org/T413740)
[16:08:40] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1224704/7864/durum1001.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/1224704 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff)
[16:08:58] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] wikidough: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224708 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff)
[16:10:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T413525)', diff saved to https://phabricator.wikimedia.org/P86857 and previous config saved to /var/cache/conftool/dbconfig/20260108-161038-marostegui.json
[16:10:42] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[16:11:48] <wikibugs>	 (03PS2) 10Gehel: chore(elasticsearch): cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607)
[16:11:48] <wikibugs>	 (03PS1) 10Gehel: chore(elasticsearch): remove references to elasticsearch for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/1224712 (https://phabricator.wikimedia.org/T388607)
[16:11:49] <wikibugs>	 (03PS1) 10Gehel: chore(elasticsearch): cloudelastic1001-1004 have been decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1224713 (https://phabricator.wikimedia.org/T388607)
[16:11:51] <wikibugs>	 (03PS1) 10Gehel: chore(elasticsearch): remove references to elasticsearch for cluster config [puppet] - 10https://gerrit.wikimedia.org/r/1224714 (https://phabricator.wikimedia.org/T388607)
[16:12:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] chore(elasticsearch): cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel)
[16:12:45] <wikibugs>	 (03PS3) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545)
[16:13:24] <wikibugs>	 06SRE, 10Data Pipelines, 06Data-Engineering: Unrecognised file under /srv/deployment-charts - https://phabricator.wikimedia.org/T413433#11504313 (10JMeybohm) 05Open→03Resolved a:03JMeybohm I've moved the file out of the way to `/root/See_T413433` in case someone lost a session.
[16:14:12] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie
[16:14:37] <wikibugs>	 (03PS4) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545)
[16:14:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for aghirelli [puppet] - 10https://gerrit.wikimedia.org/r/1224715
[16:15:41] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[16:18:40] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[16:18:49] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T413525)', diff saved to https://phabricator.wikimedia.org/P86858 and previous config saved to /var/cache/conftool/dbconfig/20260108-161848-marostegui.json
[16:18:52] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[16:21:58] <wikibugs>	 (03CR) 10Elukey: "I think we are in a good place, let's wait for some other review from ServiceOps. We could tentatively deploy this on Monday :)" [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski)
[16:23:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for aghirelli [puppet] - 10https://gerrit.wikimedia.org/r/1224715 (owner: 10Muehlenhoff)
[16:24:28] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for aghirelli - https://phabricator.wikimedia.org/T414102#11504358 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Access was granted via Wikimedia IDM.
[16:29:18] <wikibugs>	 (03CR) 10Muehlenhoff: sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[16:29:53] <wikibugs>	 (03PS1) 10C. Scott Ananian: Increase PRV percentage on fawiki/kowiki/azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108)
[16:30:30] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11504408 (10JMeybohm)   >>! In T413364#11483999, @cmooney wrote: >> Do you currently have shell access (Yes/No): Not sure - how can I check? >  > Looking at our...
[16:31:26] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[16:31:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T413525)', diff saved to https://phabricator.wikimedia.org/P86860 and previous config saved to /var/cache/conftool/dbconfig/20260108-163131-marostegui.json
[16:31:35] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[16:31:47] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108) (owner: 10C. Scott Ananian)
[16:32:00] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11504421 (10JMeybohm) @KFrancis could you please confirm NDA status?
[16:32:25] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[16:32:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224169 (https://phabricator.wikimedia.org/T414019) (owner: 10C. Scott Ananian)
[16:32:50] <wikibugs>	 (03PS1) 10AikoChou: ml-services: Update image for revise-tone-task-generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224721 (https://phabricator.wikimedia.org/T412210)
[16:34:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[16:35:01] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for aghirelli - https://phabricator.wikimedia.org/T414102#11504512 (10JMeybohm) Access to the wmf group needs to be requested [[ https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access#Using_the_Wikimedia_Identity_Management_System | Using_the_W...
[16:35:38] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "Annoyingly it'll break MediaSearch in Commons. https://codesearch.wmcloud.org/search/?q=sdmsThumbRenderMap&files=&excludeFiles=&repos= It " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon)
[16:36:42] <wikibugs>	 (03CR) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[16:37:02] <logmsgbot>	 andrew@cumin2002 reimage (PID 2547254) is awaiting input
[16:37:40] <wikibugs>	 (03CR) 10BryanDavis: "Cause of T414111 in Beta Cluster where the /usr/share/GeoIP/proxy.mmdb file does not exist." [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede)
[16:37:52] <logmsgbot>	 andrew@cumin2002 reimage (PID 2547204) is awaiting input
[16:38:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798)
[16:39:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11504536 (10Clement_Goubert) >>! In T408752#11502255, @Jclark-ctr wrote: > @Clement_Goubert Before I start racking these, do you want to verify that they’re correct by row, s...
[16:39:09] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:40:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[16:41:11] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11504543 (10JMeybohm) #release-engineering-team: Could you help with removing +2 ?
[16:41:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P86861 and previous config saved to /var/cache/conftool/dbconfig/20260108-164140-marostegui.json
[16:41:42] <wikibugs>	 (03Abandoned) 10Fabfur: cache::text: enable unid, browser ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224626 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[16:42:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11504547 (10JMeybohm) @KFrancis could you please confirm NDA status?
[16:43:53] <wikibugs>	 (03CR) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[16:44:09] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:44:30] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11504555 (10JMeybohm)
[16:45:06] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:45:58] <wikibugs>	 (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545)
[16:49:09] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:49:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:50:00] <wikibugs>	 (03PS1) 10Fabfur: cache::text: cleanup rate_limiting_flags [puppet] - 10https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545)
[16:51:30] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[16:51:46] <fabfur>	 !incidents
[16:51:47] <sirenbot>	 7297 (UNACKED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[16:51:47] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[16:51:47] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[16:51:48] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[16:51:48] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[16:51:48] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[16:51:48] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[16:51:49] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P86862 and previous config saved to /var/cache/conftool/dbconfig/20260108-165148-marostegui.json
[16:51:49] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[16:51:49] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[16:51:50] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[16:51:50] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[16:51:57] <fabfur>	 !ack 7297
[16:51:58] <sirenbot>	 7260 (RESOLVED)  payments2006/check_mysql
[16:52:22] <fabfur>	 !incidents
[16:52:23] <sirenbot>	 7297 (ACKED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[16:52:23] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[16:52:23] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[16:52:23] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[16:52:24] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[16:52:24] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[16:52:24] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[16:52:24] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[16:52:25] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[16:52:25] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[16:52:26] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[16:54:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:55:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86863 and previous config saved to /var/cache/conftool/dbconfig/20260108-165501-marostegui.json
[16:55:06] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[16:55:07] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[16:55:40] <_joe_>	 !ack 
[16:55:40] <sirenbot>	 no value provided for parameter incident and no default available
[16:55:41] <sirenbot>	 Incident id must be an integer
[16:55:54] <_joe_>	 uhm rzl ^^ not working apparently
[16:55:58] <_joe_>	 !incidents
[16:55:58] <sirenbot>	 7297 (ACKED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[16:55:59] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[16:55:59] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[16:55:59] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[16:55:59] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[16:55:59] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[16:56:00] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[16:56:00] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[16:56:00] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[16:56:01] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[16:56:01] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[16:57:25] <claime>	 _joe_: I think that's what rzl's not in corto from last night refers to
[16:57:31] <claime>	 s/not/note/
[16:59:09] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:00:05] <jouncebot>	 jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:06] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:01:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T413525)', diff saved to https://phabricator.wikimedia.org/P86864 and previous config saved to /var/cache/conftool/dbconfig/20260108-170156-marostegui.json
[17:02:00] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[17:02:13] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[17:02:17] <fabfur>	 !incidents
[17:02:18] <sirenbot>	 7297 (ACKED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[17:02:18] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[17:02:18] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[17:02:18] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[17:02:18] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[17:02:19] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[17:02:19] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[17:02:19] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[17:02:19] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[17:02:20] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[17:02:20] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[17:02:33] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:02:41] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T413525)', diff saved to https://phabricator.wikimedia.org/P86865 and previous config saved to /var/cache/conftool/dbconfig/20260108-170241-marostegui.json
[17:05:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P86866 and previous config saved to /var/cache/conftool/dbconfig/20260108-170509-marostegui.json
[17:10:49] <rzl>	 _joe_: yeah sorry, two output issues -- one is because I kept a pointer to the loop variable (:facepalm:) and the other is there should be a better error message when everything is already acked
[17:10:59] <rzl>	 looking at both this morning
[17:11:03] <_joe_>	 <3
[17:12:42] <rzl>	 what's not obvious until you look at the timeline is that the new page at 16:54 was another alert that went into 7297, so it was already acked -- which is why that's not an uncommon situation
[17:15:06] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:15:06] <wikibugs>	 (03PS1) 10Bking: DO NOT MERGE: test blackbox integration for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224738
[17:15:15] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking)
[17:15:18] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T413525)', diff saved to https://phabricator.wikimedia.org/P86868 and previous config saved to /var/cache/conftool/dbconfig/20260108-171517-marostegui.json
[17:15:18] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P86867 and previous config saved to /var/cache/conftool/dbconfig/20260108-171517-marostegui.json
[17:15:22] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[17:17:47] <fabfur>	 !incidents
[17:17:48] <sirenbot>	 7297 (ACKED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[17:17:48] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[17:17:48] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[17:17:48] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[17:17:48] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[17:17:49] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[17:17:49] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[17:17:49] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[17:17:50] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[17:17:50] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[17:17:51] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[17:20:06] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:20:19] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] git::clone: Get default branch name a different way [puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) (owner: 10Ahmon Dancy)
[17:21:20] <wikibugs>	 (03PS2) 10Clément Goubert: wmnet: Add redioscope CNAMES [dns] - 10https://gerrit.wikimedia.org/r/1224652 (https://phabricator.wikimedia.org/T413999)
[17:21:29] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[17:21:45] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[17:24:09] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:24:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[17:25:06] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:25:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P86869 and previous config saved to /var/cache/conftool/dbconfig/20260108-172526-marostegui.json
[17:25:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86870 and previous config saved to /var/cache/conftool/dbconfig/20260108-172526-marostegui.json
[17:25:33] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[17:25:33] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[17:25:43] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[17:25:51] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86871 and previous config saved to /var/cache/conftool/dbconfig/20260108-172551-marostegui.json
[17:26:26] <ragesoss>	 hey folks! I have a somewhat urgent request. Wiki Education Dashboard is suddenly getting 429 errors for OAuth login. I created an issue for it here: https://phabricator.wikimedia.org/T414114
[17:26:57] <ragesoss>	 can we get rate-limits lifted for those IPs?
[17:27:34] <ragesoss>	 @fabfur @_joe_ ^
[17:27:56] <sukhe>	 ragesoss: we are responding to an incident right now, but we will take a look shortly
[17:27:59] <sukhe>	 thanks
[17:28:18] <ragesoss>	 thanks!
[17:29:35] <Amir1>	 jouncebot: nowandnext
[17:29:35] <jouncebot>	 For the next 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1700)
[17:29:35] <jouncebot>	 In 0 hour(s) and 30 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1800)
[17:29:35] <jouncebot>	 In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1800)
[17:29:59] <wikibugs>	 (03PS2) 10MVernon: Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062)
[17:31:09] <jynus>	 !log restarted wmf_auto_restart_prometheus-mysqld-exporter.service @ db2231
[17:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon)
[17:31:15] <wikibugs>	 (03PS3) 10Ladsgroup: Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon)
[17:31:19] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon)
[17:31:28] <wikibugs>	 (03CR) 10MVernon: "ACK, here's that change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon)
[17:31:43] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup2003.codfw.wmnet with OS trixie
[17:32:15] <wikibugs>	 (03Merged) 10jenkins-bot: Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon)
[17:32:33] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie
[17:32:42] <wikibugs>	 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11504733 (10ssingh)
[17:32:59] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1224071|Only generate 120,250 thumbs (temporary) (T408062 T412971)]]
[17:33:03] <stashbot>	 T408062: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062
[17:33:03] <stashbot>	 T412971: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971
[17:34:09] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:35:35] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P86872 and previous config saved to /var/cache/conftool/dbconfig/20260108-173534-marostegui.json
[17:35:41] <logmsgbot>	 !log ladsgroup@deploy2002 mvernon, ladsgroup: Backport for [[gerrit:1224071|Only generate 120,250 thumbs (temporary) (T408062 T412971)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:37:42] <Tchanders>	 Is anyone from SRE around for the puppet request window? I have a simple patch but didn't get round to adding it before the window
[17:38:39] <fabfur>	 Tchanders: we're dealing with an incident rn, I'd postpone this unless it is super urgent
[17:38:49] <Tchanders>	 np - thanks
[17:39:05] <jynus>	 Tchanders: I am about to leave for the day, but if you add me as reviewer I can have a look at it tomorrow
[17:39:34] <jynus>	 Tchanders: Jcrespo
[17:39:36] <Tchanders>	 jynus: Thank you - done
[17:39:59] <jynus>	 (assuming it is a trivial generic SRE one, if not I will direct you to the expert)
[17:41:46] <jynus>	 yeah, that's something that I will be able to take care of
[17:42:17] <jynus>	 let me know in a comment if it is something that can be deployed any time or you want to be around
[17:44:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[17:45:43] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T413525)', diff saved to https://phabricator.wikimedia.org/P86873 and previous config saved to /var/cache/conftool/dbconfig/20260108-174542-marostegui.json
[17:45:46] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[17:45:59] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[17:46:07] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T413525)', diff saved to https://phabricator.wikimedia.org/P86874 and previous config saved to /var/cache/conftool/dbconfig/20260108-174606-marostegui.json
[17:46:16] <logmsgbot>	 !log ladsgroup@deploy2002 mvernon, ladsgroup: Continuing with sync
[17:49:50] <fabfur>	 !incidents
[17:49:51] <sirenbot>	 7297 (ACKED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[17:49:51] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[17:49:51] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[17:49:51] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[17:49:52] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[17:49:52] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[17:49:52] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[17:49:52] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[17:49:53] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[17:49:53] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[17:49:54] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[17:50:32] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224071|Only generate 120,250 thumbs (temporary) (T408062 T412971)]] (duration: 17m 34s)
[17:50:37] <stashbot>	 T408062: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062
[17:50:37] <stashbot>	 T412971: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971
[17:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:51:08] <wikibugs>	 06SRE: add avishua stein to acl*procurement-review - https://phabricator.wikimedia.org/T414115#11504796 (10Zabe)
[17:54:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[17:56:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T413525)', diff saved to https://phabricator.wikimedia.org/P86875 and previous config saved to /var/cache/conftool/dbconfig/20260108-175602-marostegui.json
[17:56:06] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[18:00:05] <jouncebot>	 bd808: May I have your attention please! Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1800)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1800)
[18:00:35] <bd808>	 o/ I will be updating developer-portal today.
[18:02:26] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump to 2025-12-29-122831-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224764
[18:04:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[18:06:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission pay-lvs1003.frack.eqiad.wmnet and pay-lvs1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T413986#11504837 (10Jclark-ctr) a:03Jclark-ctr
[18:06:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P86876 and previous config saved to /var/cache/conftool/dbconfig/20260108-180611-marostegui.json
[18:06:12] <wikibugs>	 (03CR) 10BryanDavis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224764 (owner: 10BryanDavis)
[18:08:46] <wikibugs>	 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11504853 (10Joe) @Ragesoss what is the User-Agent you use when making those requests?
[18:08:58] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-12-29-122831-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224764 (owner: 10BryanDavis)
[18:09:11] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[18:10:47] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-12-29-122831-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224764 (owner: 10BryanDavis)
[18:11:30] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:11:53] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:12:02] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:12:39] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:14:27] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:15:03] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:15:42] <_joe_>	 ragesoss: replied on-task
[18:15:44] <wikibugs>	 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11504886 (10Joe) @Ragesoss as far as I can tell, the problem is you are not honoring the wikimedia User-Agent policy, and we have recently started to enforce stricter rat...
[18:16:14] <wikibugs>	 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11504890 (10ssingh) Hi @Ragesoss: We looked through the logs and it seems like requests originating from your end are not respecting our UA policy, documented at https://...
[18:16:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P86877 and previous config saved to /var/cache/conftool/dbconfig/20260108-181619-marostegui.json
[18:16:21] <bd808>	 I'm done with my deploy window now.
[18:22:18] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Reviewing only from the point of view of the commit and the changed hiera, since I don't have the full context :)" [puppet] - 10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur)
[18:24:06] <logmsgbot>	 andrew@cumin2002 reimage (PID 2602474) is awaiting input
[18:24:09] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:25:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[18:25:58] <wikibugs>	 (CR) Ssingh: [C:+1] "(Basing this off the previous commit, with the same caveat.)" [puppet] - https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) (owner: Fabfur)
[18:26:08] <icinga-wm>	 PROBLEM - Host cp7014 is DOWN: CRITICAL - Time to live exceeded (10.140.1.10)
[18:26:14] <sukhe>	 wow ok, that's new
[18:26:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T413525)', diff saved to https://phabricator.wikimedia.org/P86878 and previous config saved to /var/cache/conftool/dbconfig/20260108-182627-marostegui.json
[18:26:29] <_joe_>	 sigh
[18:26:30] <icinga-wm>	 RECOVERY - Host cp7014 is UP: PING OK - Packet loss = 0%, RTA = 137.56 ms
[18:26:31] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[18:26:33] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[18:26:41] <sukhe>	 no it's not actually, it's a monitoring thing
[18:26:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T413525)', diff saved to https://phabricator.wikimedia.org/P86879 and previous config saved to /var/cache/conftool/dbconfig/20260108-182641-marostegui.json
[18:27:22] <_joe_>	 !ack
[18:27:23] <sirenbot>	 7260 (RESOLVED)  payments2006/check_mysql
[18:27:29] <_joe_>	 uhh
[18:27:31] <sukhe>	 !incidents
[18:27:32] <sirenbot>	 7298 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[18:27:32] <sirenbot>	 7297 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[18:27:32] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[18:27:32] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[18:27:32] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[18:27:33] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[18:27:33] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[18:27:33] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[18:27:34] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[18:27:34] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[18:27:35] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[18:27:35] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[18:27:40] <sukhe>	 already ACKed
[18:27:44] <_joe_>	 ah already acked
[18:33:49] <wikibugs>	 (CR) Fabfur: [C:+2] cache::text: enable unid, browser ratelimiting (esams) [puppet] - https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: Fabfur)
[18:34:22] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2042 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:26] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6006 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:26] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6008 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:26] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6004 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:26] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6002 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:26] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6007 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:26] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6005 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:28] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp7012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:28] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp7009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:28] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp7013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:28] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1105 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:28] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1101 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:30] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp7011 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:32] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp7010 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:32] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp7016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:32] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp7015 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:32] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp7014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:35] <sukhe>	 oh boy
[18:34:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5029 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5032 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5028 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5025 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1113 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1115 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:40] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1107 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:41] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1111 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:41] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1103 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:42] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp1109 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:42] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5027 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:43] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5026 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:43] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5031 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:44] <sukhe>	 I am guessing this is the classic reload race condition at play
[18:34:44] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp5030 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:44] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2028 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:45] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2030 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:45] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2034 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:46] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2032 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:48] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2036 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:50] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2038 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:50] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4051 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:50] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4047 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:52] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4052 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:52] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4048 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:52] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4046 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:52] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4045 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:54] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3076 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:54] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3078 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:54] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3081 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:54] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3075 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:54] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3074 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:54] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3077 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:54] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4050 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:55] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp4049 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:55] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3080 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:56] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp3079 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:56] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6001 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:57] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp6003 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:34:57] <icinga-wm>	 PROBLEM - Confd vcl based reload on cp2040 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:36:16] <fabfur>	 I've just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224725 but don't think it's the cause as it really just happened 
[18:36:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T413525)', diff saved to https://phabricator.wikimedia.org/P86880 and previous config saved to /var/cache/conftool/dbconfig/20260108-183637-marostegui.json
[18:36:41] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[18:40:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[18:41:22] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2042 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:26] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6008 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:26] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6004 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:26] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6002 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:26] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6006 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:26] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6007 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:26] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6005 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:28] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp7012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:28] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp7009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:28] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp7013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:28] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1105 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:28] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1101 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:30] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp7011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:32] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp7016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:32] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp7010 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:32] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp7015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:32] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp7014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:40] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5025 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:40] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5028 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:40] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5032 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:40] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5029 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:40] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1113 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:40] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1115 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:40] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1111 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:41] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1103 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:41] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1107 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:42] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5026 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:42] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp1109 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:43] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5031 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:43] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5030 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:44] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp5027 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:44] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2030 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:45] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2028 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:45] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2034 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:46] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2032 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:48] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2036 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:50] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2038 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:50] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4047 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:50] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4051 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:52] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4052 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:54] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4045 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:54] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4046 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:54] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4048 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:54] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3074 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:54] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3081 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:54] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3078 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:54] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3077 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:55] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3076 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:55] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3075 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:56] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4049 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:56] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4050 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:57] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3080 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:57] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp3079 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:58] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6001 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:58] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp6003 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:41:59] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp2040 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[18:45:14] <wikibugs>	 (CR) Fabfur: [C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: Fabfur)
[18:46:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P86881 and previous config saved to /var/cache/conftool/dbconfig/20260108-184645-marostegui.json
[18:46:59] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: Fabfur)
[18:47:58] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) (owner: Fabfur)
[18:53:42] <wikibugs>	 (CR) Fabfur: [C:+2] cache::text: cleanup rate_limiting_flags [puppet] - https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) (owner: Fabfur)
[18:56:54] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P86882 and previous config saved to /var/cache/conftool/dbconfig/20260108-185654-marostegui.json
[18:58:49] <wikibugs>	 SRE, Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11505043 (Ragesoss) Thanks! Unfortunately, the OAuth library we use doesn't support setting the User Agent, so I'm going to have to figure out how to monkey patch it. :-(
[19:00:04] <jouncebot>	 dduvall and dancy: That opportune time for a MediaWiki train - Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1900).
[19:03:41] <wikibugs>	 (CR) Scott French: [C:+1] haproxy: proxy mmdb: all 🌍 [puppet] - https://gerrit.wikimedia.org/r/1224168 (owner: CDanis)
[19:06:17] <dduvall>	 o/ just zeroing my brain on the current error logs and then rolling train
[19:07:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T413525)', diff saved to https://phabricator.wikimedia.org/P86883 and previous config saved to /var/cache/conftool/dbconfig/20260108-190702-marostegui.json
[19:07:06] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:07:19] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[19:07:27] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86884 and previous config saved to /var/cache/conftool/dbconfig/20260108-190727-marostegui.json
[19:07:52] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.46.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1224777 (https://phabricator.wikimedia.org/T408280)
[19:07:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1224777 (https://phabricator.wikimedia.org/T408280) (owner: TrainBranchBot)
[19:08:42] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.46.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1224777 (https://phabricator.wikimedia.org/T408280) (owner: TrainBranchBot)
[19:16:25] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86885 and previous config saved to /var/cache/conftool/dbconfig/20260108-191624-marostegui.json
[19:16:28] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:20:09] <wikibugs>	 (PS1) Ebenezer Rao: fixed typo of word initial in the WmfConfig.php file [mediawiki-config] - https://gerrit.wikimedia.org/r/1224781 (https://phabricator.wikimedia.org/T201491)
[19:24:37] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.10  refs T408280
[19:24:41] <stashbot>	 T408280: 1.46.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T408280
[19:26:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P86886 and previous config saved to /var/cache/conftool/dbconfig/20260108-192633-marostegui.json
[19:28:33] <wikibugs>	 (CR) Zabe: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1224781 (https://phabricator.wikimedia.org/T201491) (owner: Ebenezer Rao)
[19:30:33] <wikibugs>	 (PS1) Ebenezer Rao: fixed typo of the word initial in the srllogin file [puppet] - https://gerrit.wikimedia.org/r/1224784 (https://phabricator.wikimedia.org/T201491)
[19:33:17] <wikibugs>	 (CR) CDanis: [C:+2] haproxy: proxy mmdb: all 🌍 [puppet] - https://gerrit.wikimedia.org/r/1224168 (owner: CDanis)
[19:34:38] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:36:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P86887 and previous config saved to /var/cache/conftool/dbconfig/20260108-193641-marostegui.json
[19:37:07] <dduvall>	 cccccbbneubnidfrkfhfdlejlcrunfbjfhbldtdrbbbl
[19:38:04] * dduvall is good at yubikey
[19:39:00] <zabe>	 dduvall: may I deploy the fix for https://phabricator.wikimedia.org/T414077?
[19:39:03] <wikibugs>	 (PS1) Ebenezer Rao: fixed typo of the word initial in swiftcleanermanager [software] - https://gerrit.wikimedia.org/r/1224788 (https://phabricator.wikimedia.org/T201491)
[19:39:30] <dduvall>	 zabe: yes, please do. train looks ok
[19:40:11] <wikibugs>	 (CR) Zabe: [C:+2] MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array [core] (wmf/1.46.0-wmf.10) - https://gerrit.wikimedia.org/r/1224679 (https://phabricator.wikimedia.org/T414077) (owner: Zabe)
[19:40:15] <dduvall>	 i missed that as a new blocker. sorry about that
[19:40:22] <zabe>	 Alright, no worries
[19:40:36] <zabe>	 It's commons which is mostly affected by this anyway
[19:40:41] <dduvall>	 right
[19:41:11] <wikibugs>	 (CR) Zabe: [C:+2] fixed typo of word initial in the WmfConfig.php file [mediawiki-config] - https://gerrit.wikimedia.org/r/1224781 (https://phabricator.wikimedia.org/T201491) (owner: Ebenezer Rao)
[19:41:13] <wikibugs>	 (CR) Zabe: [C:+2] Enable phan on more php files [mediawiki-config] - https://gerrit.wikimedia.org/r/1223289 (owner: Zabe)
[19:41:57] <wikibugs>	 (CR) Pppery: "This is still spelled wrong." [software] - https://gerrit.wikimedia.org/r/1224788 (https://phabricator.wikimedia.org/T201491) (owner: Ebenezer Rao)
[19:41:58] <wikibugs>	 (Merged) jenkins-bot: fixed typo of word initial in the WmfConfig.php file [mediawiki-config] - https://gerrit.wikimedia.org/r/1224781 (https://phabricator.wikimedia.org/T201491) (owner: Ebenezer Rao)
[19:42:00] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup2003.codfw.wmnet with OS trixie
[19:42:04] <wikibugs>	 (Merged) jenkins-bot: Enable phan on more php files [mediawiki-config] - https://gerrit.wikimedia.org/r/1223289 (owner: Zabe)
[19:42:04] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1003.eqiad.wmnet with OS trixie
[19:42:47] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1224781|fixed typo of word initial in the WmfConfig.php file (T201491)]], [[gerrit:1223289|Enable phan on more php files]]
[19:42:50] <stashbot>	 T201491: Fix common typos in code - https://phabricator.wikimedia.org/T201491
[19:44:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1223283 (https://phabricator.wikimedia.org/T413108) (owner: Arlolra)
[19:44:45] <logmsgbot>	 !log zabe@deploy2002 zabe, ebenezerrao: Backport for [[gerrit:1224781|fixed typo of word initial in the WmfConfig.php file (T201491)]], [[gerrit:1223289|Enable phan on more php files]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:46:50] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86888 and previous config saved to /var/cache/conftool/dbconfig/20260108-194649-marostegui.json
[19:46:51] <logmsgbot>	 !log zabe@deploy2002 zabe, ebenezerrao: Continuing with sync
[19:46:53] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:47:06] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[19:47:23] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[19:47:32] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1230 (T413525)', diff saved to https://phabricator.wikimedia.org/P86889 and previous config saved to /var/cache/conftool/dbconfig/20260108-194731-marostegui.json
[19:50:56] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224781|fixed typo of word initial in the WmfConfig.php file (T201491)]], [[gerrit:1223289|Enable phan on more php files]] (duration: 08m 09s)
[19:51:00] <stashbot>	 T201491: Fix common typos in code - https://phabricator.wikimedia.org/T201491
[19:52:45] <wikibugs>	 (03Merged) 10jenkins-bot: MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224679 (https://phabricator.wikimedia.org/T414077) (owner: 10Zabe)
[19:53:12] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1224679|MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array (T414077)]]
[19:53:15] <stashbot>	 T414077: WAV files being uploaded with wrong MIME type - https://phabricator.wikimedia.org/T414077
[19:55:07] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1224679|MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array (T414077)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:55:57] <wikibugs>	 (03PS1) 10Tbodt: Add MultiTitle to extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224793 (https://phabricator.wikimedia.org/T404461)
[19:55:59] <wikibugs>	 (03PS1) 10Tbodt: Add config variable for MultiTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224794
[19:55:59] <wikibugs>	 (03PS1) 10Tbodt: Enable MultiTitle on beta cluster testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224795 (https://phabricator.wikimedia.org/T404461)
[19:56:01] <wikibugs>	 (03PS1) 10Tbodt: Load MultiTitle on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224796 (https://phabricator.wikimedia.org/T404461)
[19:56:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T413525)', diff saved to https://phabricator.wikimedia.org/P86891 and previous config saved to /var/cache/conftool/dbconfig/20260108-195627-marostegui.json
[19:56:31] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[19:56:45] <wikibugs>	 (03PS2) 10Tbodt: Add config variable for MultiTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224794 (https://phabricator.wikimedia.org/T404461)
[19:56:47] <wikibugs>	 (03PS2) 10Tbodt: Enable MultiTitle on beta cluster testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224795 (https://phabricator.wikimedia.org/T404461)
[19:56:47] <wikibugs>	 (03PS2) 10Tbodt: Load MultiTitle on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224796 (https://phabricator.wikimedia.org/T404461)
[20:04:41] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[20:06:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P86892 and previous config saved to /var/cache/conftool/dbconfig/20260108-200635-marostegui.json
[20:08:43] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224679|MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array (T414077)]] (duration: 15m 31s)
[20:08:46] <stashbot>	 T414077: WAV files being uploaded with wrong MIME type - https://phabricator.wikimedia.org/T414077
[20:08:57] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11505355 (10KFrancis) Hi @JMeybohm, it doesn't look like we have an NDA on file for Martyn Ranyard.  Would you please provide their email address?
[20:09:14] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudbackups: update partman recipes. [puppet] - 10https://gerrit.wikimedia.org/r/1224798 (https://phabricator.wikimedia.org/T375217)
[20:11:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudbackups: update partman recipes. [puppet] - 10https://gerrit.wikimedia.org/r/1224798 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott)
[20:14:38] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:15:58] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:16:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P86893 and previous config saved to /var/cache/conftool/dbconfig/20260108-201643-marostegui.json
[20:16:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:17:12] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie
[20:17:16] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie
[20:17:38] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:18:42] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:19:58] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad  Pdu  tripped breaker on ps1-c3-eqiad no automated allerts - https://phabricator.wikimedia.org/T414134#11505404 (10Jclark-ctr)
[20:21:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11505413 (10jhathaway) >>! In T367399#11502051, @hashar wrote: > Someth...
[20:24:59] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup2003.codfw.wmnet with OS trixie
[20:25:03] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup1003.eqiad.wmnet with OS trixie
[20:26:32] <zabe>	 !log zabe@deploy2002:~$ mwscript refreshImageMetadata.php commonswiki --mediatype AUDIO --mime unknown/wav --force # T414077
[20:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:36] <stashbot>	 T414077: WAV files being uploaded with wrong MIME type - https://phabricator.wikimedia.org/T414077
[20:26:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T413525)', diff saved to https://phabricator.wikimedia.org/P86894 and previous config saved to /var/cache/conftool/dbconfig/20260108-202652-marostegui.json
[20:26:55] <stashbot>	 T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[20:27:09] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[20:27:11] <wikibugs>	 (03PS1) 10Pppery: Igwiki: add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178)
[20:27:26] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[20:27:54] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie
[20:27:55] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie
[20:28:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Igwiki: add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) (owner: 10Pppery)
[20:30:37] <wikibugs>	 (03PS1) 10CDanis: ip_reputation: tweak interval [puppet] - 10https://gerrit.wikimedia.org/r/1224800
[20:31:01] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224800 (owner: 10CDanis)
[20:32:12] <wikibugs>	 (03PS1) 10Ebenezer Rao: fixed typo of the word initial in the test_init.py [software/cumin] - 10https://gerrit.wikimedia.org/r/1224801 (https://phabricator.wikimedia.org/T201491)
[20:33:10] <zabe>	 !log zabe@deploy2002:~$ foreachwiki refreshImageMetadata.php --mediatype AUDIO --mime unknown/wav --force # T414077
[20:33:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:13] <stashbot>	 T414077: WAV files being uploaded with wrong MIME type - https://phabricator.wikimedia.org/T414077
[20:34:41] <wikibugs>	 (03CR) 10Scott French: [C:03+1] ip_reputation: tweak interval [puppet] - 10https://gerrit.wikimedia.org/r/1224800 (owner: 10CDanis)
[20:35:02] <zabe>	 !log zabe@deploy2002:~$ foreachwiki refreshImageMetadata.php --mediatype AUDIO --mime unknown/wav --force --oldimage # T414077
[20:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:10] <wikibugs>	 (03CR) 10CDanis: [C:03+2] ip_reputation: tweak interval [puppet] - 10https://gerrit.wikimedia.org/r/1224800 (owner: 10CDanis)
[20:35:38] <wikibugs>	 (03PS2) 10Pppery: Igwiki: add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178)
[20:42:12] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage
[20:43:31] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] puppet: Remove the force_puppet7 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[20:46:37] <wikibugs>	 (03PS1) 10Ebenezer Rao: fixed typo of the word initial in the zerrors_windows.go file [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1224802 (https://phabricator.wikimedia.org/T201491)
[20:47:05] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudbackup partman: second attempt [puppet] - 10https://gerrit.wikimedia.org/r/1224803
[20:48:36] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage
[20:48:48] <wikibugs>	 (03PS1) 10RLazarus: httpbb: Update the expected error message for auth.wm.o page views [puppet] - 10https://gerrit.wikimedia.org/r/1224804 (https://phabricator.wikimedia.org/T414132)
[20:49:36] <wikibugs>	 (03CR) 10RLazarus: "Fails without, passes with:" [puppet] - 10https://gerrit.wikimedia.org/r/1224804 (https://phabricator.wikimedia.org/T414132) (owner: 10RLazarus)
[20:49:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudbackup partman: second attempt [puppet] - 10https://gerrit.wikimedia.org/r/1224803 (owner: 10Andrew Bogott)
[20:50:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) (owner: 10Pppery)
[20:51:40] <wikibugs>	 (03CR) 10MVernon: [C:03+1] httpbb: Update the expected error message for auth.wm.o page views [puppet] - 10https://gerrit.wikimedia.org/r/1224804 (https://phabricator.wikimedia.org/T414132) (owner: 10RLazarus)
[20:52:03] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] httpbb: Update the expected error message for auth.wm.o page views [puppet] - 10https://gerrit.wikimedia.org/r/1224804 (https://phabricator.wikimedia.org/T414132) (owner: 10RLazarus)
[20:52:26] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1003.eqiad.wmnet with OS trixie
[20:52:26] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup2003.codfw.wmnet with OS trixie
[20:53:39] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie
[21:00:04] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup1003.eqiad.wmnet with OS trixie
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T2100).
[21:00:05] <jouncebot>	 sbassett, JSherman, arlolra, and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:08] <Pppery>	 Here
[21:00:12] <JSherman>	 here
[21:00:16] <sbassett>	 o/
[21:00:26] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie
[21:00:26] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie
[21:01:23] <cjming>	 o/
[21:01:40] <cjming>	 lmk if anyone needs a deployer - otherwise please self-deploy at will
[21:02:08] <Pppery>	 I'm not a deployer
[21:02:11] <sbassett>	 ok.  my config change might blow up logstash.  not sure.
[21:02:19] * sbassett can also deploy for anyone
[21:02:32] <rzl>	 just fixed an issue that would have caused scap to give you a spurious httpbb test failure, but please ping me if you see anything unexpected and httpbb-flavored :)
[21:02:55] <sbassett>	 tx rzl
[21:03:15] <sbassett>	 I’m ready to deploy my cfg change if there are no objections...
[21:03:21] <rzl>	 (unless it's actually an httpbb failure caused by your change, in which case, I guess do the usual thing about that)
[21:03:27] <wikibugs>	 (03CR) 10SBassett: [C:03+1] Set CSP Report Only mode for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224144 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett)
[21:04:53] <cjming>	 sbassett: thanks - all you
[21:05:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224144 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett)
[21:05:55] <wikibugs>	 (03Merged) 10jenkins-bot: Set CSP Report Only mode for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224144 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett)
[21:06:15] <logmsgbot>	 !log sbassett@deploy2002 Started scap sync-world: Backport for [[gerrit:1224144|Set CSP Report Only mode for all wikis (T291867)]]
[21:07:47] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad  Pdu  tripped breaker on ps1-c3-eqiad no automated allerts - https://phabricator.wikimedia.org/T414134#11505510 (10Reedy)
[21:07:56] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad Pdu tripped breaker on ps1-c3-eqiad no automated alerts - https://phabricator.wikimedia.org/T414134#11505511 (10Reedy)
[21:08:23] <logmsgbot>	 !log sbassett@deploy2002 sbassett: Backport for [[gerrit:1224144|Set CSP Report Only mode for all wikis (T291867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:08:42] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:08:42] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti1024 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[21:09:42] <icinga-wm>	 RECOVERY - ganeti-noded running on ganeti1024 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[21:10:32] <logmsgbot>	 !log sbassett@deploy2002 sbassett: Continuing with sync
[21:12:18] <wikibugs>	 (03PS1) 10Kgraessle: When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200)
[21:14:36] <logmsgbot>	 !log sbassett@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224144|Set CSP Report Only mode for all wikis (T291867)]] (duration: 08m 20s)
[21:14:38] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:15:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:16:46] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:17:00] <sbassett>	 Done with my cfg patch.  That… is definitely introducing a lot more traffic to logstash but is maybe ok for now.
[21:17:31] <wikibugs>	 (03PS1) 10Kgraessle: When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200)
[21:17:38] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:17:52] <JSherman>	 sbassett: am I good to proceed?
[21:18:10] <sbassett>	 yes
[21:18:18] <JSherman>	 thanks!
[21:18:24] <arlolra>	 JSherman: Ping me when you're done, I'll go next
[21:18:37] <JSherman>	 arlolra: wilco
[21:18:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217786 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman)
[21:19:51] <wikibugs>	 (03Merged) 10jenkins-bot: extension-list: Add PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217786 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman)
[21:20:10] <logmsgbot>	 !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1217786|extension-list: Add PersonalDashboard (T412528)]]
[21:20:13] <stashbot>	 T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528
[21:21:19] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup1003.eqiad.wmnet with OS trixie
[21:24:09] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:25:12] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup2003.codfw.wmnet with OS trixie
[21:25:27] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2003']
[21:27:32] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2003']
[21:27:58] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudbackup2003']
[21:28:08] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup1003']
[21:34:38] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:35:11] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudbackup2003']
[21:41:17] <JSherman>	 made it through the image registry push; I was starting to get antsy
[21:42:34] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie
[21:44:08] <logmsgbot>	 !log jsn@deploy2002 jsn: Backport for [[gerrit:1217786|extension-list: Add PersonalDashboard (T412528)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:44:11] <stashbot>	 T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528
[21:45:16] <logmsgbot>	 !log jsn@deploy2002 jsn: Continuing with sync
[21:49:49] <wikibugs>	 (03PS1) 10Eevans: WIP hoard chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224817 (https://phabricator.wikimedia.org/T414112)
[21:50:40] <zabe>	 I guess adding something to extension-list causes a full i18n rebuild; syncing that takes quite a bit
[21:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:51:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP hoard chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224817 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[21:51:28] <JSherman>	 ugh, I'm sorry, I should have gone last then
[21:57:50] <logmsgbot>	 !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217786|extension-list: Add PersonalDashboard (T412528)]] (duration: 37m 41s)
[21:57:54] <stashbot>	 T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528
[21:58:09] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage
[21:58:12] <JSherman>	 arlolra: done
[21:58:17] <arlolra>	 ty
[22:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T2200)
[22:01:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223283 (https://phabricator.wikimedia.org/T413108) (owner: 10Arlolra)
[22:02:24] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy PRV to 27 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223283 (https://phabricator.wikimedia.org/T413108) (owner: 10Arlolra)
[22:02:45] <logmsgbot>	 !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1223283|Deploy PRV to 27 wikis (T413108)]]
[22:02:48] <stashbot>	 T413108: Parsoid Read Views to deploy ~2026-01-01 - https://phabricator.wikimedia.org/T413108
[22:03:08] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage
[22:09:07] <logmsgbot>	 !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1223283|Deploy PRV to 27 wikis (T413108)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:09:10] <stashbot>	 T413108: Parsoid Read Views to deploy ~2026-01-01 - https://phabricator.wikimedia.org/T413108
[22:15:23] <logmsgbot>	 !log arlolra@deploy2002 arlolra: Continuing with sync
[22:16:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11505648 (10bd808) >>! In T413634#11504543, @JMeybohm wrote: > #release-engineering-team: Could you help with removing +2 ?  I [[https://gerrit...
[22:20:28] <wikibugs>	 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11505652 (10Ragesoss) @ssingh I've just deployed an update that should fix it. Now the user agent is `Wiki Education Dashboard/1.0 (dashboard.wikiedu.org; sage@wikiedu.or...
[22:21:16] <logmsgbot>	 !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223283|Deploy PRV to 27 wikis (T413108)]] (duration: 18m 32s)
[22:21:19] <stashbot>	 T413108: Parsoid Read Views to deploy ~2026-01-01 - https://phabricator.wikimedia.org/T413108
[22:24:09] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:24:24] <arlolra>	 I'm done Pppery if you want to go
[22:24:28] <Pppery>	 Not a deployer
[22:24:44] <Pppery>	 Not the first time people thought I was one, though
[22:25:02] <arlolra>	 Did you want me to do that for you?
[22:25:27] <Pppery>	 If you mean "do I want you to deploy my patch", then sure
[22:25:41] <arlolra>	 Alrighty
[22:26:00] * cjming thanks arlolra
[22:26:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) (owner: 10Pppery)
[22:26:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11505670 (10bd808) @DannyS712 In addition to your MediaWiki +2 which I just revoked, do you want to give up other rights in Gerrit such as your...
[22:27:26] <Pppery>	 Someone should run namespaceDupes after it is deployed (although I checked and don't see any conflicts): https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes
[22:27:30] <wikibugs>	 (03Merged) 10jenkins-bot: Igwiki: add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) (owner: 10Pppery)
[22:27:48] <logmsgbot>	 !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1224799|Igwiki: add draft namespace (T406178)]]
[22:27:51] <stashbot>	 T406178: Add draft namespace to Igbo Wikipedia - https://phabricator.wikimedia.org/T406178
[22:29:51] <logmsgbot>	 !log arlolra@deploy2002 pppery, arlolra: Backport for [[gerrit:1224799|Igwiki: add draft namespace (T406178)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:29:55] <Pppery>	 Looking
[22:30:28] <Pppery>	 Seems to work
[22:30:35] <arlolra>	 Thanks
[22:30:40] <logmsgbot>	 !log arlolra@deploy2002 pppery, arlolra: Continuing with sync
[22:34:53] <logmsgbot>	 !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224799|Igwiki: add draft namespace (T406178)]] (duration: 07m 04s)
[22:34:56] <stashbot>	 T406178: Add draft namespace to Igbo Wikipedia - https://phabricator.wikimedia.org/T406178
[22:35:14] <arlolra>	 All done
[22:35:25] <Pppery>	 What about namespaceDupes?
[22:35:31] <Pppery>	 https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes
[22:35:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:36:04] <cjming>	 arlolra: do you have access to run that script?  otherwise i can do it
[22:36:14] <swfrench-wmf>	 !incidents
[22:36:15] <sirenbot>	 7299 (UNACKED)  [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[22:36:15] <sirenbot>	 7298 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:36:15] <sirenbot>	 7297 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[22:36:15] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[22:36:16] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:36:16] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[22:36:16] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:36:16] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[22:36:17] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[22:36:17] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[22:36:18] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[22:36:18] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[22:36:19] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[22:36:25] <swfrench-wmf>	 !ack 7299
[22:36:26] <sirenbot>	 7272 (RESOLVED)  fransw2001/check_memory
[22:36:47] <rzl>	 (known bug, sorry -- it acked the correct incident and then gave the wrong reply)
[22:37:15] <wikibugs>	 (03PS2) 10Ryan Kemper: Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking)
[22:38:20] <wikibugs>	 (03CR) 10Bking: [C:03+1] Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking)
[22:38:50] <wikibugs>	 (03PS3) 10Ryan Kemper: Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking)
[22:38:59] <arlolra>	 cjming: oh, can you take care of that?
[22:39:05] <cjming>	 sure np
[22:39:22] <arlolra>	 Thanks
[22:39:27] <wikibugs>	 (03PS4) 10Ryan Kemper: Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking)
[22:40:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:41:12] <arlolra>	 I do have access, just less experienced with running maintenance scripts
[22:41:44] <logmsgbot>	 !log cjming@deploy2002 mwscript-k8s job started: namespaceDupes igwiki --fix  # T406178
[22:41:47] <stashbot>	 T406178: Add draft namespace to Igbo Wikipedia - https://phabricator.wikimedia.org/T406178
[22:41:49] <Pppery>	 Anyone who can deploy patches has access to run maintenance scripts AFAIK
[22:42:00] <cjming>	 Pppery: done!
[22:42:04] <Pppery>	 Thanks
[22:43:10] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking)
[22:45:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:46:02] <swfrench-wmf>	 !incidents
[22:46:02] <sirenbot>	 7299 (ACKED)  [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[22:46:02] <sirenbot>	 7298 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:46:02] <sirenbot>	 7297 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[22:46:03] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[22:46:03] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:46:03] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[22:46:03] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:46:04] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[22:46:04] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[22:46:05] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[22:46:05] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[22:46:06] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[22:46:06] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[22:49:09] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:54:09] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:58:48] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "Move opensearch-ipoid to production state" [puppet] - 10https://gerrit.wikimedia.org/r/1224827
[23:00:23] <wikibugs>	 (03PS2) 10Ryan Kemper: Revert "Move opensearch-ipoid to production state" [puppet] - 10https://gerrit.wikimedia.org/r/1224827 (https://phabricator.wikimedia.org/T414037)
[23:00:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:01:16] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] Revert "Move opensearch-ipoid to production state" [puppet] - 10https://gerrit.wikimedia.org/r/1224827 (https://phabricator.wikimedia.org/T414037) (owner: 10Ryan Kemper)
[23:06:52] <swfrench-wmf>	 !incidents
[23:06:52] <sirenbot>	 7299 (ACKED)  [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[23:06:53] <sirenbot>	 7298 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[23:06:53] <sirenbot>	 7297 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[23:06:53] <sirenbot>	 7296 (RESOLVED)  [2x] ProbeDown sre (text-https:443 probes/service eqsin)
[23:06:53] <sirenbot>	 7294 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[23:06:53] <sirenbot>	 7295 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[23:06:54] <sirenbot>	 7290 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[23:06:54] <sirenbot>	 7292 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[23:06:54] <sirenbot>	 7291 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[23:06:55] <sirenbot>	 7293 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[23:06:55] <sirenbot>	 7289 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[23:06:56] <sirenbot>	 7288 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[23:06:56] <sirenbot>	 7287 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[23:09:09] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:14:09] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:15:51] <jinxer-wm>	 RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:17:50] <wikibugs>	 10ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413005#11505824 (10phaultfinder)