[00:14:09] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:40:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224244 [00:40:24] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224244 (owner: TrainBranchBot) [00:51:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1224244 (owner: TrainBranchBot) [01:01:01] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:03:25] (PS1) Aaron Schulz: Use meta.wikimedia.org for "wmf-restbase-global" sandbox specs [mediawiki-config] - https://gerrit.wikimedia.org/r/1224253 [01:10:29] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224254 [01:10:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224254 (owner: TrainBranchBot) [01:14:00] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 59s) [01:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:33:51] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1224254 (owner: TrainBranchBot) [02:12:40] FIRING: SystemdUnitFailed: 
send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:47:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:49:09] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:49:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:49:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:50:06] FIRING: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:51:19] !ack [02:51:20] 7259 (RESOLVED) frban1002/check_ipsec [02:51:20] 7259 (RESOLVED) frban1002/check_ipsec [02:51:36] hmm, I'll look at that later [02:51:37] !incidents [02:51:37] 7287 (ACKED) ProbeDown 
sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [02:51:37] 7288 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [02:51:38] 7289 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [02:51:38] 7284 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [02:51:38] 7283 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [02:51:38] 7282 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [02:51:38] 7281 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [02:51:39] 7280 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [02:52:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:53:01] \o [02:53:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [02:54:05] try it again... 
[02:54:05] !ack [02:54:06] 7259 (RESOLVED) frban1002/check_ipsec [02:54:09] FIRING: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:54:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:54:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:54:29] okay it's acking correctly, just replying with the wrong incident [02:56:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:56:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:56:22] !ack [02:56:23] 7259 (RESOLVED) frban1002/check_ipsec [02:56:23] 7259 (RESOLVED) frban1002/check_ipsec [02:58:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [02:59:26] !ack [02:59:26] no value provided for parameter incident and no default available [02:59:27] Incident id must be an integer [03:00:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:03:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [03:05:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:07:06] your turnillo-fu is strong swfrench-wmf [03:08:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [03:09:09] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:11:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable 
[03:11:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [03:12:29] SRE, Infrastructure-Foundations, Puppet CI, Puppet-Infrastructure, Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11502051 (hashar) Something I forgot, the `operations-puppet-catalog-... [03:13:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [03:35:29] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:38:56] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:43:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [03:48:56] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:50:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:09] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:09:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:34:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [05:34:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86778 and previous config saved to /var/cache/conftool/dbconfig/20260108-053449-marostegui.json [05:34:54] T411163: Drop ar_sha1 from 
archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:34:54] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:43:47] (PS1) Marostegui: mariadb: Productionize db2249 [puppet] - https://gerrit.wikimedia.org/r/1224527 (https://phabricator.wikimedia.org/T407941) [05:45:35] (CR) Marostegui: [C:+2] mariadb: Productionize db2249 [puppet] - https://gerrit.wikimedia.org/r/1224527 (https://phabricator.wikimedia.org/T407941) (owner: Marostegui) [05:51:16] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2231.codfw.wmnet onto db2249.codfw.wmnet [05:51:21] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db2231 - Depool db2231.codfw.wmnet to then clone it to db2249.codfw.wmnet - marostegui@cumin1003 [05:51:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2231 - Depool db2231.codfw.wmnet to then clone it to db2249.codfw.wmnet - marostegui@cumin1003 [05:52:10] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11502131 (Marostegui) Stalled→Open [05:52:56] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11502132 (Marostegui) [05:53:34] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11502134 (Marostegui) @KOfori you are the approver for `cassandra-staging-devs` can you take a look at this? 
thanks [05:58:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [05:58:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:59:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T413525)', diff saved to https://phabricator.wikimedia.org/P86780 and previous config saved to /var/cache/conftool/dbconfig/20260108-055901-marostegui.json [05:59:05] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:02:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool pc1013: test [06:02:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:02:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:02:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool pc1013: test [06:03:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool pc1013: test [06:03:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:04:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:04:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool pc1013: test [06:05:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool db2144: test [06:05:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:05:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:05:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool db2144: test [06:05:26] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2144: test [06:05:26] !log marostegui@cumin1003 START - Cookbook 
sre.mysql.parsercache [06:05:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:05:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2144: test [06:05:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T413525)', diff saved to https://phabricator.wikimedia.org/P86785 and previous config saved to /var/cache/conftool/dbconfig/20260108-060551-marostegui.json [06:05:55] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:06:19] (CR) Marostegui: "Hi, thanks for this." [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto) [06:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:16:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P86786 and previous config saved to /var/cache/conftool/dbconfig/20260108-061600-marostegui.json [06:26:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P86787 and previous config saved to /var/cache/conftool/dbconfig/20260108-062608-marostegui.json [06:35:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:36:17] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T413525)', diff saved to https://phabricator.wikimedia.org/P86788 and previous config saved to /var/cache/conftool/dbconfig/20260108-063616-marostegui.json [06:36:20] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:36:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [06:36:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T413525)', diff saved to https://phabricator.wikimedia.org/P86789 and previous config saved to /var/cache/conftool/dbconfig/20260108-063642-marostegui.json [06:40:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:42:09] (PS1) Ayounsi: Revert^2 "Line-wrap Homer diffs" [puppet] - https://gerrit.wikimedia.org/r/1224528 [06:43:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T413525)', diff saved to https://phabricator.wikimedia.org/P86790 and previous config saved to /var/cache/conftool/dbconfig/20260108-064333-marostegui.json [06:43:37] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:44:02] (CR) CI reject: [V:-1] Revert^2 "Line-wrap Homer diffs" [puppet] - https://gerrit.wikimedia.org/r/1224528 (owner: Ayounsi) [06:53:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P86791 and previous config saved to /var/cache/conftool/dbconfig/20260108-065342-marostegui.json [06:59:40] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: 
PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:59:46] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T0700) [07:00:05] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T0700). [07:00:06] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:01:40] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:01:46] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:03:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P86792 and previous config saved to /var/cache/conftool/dbconfig/20260108-070351-marostegui.json [07:04:09] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:08:07] (PS1) Marostegui: filtered_tables.txt: Add new columns [puppet] - https://gerrit.wikimedia.org/r/1224533 (https://phabricator.wikimedia.org/T413688) [07:08:22] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1224533 (https://phabricator.wikimedia.org/T413688) (owner: Marostegui) 
[07:08:38] (CR) Marostegui: [C:+2] filtered_tables.txt: Add new columns [puppet] - https://gerrit.wikimedia.org/r/1224533 (https://phabricator.wikimedia.org/T413688) (owner: Marostegui) [07:13:46] (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.5 [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) [07:14:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T413525)', diff saved to https://phabricator.wikimedia.org/P86793 and previous config saved to /var/cache/conftool/dbconfig/20260108-071359-marostegui.json [07:14:03] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:14:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:14:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T413525)', diff saved to https://phabricator.wikimedia.org/P86794 and previous config saved to /var/cache/conftool/dbconfig/20260108-071413-marostegui.json [07:21:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T413525)', diff saved to https://phabricator.wikimedia.org/P86795 and previous config saved to /var/cache/conftool/dbconfig/20260108-072130-marostegui.json [07:21:34] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:22:07] (CR) Dzahn: [C:+1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.5 [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) (owner: Jelto) [07:22:08] (CR) Arnaudb: "looks good to me!" 
[puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) (owner: Jelto) [07:22:11] (CR) Arnaudb: [C:+1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.5 [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) (owner: Jelto) [07:23:41] (CR) Jelto: [C:+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.5 [puppet] - https://gerrit.wikimedia.org/r/1224534 (https://phabricator.wikimedia.org/T414053) (owner: Jelto) [07:27:47] (PS2) Ayounsi: Revert^2 "Line-wrap Homer diffs" [puppet] - https://gerrit.wikimedia.org/r/1224528 [07:29:51] (CR) Muehlenhoff: [C:-1] "I don't think the patch is correct. Since my patch we're actually receiving a lot less of these and the reason we're still seeing some is " [puppet] - https://gerrit.wikimedia.org/r/1224528 (owner: Ayounsi) [07:31:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P86796 and previous config saved to /var/cache/conftool/dbconfig/20260108-073139-marostegui.json [07:31:56] (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1224205 (https://phabricator.wikimedia.org/T414032) (owner: Ahmon Dancy) [07:33:27] (CR) Dzahn: [C:+1] Add Cumin alias for tcpproxy hosts [puppet] - https://gerrit.wikimedia.org/r/1224057 (https://phabricator.wikimedia.org/T408532) (owner: Muehlenhoff) [07:36:54] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11502255 (Jclark-ctr) @Clement_Goubert Before I start racking these, do you want to verify that they’re correct by row, since we had so many orders for these? 
[07:37:09] (CR) Muehlenhoff: [C:+2] Yubikey-SSH-FIDO: add new key for dancy [puppet] - https://gerrit.wikimedia.org/r/1224205 (https://phabricator.wikimedia.org/T414032) (owner: Ahmon Dancy) [07:37:54] SRE, SRE-Access-Requests, Release-Engineering-Team, Patch-For-Review: Add yubikey ssh key for dancy - https://phabricator.wikimedia.org/T414032#11502257 (MoritzMuehlenhoff) [07:41:02] (CR) Muehlenhoff: [C:+1] "Confusion cleared up on IRC, this actually makes sense now!" [puppet] - https://gerrit.wikimedia.org/r/1224528 (owner: Ayounsi) [07:41:19] (CR) Ayounsi: [C:+2] Revert^2 "Line-wrap Homer diffs" [puppet] - https://gerrit.wikimedia.org/r/1224528 (owner: Ayounsi) [07:41:38] (CR) Dzahn: [C:+2] cache-text: add wikipedia25 to enabled_certificates [puppet] - https://gerrit.wikimedia.org/r/1224096 (https://phabricator.wikimedia.org/T408592) (owner: Jelto) [07:41:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P86797 and previous config saved to /var/cache/conftool/dbconfig/20260108-074147-marostegui.json [07:43:35] (CR) Joal: [C:+1] "I'm not yet very familiar with charts, but looked ok to me :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1224145 (https://phabricator.wikimedia.org/T413977) (owner: Btullis) [07:44:53] (CR) Muehlenhoff: [C:+2] Add Cumin alias for tcpproxy hosts [puppet] - https://gerrit.wikimedia.org/r/1224057 (https://phabricator.wikimedia.org/T408532) (owner: Muehlenhoff) [07:46:03] (PS4) Dzahn: cache-text: add wikipedia25 to enabled_certificates [puppet] - https://gerrit.wikimedia.org/r/1224096 (https://phabricator.wikimedia.org/T408592) (owner: Jelto) [07:47:39] (CR) Dzahn: [C:+2] cache-text: add wikipedia25 to enabled_certificates [puppet] - https://gerrit.wikimedia.org/r/1224096 (https://phabricator.wikimedia.org/T408592) (owner: Jelto) [07:51:56] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T413525)', diff saved to https://phabricator.wikimedia.org/P86798 and previous config saved to /var/cache/conftool/dbconfig/20260108-075155-marostegui.json [07:51:59] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:52:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [07:52:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86799 and previous config saved to /var/cache/conftool/dbconfig/20260108-075220-marostegui.json [07:53:31] sre-alert-triage, Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11502270 (JAllemandou) a:JAllemandou [07:55:39] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [07:56:04] sre-alert-triage, Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11502277 (JAllemandou) Current hadoop topology: ` joal@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -p... [07:59:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86800 and previous config saved to /var/cache/conftool/dbconfig/20260108-075937-marostegui.json [07:59:41] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T0800). nyaa~ [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:05:01] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [08:06:37] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4904.26 ms [08:07:05] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [08:07:29] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [08:08:41] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:08:45] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:08:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:08:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:09:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P86801 and previous config saved to 
/var/cache/conftool/dbconfig/20260108-080945-marostegui.json [08:10:06] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:10:12] !!incidents [08:10:14] !incidents [08:10:15] 7294 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [08:10:15] 7295 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [08:10:15] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [08:10:15] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [08:10:16] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [08:10:16] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [08:10:16] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [08:10:17] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [08:10:17] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [08:10:18] 7284 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [08:10:18] 7283 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [08:10:19] 7282 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [08:10:19] 7281 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [08:10:20] 7280 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [08:10:47] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:10:53] RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 0%, RTA = 1515.83 ms [08:13:41] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [08:13:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:14:09] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:14:09] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:14:19] !incidents [08:14:19] 7294 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [08:14:19] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [08:14:20] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [08:14:20] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [08:14:20] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [08:14:20] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [08:14:20] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [08:14:21] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [08:14:21] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [08:14:22] 7284 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [08:14:22] 7283 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [08:14:23] 7282 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 
zotero:4969 probes/service http_zotero_ip4 codfw) [08:14:23] 7281 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [08:14:24] 7280 (RESOLVED) ProbeDown sre (10.2.1.16 ip4 zotero:4969 probes/service http_zotero_ip4 codfw) [08:14:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:14:49] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:15:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [08:15:56] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:16:34] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [08:19:09] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:19:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P86802 and previous config saved to /var/cache/conftool/dbconfig/20260108-081953-marostegui.json [08:20:04] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [08:25:56] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:26:32] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:09] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:29:49] (03PS1) 10Dzahn: microsites: monitor wikipedia25.org (WIP) [puppet] - 
10https://gerrit.wikimedia.org/r/1224575 [08:30:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86803 and previous config saved to /var/cache/conftool/dbconfig/20260108-083001-marostegui.json [08:30:05] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:30:06] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:30:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [08:30:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T413525)', diff saved to https://phabricator.wikimedia.org/P86804 and previous config saved to /var/cache/conftool/dbconfig/20260108-083026-marostegui.json [08:31:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [08:34:09] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:36:01] (03PS1) 10Dzahn: add wikipedia25.org to list of wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592) [08:37:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T413525)', diff saved to https://phabricator.wikimedia.org/P86805 and previous config saved to /var/cache/conftool/dbconfig/20260108-083734-marostegui.json [08:37:38] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:41:50] (03CR) 10Dzahn: "compiler says it only changes profile::environment for proxy settings: https://puppet-compiler.wmflabs.org/output/1224576/7857/cp3070.esam" [puppet] - 10https://gerrit.wikimedia.org/r/1224576 
(https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [08:45:02] (03PS1) 10Dzahn: add wikipedia25.org to profile::cache::base::wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) [08:47:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P86806 and previous config saved to /var/cache/conftool/dbconfig/20260108-084742-marostegui.json [08:48:09] (03PS2) 10Dzahn: add wikipedia25.org to profile::cache::base::wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) [08:48:34] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1224580/7858/cp3070.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [08:50:11] (03CR) 10Vgutierrez: [C:04-2] "varnish uses `$wikimedia_domains = $profile::cache::base::wikimedia_domains` so I think this could be abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [08:50:41] (03CR) 10Dzahn: "just found that - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224580" [puppet] - 10https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [08:50:52] (03CR) 10Vgutierrez: [C:03+1] add wikipedia25.org to profile::cache::base::wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [08:51:48] (03CR) 10JMeybohm: docker registry: add ml build user password (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [08:57:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P86807 and previous 
config saved to /var/cache/conftool/dbconfig/20260108-085751-marostegui.json [08:58:16] (03PS1) 10Dzahn: Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 [08:58:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061 (10Tobi_WMDE_SW) 03NEW [08:58:46] (03CR) 10CI reject: [V:04-1] Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [08:59:06] (03Abandoned) 10Dzahn: add wikipedia25.org to list of wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/1224576 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [09:03:28] (03PS3) 10Muehlenhoff: Add javiermonton to kafka-jumbo-access group [puppet] - 10https://gerrit.wikimedia.org/r/1218337 (https://phabricator.wikimedia.org/T411774) [09:04:45] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11502464 (10Tobi_WMDE_SW) [09:08:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T413525)', diff saved to https://phabricator.wikimedia.org/P86808 and previous config saved to /var/cache/conftool/dbconfig/20260108-090759-marostegui.json [09:08:03] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:08:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [09:11:20] PROBLEM - jenkins_service_running on releases2003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [09:11:45] (03CR) 10Muehlenhoff: [C:03+2] Add javiermonton to kafka-jumbo-access group [puppet] - 10https://gerrit.wikimedia.org/r/1218337 
(https://phabricator.wikimedia.org/T411774) (owner: 10Muehlenhoff) [09:12:20] RECOVERY - jenkins_service_running on releases2003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [09:12:52] (03CR) 10Jelto: [C:03+1] "looks good to me, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [09:13:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:14:38] (03CR) 10Dzahn: [V:03+1 C:03+2] add wikipedia25.org to profile::cache::base::wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [09:16:44] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade. [09:19:04] (03PS2) 10Slyngshede: Notification for users to link their Phabricator account [software/bitu] - 10https://gerrit.wikimedia.org/r/1224076 [09:19:25] (03CR) 10Dzahn: "Am I right that a domain is either a "wikimedia_domain" or an "alternate_domain" but not both?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [09:19:44] jelto@cumin1003 upgrade (PID 3866799) is awaiting input [09:20:20] (03PS1) 10Aklapper: admin: remove old ssh key of aklapper [puppet] - 10https://gerrit.wikimedia.org/r/1224584 (https://phabricator.wikimedia.org/T413009) [09:21:50] (03PS1) 10Filippo Giunchedi: Remove spurious 'diff' file [alerts] - 10https://gerrit.wikimedia.org/r/1224585 [09:23:55] (03PS4) 10Vgutierrez: cache::haproxy: Get rid of http-request after use_backend warning [puppet] - 10https://gerrit.wikimedia.org/r/1215119 [09:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:26:36] (03CR) 10Muehlenhoff: [C:03+2] Remove cookbooks to migrate roles/hosts to Puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/1219861 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:28:21] (03CR) 10Vgutierrez: "alternate_domains lists the domains that need to be handled by the `misc` VCL rather than the `text` VCL on the varnish text cluster, so r" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [09:28:54] (03PS1) 10Jon Harald Søby: planet: Update Wikimedia Norge's feed URL [puppet] - 10https://gerrit.wikimedia.org/r/1224587 [09:30:57] (03CR) 10Dzahn: "Tbh, I thought misc cluster did not exist anymore and had been merged into "text". 
The intent is to treat this like any other micro site h" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [09:31:52] !log remove pybal BGP group on pfw1-codfw (replaced with Bird) - T414015 [09:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:55] T414015: Remove pfw configuration related to former pybal/LVS service - https://phabricator.wikimedia.org/T414015 [09:33:49] (03CR) 10Vgutierrez: [C:03+1] "in terms of hardware that's true, but we still have two VCLs. I'd recommend keeping this as similar as wikiworkshop.org as possible" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [09:34:55] (03CR) 10Dzahn: "gotcha! Yea, that was my approach as well. Comparing to wikiworkshop.org - so in that case we should remove it here and merge this revert." [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [09:35:14] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [09:36:14] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 120153 bytes in 0.412 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [09:36:44] (03CR) 10Muehlenhoff: "Personally I think the global default from WMFConfig.test_on is an antipattern we should get rid off. The current default is still" [puppet] - 10https://gerrit.wikimedia.org/r/1219149 (owner: 10Majavah) [09:37:12] 07sre-alert-triage, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11502561 (10JAllemandou) I have verified in puppet: all hosts in the `default` rack have already been added to the net-topology. N... 
[09:37:28] (03PS1) 10Joal: Hieradata/common.yaml: Update hadoop net topology [puppet] - 10https://gerrit.wikimedia.org/r/1224590 (https://phabricator.wikimedia.org/T413742) [09:38:46] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [09:40:37] (03CR) 10Btullis: [C:03+2] Hieradata/common.yaml: Update hadoop net topology [puppet] - 10https://gerrit.wikimedia.org/r/1224590 (https://phabricator.wikimedia.org/T413742) (owner: 10Joal) [09:40:58] (03CR) 10Slyngshede: Notification for users to link their Phabricator account (035 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1224076 (owner: 10Slyngshede) [09:44:02] (03CR) 10Clément Goubert: [V:03+2 C:03+2] ratelimit: Update to main branch e9ce92c [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1224124 (owner: 10Clément Goubert) [09:44:46] (03CR) 10Dzahn: [C:03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1224587 (owner: 10Jon Harald Søby) [09:46:22] (03PS2) 10Dzahn: Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 [09:46:29] (03PS7) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [09:46:34] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:46:36] (03CR) 10Dpogorzelski: docker registry: add ml build user password (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [09:46:48] !log Rebuilding ratelimit image - T414002 [09:46:49] (03PS2) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) [09:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:51] T414002: Upgrade 
ratelimit service to latest main - https://phabricator.wikimedia.org/T414002 [09:46:52] (03CR) 10CI reject: [V:04-1] Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [09:48:46] (03CR) 10CI reject: [V:04-1] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [09:49:09] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:50:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:50:56] (03CR) 10Jelto: [C:03+1] "+1 for keeping wikipedia25 config similar to wikiworkshop" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [09:54:09] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:54:10] (03PS1) 10Clément Goubert: ratelimit: Fix golang base image version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1224591 [09:56:41] (03PS1) 10Clément Goubert: ratelimit: Fix golang base image version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1224592 [09:57:08] (03CR) 10Elukey: [C:03+2] profile::docker_registry: add the ML instance [puppet] - 10https://gerrit.wikimedia.org/r/1224091 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [09:59:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet 
with reason: Maintenance [10:02:02] (03PS1) 10Clément Goubert: go mod tidy [software/envoyproxy/ratelimiter] (git20260107.e9ce92c-vendor) - 10https://gerrit.wikimedia.org/r/1224593 [10:02:36] (03PS1) 10Clément Goubert: Revert "go mod tidy" [software/envoyproxy/ratelimiter] (git20260107.e9ce92c-vendor) - 10https://gerrit.wikimedia.org/r/1224594 [10:03:38] (03CR) 10Btullis: [C:03+2] Add a kyuubi service to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224145 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [10:04:38] (03PS4) 10Muehlenhoff: Only select Puppet version based on the Debian release [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) [10:05:29] (03Merged) 10jenkins-bot: Add a kyuubi service to the spark-support chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224145 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [10:07:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance [10:07:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T413525)', diff saved to https://phabricator.wikimedia.org/P86809 and previous config saved to /var/cache/conftool/dbconfig/20260108-100757-marostegui.json [10:08:01] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:09:49] (03PS3) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) [10:10:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:10:58] (03CR) 10CI reject: [V:04-1] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 
10Federico Ceratto) [10:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [10:13:36] (03PS1) 10Clément Goubert: ratelimit: fix go build command [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1224597 [10:19:18] (03PS4) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) [10:20:42] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:20:46] (03CR) 10CI reject: [V:04-1] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [10:20:50] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:21:02] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:21:15] (03PS1) 10Clément Goubert: rest-gateway: Use fixed version for ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224598 (https://phabricator.wikimedia.org/T414002) [10:21:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [10:21:42] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [10:22:20] uh... :) [10:22:31] ^^ expected? [10:23:25] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Use fixed version for ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224598 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [10:23:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [10:23:37] it flapped like that earlier before as well [10:23:50] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:23:58] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:24:09] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [10:24:49] tappof: are you around? [10:25:06] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:18] (03Merged) 10jenkins-bot: rest-gateway: Use fixed version for ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224598 (https://phabricator.wikimedia.org/T414002) (owner: 10Clément Goubert) [10:26:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T413525)', diff saved to https://phabricator.wikimedia.org/P86810 and previous config saved to /var/cache/conftool/dbconfig/20260108-102613-marostegui.json [10:26:17] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:27:15] (03PS1) 10Clément Goubert: rest-gateway: Remove ratelimiter staging version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224600 [10:27:20] (03PS5) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 
10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) [10:28:49] (03CR) 10CI reject: [V:04-1] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [10:29:09] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:22] (03PS3) 10Dzahn: Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 [10:33:13] (03CR) 10Muehlenhoff: [C:03+2] Only select Puppet version based on the Debian release [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:35:21] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11502687 (10WMDE-leszek) I approve this request on WMDE end. 
[10:35:32] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Remove ratelimiter staging version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224600 (owner: 10Clément Goubert) [10:35:57] !log wikitech wiki - made lsobanski an admin - T414065 [10:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:00] T414065: Requesting Wikitech admin access for @lsobanski - https://phabricator.wikimedia.org/T414065 [10:36:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P86811 and previous config saved to /var/cache/conftool/dbconfig/20260108-103622-marostegui.json [10:36:32] (03CR) 10Dzahn: [C:03+2] Revert "cache-text: add wikipedia25.org to alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/1224581 (owner: 10Dzahn) [10:37:21] (03Merged) 10jenkins-bot: rest-gateway: Remove ratelimiter staging version override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224600 (owner: 10Clément Goubert) [10:39:27] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:39:43] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:43:14] (03CR) 10AikoChou: [C:03+2] ml-services: Update image and config for revise-tone-task-generator on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224117 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou) [10:44:52] (03Merged) 10jenkins-bot: ml-services: Update image and config for revise-tone-task-generator on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224117 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou) [10:45:49] (03PS1) 10Gkyziridis: revert-risk: Deploy on prod and staging new model version for both language-agnosting and multingual. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1224604 (https://phabricator.wikimedia.org/T411786) [10:46:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P86812 and previous config saved to /var/cache/conftool/dbconfig/20260108-104630-marostegui.json [10:50:17] (03PS1) 10Muehlenhoff: puppet: Remove the force_puppet7 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798) [10:52:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [10:53:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:56:02] (03CR) 10Santiago Faci: "I would say we can merge it. My understanding is that this is not a blocker anyway (at least for the renaming on our side) because there i" [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:56:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T413525)', diff saved to https://phabricator.wikimedia.org/P86814 and previous config saved to /var/cache/conftool/dbconfig/20260108-105639-marostegui.json [10:56:43] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:56:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [10:57:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T413525)', diff saved to https://phabricator.wikimedia.org/P86815 and previous config saved to /var/cache/conftool/dbconfig/20260108-105703-marostegui.json [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1100) [11:01:27] (03PS4) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) [11:03:18] (03PS1) 10Ozge: feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 [11:04:41] (03PS2) 10Ozge: feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 [11:05:45] (03PS1) 10Fabfur: cache::text enable auth, known, bot ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) [11:07:06] (03PS2) 10Fabfur: cache::text enable auth, known, bot ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) [11:09:37] (03CR) 10Elukey: docker registry: add ml build user password (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [11:09:58] (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545) [11:10:02] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:10:19] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:11:06] (03PS1) 10Mszwarc: Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) [11:11:56] (03CR) 10Kevin Bazira: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 (owner: 10Ozge) [11:14:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc) [11:15:18] (03PS1) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224616 (https://phabricator.wikimedia.org/T406545) [11:15:20] (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545) [11:15:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T413525)', diff saved to https://phabricator.wikimedia.org/P86816 and previous config saved to /var/cache/conftool/dbconfig/20260108-111537-marostegui.json [11:15:47] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [11:16:45] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:16:58] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:17:19] (03CR) 10Ozge: [C:03+2] feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 (owner: 10Ozge) [11:17:43] (03CR) 10Ozge: [V:03+2 C:03+2] feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 (owner: 10Ozge) [11:18:43] (03PS1) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) [11:18:45] (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224620 
(https://phabricator.wikimedia.org/T406545) [11:19:43] (03Merged) 10jenkins-bot: feat: embeddings server performance tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224607 (owner: 10Ozge) [11:21:13] (03PS1) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545) [11:21:15] (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545) [11:22:35] (03CR) 10Vgutierrez: [C:03+1] cache::text enable auth, known, bot ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:22:36] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11502848 (10elukey) Today I created a new docker distribution instance only for ML, backed by S3/apus and I created a new bucket for it (same account): registry-ml [11:23:42] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::scripts [puppet] - 10https://gerrit.wikimedia.org/r/1224089 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [11:25:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P86817 and previous config saved to /var/cache/conftool/dbconfig/20260108-112546-marostegui.json [11:27:57] (03PS1) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) [11:28:40] (03PS4) 10Silvan Heintze: Report progress of Wikibase entity dumps [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423) [11:28:40] (03PS2) 10Silvan Heintze: Report # of skipped entities by type [dumps] - 
10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) [11:29:15] (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224626 (https://phabricator.wikimedia.org/T406545) [11:29:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:30:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1224076 (owner: 10Slyngshede) [11:30:50] (03CR) 10Hnowlan: [C:03+1] Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [11:31:41] (03PS3) 10Muehlenhoff: Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465) [11:33:16] (03CR) 10Silvan Heintze: "Thanks for the review" [dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) (owner: 10Silvan Heintze) [11:34:53] (03CR) 10Muehlenhoff: [C:03+2] Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [11:35:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P86818 and previous config saved to /var/cache/conftool/dbconfig/20260108-113554-marostegui.json [11:35:59] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11502915 (10MoritzMuehlenhoff) [11:42:36] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:45:21] (03CR) 10Sergio 
Gimeno: [C:04-1] "Ack" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: 10Sergio Gimeno) [11:45:32] (03CR) 10Vgutierrez: [V:03+1 C:03+1] "looking good: https://puppet-compiler.wmflabs.org/output/1224609/7859/" [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:46:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T413525)', diff saved to https://phabricator.wikimedia.org/P86819 and previous config saved to /var/cache/conftool/dbconfig/20260108-114602-marostegui.json [11:46:06] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [11:46:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [11:46:20] (03CR) 10Fabfur: [C:03+2] cache::text enable auth, known, bot ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224609 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:46:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T413525)', diff saved to https://phabricator.wikimedia.org/P86820 and previous config saved to /var/cache/conftool/dbconfig/20260108-114627-marostegui.json [11:50:10] !log ozge@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:52:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [11:53:32] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [11:54:01] (03PS2) 10Fabfur: cache::text: enable unid, browser ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545) [11:54:17] (03CR) 10Slyngshede: [C:03+2] Notification for users to link their Phabricator account [software/bitu] - 10https://gerrit.wikimedia.org/r/1224076 (owner: 10Slyngshede) [11:55:29] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:55:39] (03CR) 10Vgutierrez: [V:03+1 C:03+1] cache::text: enable unid, browser ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:57:13] (03Merged) 10jenkins-bot: Notification for users to link their Phabricator account [software/bitu] - 10https://gerrit.wikimedia.org/r/1224076 (owner: 10Slyngshede) [12:00:13] (03PS1) 10Btullis: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) [12:00:22] (03CR) 10CI reject: [V:04-1] Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [12:00:22] (03PS2) 10Btullis: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) [12:02:01] (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (eqsin) [puppet] - 
10https://gerrit.wikimedia.org/r/1224614 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [12:02:17] (03CR) 10Dreamy Jazz: [C:03+1] Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc) [12:03:41] (03CR) 10Clément Goubert: [C:03+1] Copy rest_v1-wikimedia.json to standard-docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224228 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [12:03:42] (03PS3) 10Btullis: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) [12:03:51] (03CR) 10CI reject: [V:04-1] Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [12:03:51] (03PS4) 10Btullis: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) [12:04:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T413525)', diff saved to https://phabricator.wikimedia.org/P86821 and previous config saved to /var/cache/conftool/dbconfig/20260108-120436-marostegui.json [12:04:40] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [12:06:34] (03PS1) 10Clément Goubert: Revert^2 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224638 [12:07:21] (03CR) 10Hnowlan: [C:03+1] Revert^2 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224638 (owner: 10Clément Goubert) [12:11:04] (03CR) 10Clément Goubert: [C:03+2] Revert^2 "restgateway: migrate the 
/api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224638 (owner: 10Clément Goubert) [12:13:52] (03PS1) 10Clément Goubert: deploy: Add redioscope kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224643 (https://phabricator.wikimedia.org/T413999) [12:14:09] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:14:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P86822 and previous config saved to /var/cache/conftool/dbconfig/20260108-121445-marostegui.json [12:16:15] (03PS1) 10Clément Goubert: aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999) [12:19:02] 07sre-alert-triage, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11503111 (10JAllemandou) Namenodes restarted on `an-master1003` and `an-master1004`. The alert is solved. 
[12:19:43] (03PS1) 10Hashar: admin: hashar: sync .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1224648 [12:24:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P86823 and previous config saved to /var/cache/conftool/dbconfig/20260108-122453-marostegui.json [12:24:59] (03PS8) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [12:25:21] (03CR) 10Dpogorzelski: docker registry: add ml build user password (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [12:25:58] (03CR) 10Jelto: [C:03+2] admin: hashar: sync .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1224648 (owner: 10Hashar) [12:32:12] (03CR) 10Btullis: [C:03+2] Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [12:33:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [12:34:09] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:34:12] (03Merged) 10jenkins-bot: Update the zookeeper address for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224636 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [12:34:40] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T413525)', diff saved to https://phabricator.wikimedia.org/P86824 and 
previous config saved to /var/cache/conftool/dbconfig/20260108-123501-marostegui.json [12:35:05] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [12:35:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [12:35:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86825 and previous config saved to /var/cache/conftool/dbconfig/20260108-123526-marostegui.json [12:35:46] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [12:37:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [12:39:00] !log ozge@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [12:39:09] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [12:40:50] (03CR) 10Jelto: [C:03+1] "what about https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common.yaml#2826 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [12:41:27] jmm@cumin2002 reimage (PID 2452309) is awaiting input [12:42:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86826 and previous config saved to /var/cache/conftool/dbconfig/20260108-124238-marostegui.json [12:42:42] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [12:44:53] (03CR) 10Vgutierrez: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1224616/7861/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1224616 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [12:45:34] (03CR) 10Santiago Faci: [C:03+1] extension-list: Add Test Kitchen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [12:47:38] PROBLEM - LDAP -writable server- on serpens is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [12:49:28] (03Abandoned) 10Tchanders: WIP Check if adding - prevents "no change" CI failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224045 (owner: 10Tchanders) [12:49:40] (03CR) 10Tchanders: "Same failure" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224045 (owner: 10Tchanders) [12:52:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P86827 and previous config saved to /var/cache/conftool/dbconfig/20260108-125247-marostegui.json [12:53:16] !log installing libsodium security updates on bullseye [12:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:26] (03PS1) 10Btullis: Failover the hive server2 and metastore services to the standby [dns] - 10https://gerrit.wikimedia.org/r/1224657 
(https://phabricator.wikimedia.org/T303168) [12:59:40] (03CR) 10Dzahn: [V:03+1 C:03+2] "That was https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224576 but see the comment there." [puppet] - 10https://gerrit.wikimedia.org/r/1224580 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1300) [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P86828 and previous config saved to /var/cache/conftool/dbconfig/20260108-130255-marostegui.json [13:04:38] RECOVERY - LDAP -writable server- on serpens is OK: LDAP OK - 0.101 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [13:09:42] (03PS1) 10Slyngshede: Account linking: hide message box when linked [software/bitu] - 10https://gerrit.wikimedia.org/r/1224660 [13:10:43] (03PS1) 10Btullis: Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977) [13:10:52] (03CR) 10CI reject: [V:04-1] Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [13:10:52] (03PS2) 10Btullis: Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977) [13:12:51] (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, known, bot ratelimiting (codfw) [puppet] - 
10https://gerrit.wikimedia.org/r/1224616 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:12:59] (03CR) 10Btullis: [C:03+2] Report progress of Wikibase entity dumps [dumps] - 10https://gerrit.wikimedia.org/r/1219837 (https://phabricator.wikimedia.org/T408423) (owner: 10Silvan Heintze) [13:13:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T413525)', diff saved to https://phabricator.wikimedia.org/P86831 and previous config saved to /var/cache/conftool/dbconfig/20260108-131303-marostegui.json [13:13:07] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:13:08] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11503308 (10Gehel) [13:13:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance [13:13:27] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: PuppetPendingCertificateRequest (instance puppetserver1001:9100) - https://phabricator.wikimedia.org/T412789#11503322 (10Gehel) [13:13:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2193 (T413525)', diff saved to https://phabricator.wikimedia.org/P86832 and previous config saved to /var/cache/conftool/dbconfig/20260108-131327-marostegui.json [13:13:31] (03CR) 10Btullis: [C:03+2] Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [13:14:32] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11503350 (10Gehel) [13:14:58] 10ops-eqiad, 06SRE, 06DC-Ops, 
06Data-Platform-SRE (2026.01.05 - 2026.01.23): Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11503357 (10Gehel) [13:15:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11503378 (10Gehel) [13:15:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11503386 (10Gehel) [13:16:06] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [13:16:40] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11503413 (10Gehel) [13:17:22] (03Merged) 10jenkins-bot: Add a networkpolicy to spark-support permitting access to kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224661 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [13:17:45] 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 13Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11503431 (10Gehel) [13:17:57] !log installing imagemagick security updates [13:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:53] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [13:19:03] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [13:19:31] FTR, I won’t be around during the first half of today’s UTC afternoon backport window [13:19:37] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), and 3 others: October 2025 Bullseye reboots: Search 
Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11503473 (10Gehel) [13:19:48] maybe someone else from the Wikidata team will be around to deploy the config change I scheduled, otherwise I’ll do it after ca. 14:30 UTC [13:20:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T413525)', diff saved to https://phabricator.wikimedia.org/P86833 and previous config saved to /var/cache/conftool/dbconfig/20260108-132003-marostegui.json [13:20:05] (and hopefully someone else can deploy for Superpes3 and Msz2001) [13:20:07] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:20:31] I'm a deployer myself, I can do it [13:21:47] 👍 [13:23:18] PROBLEM - Ensure traffic_manager is running for instance backend on cp6004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:23:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2231 gradually with 4 steps - Pool db2231.codfw.wmnet in after cloning [13:23:40] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Alert in need of triage: HDFS topology check (instance an-master1003) - https://phabricator.wikimedia.org/T413742#11503552 (10Gehel) 05Open→03Resolved [13:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:24:18] RECOVERY - Ensure traffic_manager is running for instance backend on cp6004 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:26:56] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) 
[13:26:56] PROBLEM - Host cp7014 is DOWN: CRITICAL - Time to live exceeded (10.140.1.10) [13:26:56] PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8) [13:27:14] RECOVERY - Host cp7014 is UP: PING OK - Packet loss = 0%, RTA = 137.56 ms [13:27:18] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 138.13 ms [13:27:18] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 138.16 ms [13:30:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P86835 and previous config saved to /var/cache/conftool/dbconfig/20260108-133011-marostegui.json [13:31:12] (03PS2) 10Fabfur: cache::text: enable unid, browser ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545) [13:32:24] (03CR) 10Joal: [C:03+1] Failover the hive server2 and metastore services to the standby [dns] - 10https://gerrit.wikimedia.org/r/1224657 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [13:32:45] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:32:54] (03CR) 10Vgutierrez: [V:03+1 C:03+1] cache::text: enable unid, browser ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:33:17] (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1224617 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:37:04] (03PS1) 10Jgreen: Remove deprecated pay-lvs records and related transitional records [dns] - 10https://gerrit.wikimedia.org/r/1224672 (https://phabricator.wikimedia.org/T398321) [13:38:25] 
(03CR) 10Jgreen: [C:03+2] Remove deprecated pay-lvs records and related transitional records [dns] - 10https://gerrit.wikimedia.org/r/1224672 (https://phabricator.wikimedia.org/T398321) (owner: 10Jgreen) [13:39:33] (03PS1) 10Ladsgroup: Update API call in edit.js with rvslots [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224673 (https://phabricator.wikimedia.org/T412762) [13:39:53] (03PS2) 10Clément Goubert: aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999) [13:40:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P86837 and previous config saved to /var/cache/conftool/dbconfig/20260108-134020-marostegui.json [13:42:01] (03CR) 10Gehel: [C:03+1] Failover the hive server2 and metastore services to the standby [dns] - 10https://gerrit.wikimedia.org/r/1224657 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [13:45:40] PROBLEM - Thanos swift https on thanos-fe1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [13:45:50] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:46:30] RECOVERY - Thanos swift https on thanos-fe1006 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Thanos [13:46:48] !log jgreen@dns1004 START - running authdns-update [13:47:50] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:47:55] !log jgreen@dns1004 END - running authdns-update [13:50:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T413525)', diff saved to https://phabricator.wikimedia.org/P86838 and previous config saved to /var/cache/conftool/dbconfig/20260108-135028-marostegui.json [13:50:32] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:50:32] (03PS9) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [13:50:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [13:51:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance [13:51:07] (03PS2) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) [13:51:10] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:51:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2217 (T413525)', diff saved to https://phabricator.wikimedia.org/P86839 and previous config saved to /var/cache/conftool/dbconfig/20260108-135111-marostegui.json [13:53:42] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:53:53] (03CR) 10CDanis: [C:03+2] benthos: webrequest: add res_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1224178 (owner: 10CDanis) 
[13:55:07] (03PS1) 10Zabe: MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224679 (https://phabricator.wikimedia.org/T414077) [13:55:19] (03CR) 10Vgutierrez: [V:03+1 C:03+1] cache::text: enable auth, known, bot ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:57:05] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [13:57:14] (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, known, bot ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224619 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [13:58:16] !log jgreen@cumin1003 START - Cookbook sre.dns.netbox [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1400). [14:00:05] Lucas_WMDE, Superpes, and Msz2001: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] I'm ready to deploy [14:00:23] Superpes3: Are you around? 
[14:00:47] (03PS2) 10Fabfur: cache::text: enable unid, browser ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545) [14:00:57] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:01:07] I guess, I'll start with my patch, then [14:01:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc) [14:01:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:02:02] (03PS1) 10Clément Goubert: Revert^3 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224681 [14:02:58] (03Merged) 10jenkins-bot: Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc) [14:03:23] !log jgreen@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:04:06] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1224615|Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler (T413929)]] [14:04:09] T413929: CheckUser TransactionProfiler warnings when using Special:CentralAutoLogin creates local accounts - https://phabricator.wikimedia.org/T413929 [14:05:43] (03PS11) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 
(https://phabricator.wikimedia.org/T411573) [14:05:50] (03PS1) 10Btullis: Fix the kyuubi-headless service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224682 (https://phabricator.wikimedia.org/T413977) [14:05:59] (03PS2) 10Btullis: Fix the kyuubi-headless service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224682 (https://phabricator.wikimedia.org/T413977) [14:06:36] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1224615|Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler (T413929)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:06:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T413525)', diff saved to https://phabricator.wikimedia.org/P86842 and previous config saved to /var/cache/conftool/dbconfig/20260108-140705-marostegui.json [14:07:09] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:07:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [14:07:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1224660 (owner: 10Slyngshede) [14:08:37] (03CR) 10Btullis: [C:03+2] Fix the kyuubi-headless service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224682 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [14:08:51] !log mszwarc@deploy2002 Sync cancelled. 
[14:08:57] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2231 gradually with 4 steps - Pool db2231.codfw.wmnet in after cloning [14:09:11] (03CR) 10Clément Goubert: [C:03+2] Revert^3 "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1224681 (owner: 10Clément Goubert) [14:09:24] (03PS1) 10TrainBranchBot: Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224685 [14:09:25] (03CR) 10TrainBranchBot: "mszwarc@deploy2002 created a revert of this change as I9a27e1e066cfee6b26fdca6c05408017b8810b92" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224615 (https://phabricator.wikimedia.org/T413929) (owner: 10Mszwarc) [14:10:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224685 (owner: 10TrainBranchBot) [14:10:32] (03Merged) 10jenkins-bot: Fix the kyuubi-headless service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224682 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [14:11:15] o/ [14:13:37] Hi, Superpes! I'm processing my patch, I can then go with yours [14:15:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [14:15:20] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [14:16:07] Msz2001 Many thanks :) I just added another one, I'm on a train so I might have some internet issues, in any case I should be able to test my patches without problems! 
[14:16:48] Ack [14:17:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P86844 and previous config saved to /var/cache/conftool/dbconfig/20260108-141714-marostegui.json [14:17:42] (03CR) 10Muehlenhoff: [C:03+2] pcc: Drop obsolete OS conditional [puppet] - 10https://gerrit.wikimedia.org/r/1126104 (owner: 10Muehlenhoff) [14:18:18] (03CR) 10Btullis: [C:03+2] Failover the hive server2 and metastore services to the standby [dns] - 10https://gerrit.wikimedia.org/r/1224657 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [14:18:36] !log btullis@dns1004 START - running authdns-update [14:19:37] !log btullis@dns1004 END - running authdns-update [14:23:25] (03Merged) 10jenkins-bot: Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224685 (owner: 10TrainBranchBot) [14:23:55] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1224685|Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]] [14:24:09] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:25:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2231.codfw.wmnet onto db2249.codfw.wmnet [14:26:08] !log mszwarc@deploy2002 mszwarc, trainbranchbot: Backport for [[gerrit:1224685|Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:26:18] (03PS1) 10Muehlenhoff: Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) [14:26:46] !log jgreen@cumin1003 START - Cookbook sre.dns.netbox [14:27:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P86845 and previous config saved to /var/cache/conftool/dbconfig/20260108-142722-marostegui.json [14:27:23] !log mszwarc@deploy2002 mszwarc, trainbranchbot: Continuing with sync [14:28:45] (03PS1) 10Mszwarc: Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224688 [14:28:47] Msz2001: would you mind pinging me once you're done? Thanks! :D [14:28:57] Sure! [14:31:12] (03CR) 10Vgutierrez: [C:03+1] cache::text: enable unid, browser ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:31:30] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224685|Revert "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]] (duration: 07m 34s) [14:31:52] Superpes: Ready to deploy yours [14:32:04] (03PS1) 10Muehlenhoff: Rename puppetmaster::gitsync and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) [14:32:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223155 (https://phabricator.wikimedia.org/T413530) (owner: 10Superpes15) [14:32:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223159 (https://phabricator.wikimedia.org/T413737) (owner: 10Superpes15) [14:32:32] (03CR) 10TrainBranchBot: [C:03+2] 
"Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224232 (https://phabricator.wikimedia.org/T413848) (owner: 10Superpes15) [14:33:16] !log installing systemd bugfix updates from Bookworm point release [14:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:44] (03Merged) 10jenkins-bot: [enwikiquote] Enable block feature for AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223155 (https://phabricator.wikimedia.org/T413530) (owner: 10Superpes15) [14:33:46] (03Merged) 10jenkins-bot: [ruwiki] Disable setting a cookie for blocked anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223159 (https://phabricator.wikimedia.org/T413737) (owner: 10Superpes15) [14:33:57] (03PS2) 10Zabe: MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224679 (https://phabricator.wikimedia.org/T414077) [14:34:02] (03Merged) 10jenkins-bot: [enwikiquote] Add new autopatrolled and patroller usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224232 (https://phabricator.wikimedia.org/T413848) (owner: 10Superpes15) [14:34:36] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1223155|[enwikiquote] Enable block feature for AbuseFilter (T413530)]], [[gerrit:1223159|[ruwiki] Disable setting a cookie for blocked anonymous users (T413737)]], [[gerrit:1224232|[enwikiquote] Add new autopatrolled and patroller usergroups (T413848)]] [14:34:41] Thanks Msz2001 :) [14:34:43] T413530: Enable the AbuseFilter block action on the English Wikiquote - https://phabricator.wikimedia.org/T413530 [14:34:43] T413737: Disable installing a "block cookie" to a proxy-blocked anons in ruwiki - https://phabricator.wikimedia.org/T413737 [14:34:43] T413848: [enwikiquote] Create the autopatroller and patroller user groups - https://phabricator.wikimedia.org/T413848 [14:34:54] !log jgreen@cumin1003 
END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:34:58] If we have time later, I'll try to redeploy my original patch - I did so at 14:00 UTC, but when testing it appeared not to work, so I didn't proceed with deploying to prod, instead created a revert patch, but on further analysis it turned out that I was testing it on a wrong version of wiki, so I'll try to redeploy the patch later, because the problem was at my side and not at the patch's :D [14:36:36] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [14:36:48] !log mszwarc@deploy2002 mszwarc, superpes: Backport for [[gerrit:1223155|[enwikiquote] Enable block feature for AbuseFilter (T413530)]], [[gerrit:1223159|[ruwiki] Disable setting a cookie for blocked anonymous users (T413737)]], [[gerrit:1224232|[enwikiquote] Add new autopatrolled and patroller usergroups (T413848)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:37:10] Testing [14:37:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T413525)', diff saved to https://phabricator.wikimedia.org/P86846 and previous config saved to /var/cache/conftool/dbconfig/20260108-143730-marostegui.json [14:37:34] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:37:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance [14:37:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2224 (T413525)', diff saved to https://phabricator.wikimedia.org/P86847 and previous config saved to /var/cache/conftool/dbconfig/20260108-143755-marostegui.json [14:39:10] Msz2001 They all look fine :D [14:39:15] !log mszwarc@deploy2002 mszwarc, superpes: Continuing with sync [14:39:32] (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1224620 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:40:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:41:03] (03PS2) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545) [14:41:45] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:41:52] o/ [14:42:18] Msz2001: you’re still deploying, right? [14:42:29] Yes, finishing Superpes' patches [14:42:33] ok thanks [14:42:35] I called dibs [14:42:56] well, I scheduled my change… [14:43:05] Amir1: what do you want to deploy? 
[14:43:13] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223155|[enwikiquote] Enable block feature for AbuseFilter (T413530)]], [[gerrit:1223159|[ruwiki] Disable setting a cookie for blocked anonymous users (T413737)]], [[gerrit:1224232|[enwikiquote] Add new autopatrolled and patroller usergroups (T413848)]] (duration: 08m 36s) [14:43:15] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1224673?usp=email [14:43:19] T413530: Enable the AbuseFilter block action on the English Wikiquote - https://phabricator.wikimedia.org/T413530 [14:43:19] T413737: Disable installing a "block cookie" to a proxy-blocked anons in ruwiki - https://phabricator.wikimedia.org/T413737 [14:43:19] T413848: [enwikiquote] Create the autopatroller and patroller user groups - https://phabricator.wikimedia.org/T413848 [14:43:25] Msz2001 Thanks for your assistance :3 [14:43:30] You're welcome [14:44:16] Lucas_WMDE: You can go [14:44:39] thanks [14:44:47] and then you can both start gate-and-submit for your backports [14:45:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214986 (https://phabricator.wikimedia.org/T403015) (owner: 10Arthur taylor) [14:45:19] !log jgreen@cumin1003 START - Cookbook sre.dns.netbox [14:45:23] (03CR) 10Ladsgroup: [C:03+2] Update API call in edit.js with rvslots [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224673 (https://phabricator.wikimedia.org/T412762) (owner: 10Ladsgroup) [14:46:06] (03Merged) 10jenkins-bot: Enable the MEX / wbui2025 beta feature on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214986 (https://phabricator.wikimedia.org/T403015) (owner: 10Arthur taylor) [14:46:36] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1214986|Enable the MEX / wbui2025 beta feature on wikidata (T403015)]] [14:46:39] T403015: [MEX] M3 - Release onto wikidata.org under 
feature flag - https://phabricator.wikimedia.org/T403015 [14:47:01] Msz2001: want to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1224688 together with Amir1? (and +2 it now?) [14:47:19] I don't have +2 rights in that branch [14:47:20] (03PS2) 10Muehlenhoff: Rename puppetmaster::gitsync and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) [14:47:27] But otherwise can deploy it together [14:47:30] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224688 (owner: 10Mszwarc) [14:47:32] o_O [14:47:48] that sounds like a permissions mistake to me, if you can deploy then I assume you should have +2 rights [14:47:55] !log jgreen@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:48:17] I'll then dig in the documentation what to do about it [14:48:36] But thanks for the +2 :) [14:48:46] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, arthurtaylor: Backport for [[gerrit:1214986|Enable the MEX / wbui2025 beta feature on wikidata (T403015)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:49:28] Msz2001: i can fix that... 
[14:49:28] (03PS1) 10Gehel: chore(elasticsearch): cleanup unused roles / profiles after migration to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1224691 (https://phabricator.wikimedia.org/T388607) [14:49:36] testing [14:50:00] (03CR) 10CI reject: [V:04-1] chore(elasticsearch): cleanup unused roles / profiles after migration to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1224691 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [14:50:01] Msz2001: but in theory you shouldn't need +2, as scap will apply it on your behalf if neeeded [14:50:23] Yes, and that's how I proceeded with deployments normally [14:50:42] Msz2001: you're `mszwarc` in the shell world, right? [14:50:48] Right [14:51:20] !log Add `mszwarc` to `wmf-deployment` on Gerrit (existing deployer, T404697) [14:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:23] T404697: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697 [14:51:25] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, arthurtaylor: Continuing with sync [14:51:31] Msz2001: Lucas_WMDE: should work now! [14:51:45] thanks! [14:51:49] (03CR) 10Vgutierrez: [C:03+1] cache::text: enable auth, known, bot ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:51:52] Thanks, it works indeed [14:52:18] (re “but in theory” – yes, but I don’t think we want deployers who can’t “clean up” the git situation outside of scap) [14:52:18] Msz2001: curiously, `spiderpig-access` should have access by default, and you're not there (https://ldap.toolforge.org/group/spiderpig-access) it seems either... [14:52:27] ...does spiderpig work for you somehow anyway? [14:52:46] No, it doesn't [14:53:16] okay. 
then you probably want request access to that LDAP group in IDM (https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access#Using_the_Wikimedia_Identity_Management_System) [14:53:26] ok, will do it [14:53:28] Thanks [14:53:31] np [14:53:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T413525)', diff saved to https://phabricator.wikimedia.org/P86848 and previous config saved to /var/cache/conftool/dbconfig/20260108-145343-marostegui.json [14:53:48] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:54:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:55:25] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214986|Enable the MEX / wbui2025 beta feature on wikidata (T403015)]] (duration: 08m 49s) [14:55:28] T403015: [MEX] M3 - Release onto wikidata.org under feature flag - https://phabricator.wikimedia.org/T403015 [14:55:34] Amir1, Msz2001: over to you [14:55:48] Amir1: do you want to deploy or should I do it? 
[14:55:49] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission pay-lvs1003.frack.eqiad.wmnet and pay-lvs1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T413986#11503958 (10Jgreen) a:05Jgreen→03None [14:55:59] Msz2001: go for it [14:56:02] ok [14:56:14] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224693 [14:56:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224688 (owner: 10Mszwarc) [14:56:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224673 (https://phabricator.wikimedia.org/T412762) (owner: 10Ladsgroup) [14:57:28] (03PS1) 10Btullis: Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977) [14:57:37] (03PS2) 10Btullis: Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977) [14:57:37] (03CR) 10CI reject: [V:04-1] Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [14:58:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie [14:58:46] (03Merged) 10jenkins-bot: Update API call in edit.js with rvslots [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224673 (https://phabricator.wikimedia.org/T412762) (owner: 10Ladsgroup) [14:59:46] (03Merged) 10jenkins-bot: Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler" [extensions/CheckUser] (wmf/1.46.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/1224688 (owner: 10Mszwarc) [15:00:44] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1224688|Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]], [[gerrit:1224673|Update API call in edit.js with rvslots (T412762)]] [15:00:47] T412762: Fix edit.js to set rvslots in API calls - https://phabricator.wikimedia.org/T412762 [15:01:05] (03PS1) 10Muehlenhoff: pontoon: Cleanup dead projects [puppet] - 10https://gerrit.wikimedia.org/r/1224696 (https://phabricator.wikimedia.org/T365798) [15:01:23] (03PS1) 10Gehel: chore(elasticsearch): cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) [15:02:51] !log mszwarc@deploy2002 ladsgroup, mszwarc: Backport for [[gerrit:1224688|Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]], [[gerrit:1224673|Update API call in edit.js with rvslots (T412762)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:03:20] (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, known, bot ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224621 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:03:32] Amir1: test please [15:03:43] (my patch works fine) [15:03:45] on it [15:03:52] On site at eqiad just noticed alot of orange warning lights in Rack C3. 
looks like tripped breaker L3-L1 investigating right now [15:03:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P86849 and previous config saved to /var/cache/conftool/dbconfig/20260108-150351-marostegui.json [15:04:59] (03CR) 10Elukey: "Tried with test-cookbook for wdqs1029 and got:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [15:05:36] works fine Msz2001 let's goooo [15:05:49] !log mszwarc@deploy2002 ladsgroup, mszwarc: Continuing with sync [15:08:44] (03CR) 10CDanis: [C:03+2] turnilo: webrequest: add res_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1224181 (owner: 10CDanis) [15:09:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:55] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224688|Revert^2 "Silence TransactionProfiler warnings in CheckUserPrivateEventsHandler"]], [[gerrit:1224673|Update API call in edit.js with rvslots (T412762)]] (duration: 09m 11s) [15:09:58] T412762: Fix edit.js to set rvslots in API calls - https://phabricator.wikimedia.org/T412762 [15:10:00] Done [15:10:44] !log UTC afternoon backport+config window done [15:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:01] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#11504015 (10Andrew) [15:11:13] dual power is restored to all devices except kafka-main1008 [15:13:29] (03CR) 10Federico Ceratto: "I added a more explicit log line e.g. 
"INFO The whole 'pc5' section will be depooled" but maybe you meant to also change how the parsercac" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [15:13:30] FIRING: LibericaUnhealthyRealserverPooled: Liberica service text-httpslb_443 has 5 unhealthy realservers pooled on lvs5006:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://grafana.wikimedia.org/d/d70d14db-4a71-414d-8425-7a30d7127ca6/liberica-services?orgId=1&var-site=eqsin&var-service=text-httpslb_443&var-instance=lvs5006 - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserv [15:13:57] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P86850 and previous config saved to /var/cache/conftool/dbconfig/20260108-151400-marostegui.json [15:14:28] <_joe_> uh what's going on? 
[15:15:37] <_joe_> !incidents [15:15:37] 7296 (UNACKED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [15:15:37] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [15:15:37] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [15:15:38] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [15:15:38] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [15:15:38] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [15:15:38] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [15:15:39] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [15:15:39] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [15:15:40] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [15:15:51] in eqsin [15:15:55] !ack 7296 [15:15:56] 7260 (RESOLVED) payments2006/check_mysql [15:16:13] <_joe_> yeah looks like traffic-level issues [15:18:30] RESOLVED: [2x] LibericaUnhealthyRealserverPooled: Liberica service text-httpslb_443 has 2 unhealthy realservers pooled on lvs5004:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [15:18:37] (03CR) 10Elukey: "Tried to run install-console, and ran `puppet agent --test --color=false --debug`:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [15:18:39] <_joe_> fabfur: do you see anything in the graphs? 
[15:18:57] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:15] _joe_: looking , we had a big deep in both text and upload [15:19:47] <_joe_> a big dip of what? [15:20:00] traffic to haproxy, looking at the network now [15:20:11] what I'm seeing is a huge spike of new connections in text@eqsin [15:20:16] <_joe_> a lot of NELs [15:20:26] https://grafana.wikimedia.org/goto/YCrFXJVDR?orgId=1 [15:20:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [15:21:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86851 and previous config saved to /var/cache/conftool/dbconfig/20260108-152103-marostegui.json [15:21:09] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:21:10] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:24:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T413525)', diff saved to https://phabricator.wikimedia.org/P86852 and previous config saved to /var/cache/conftool/dbconfig/20260108-152407-marostegui.json [15:24:11] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:24:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2229.codfw.wmnet with reason: Maintenance [15:24:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T413525)', diff saved to https://phabricator.wikimedia.org/P86853 and previous config saved to 
/var/cache/conftool/dbconfig/20260108-152432-marostegui.json [15:25:56] 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on kafka-main1008 - https://phabricator.wikimedia.org/T414101 (10Jclark-ctr) 03NEW [15:27:36] 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on kafka-main1008 - https://phabricator.wikimedia.org/T414101#11504114 (10Jclark-ctr) Removed all power cords for the affected breaker, reset it, and added the cords back individually until locating a fried PSU on kafka-main1008. dual power is restored to a... [15:29:01] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for aghirelli - https://phabricator.wikimedia.org/T414102 (10AGhirelli-WMF) 03NEW [15:29:17] 06SRE, 10Data Pipelines, 06Data-Engineering: Unrecognised file under /srv/deployment-charts - https://phabricator.wikimedia.org/T413433#11504126 (10Dzahn) [15:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on kafka-main1008 - https://phabricator.wikimedia.org/T414101#11504129 (10Jclark-ctr) Opened Service request 221019443 [15:29:58] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1530) [15:30:11] 10ops-eqiad, 06SRE, 06DC-Ops: Failed Power supply on kafka-main1008 - https://phabricator.wikimedia.org/T414101#11504133 (10Reedy) [15:31:39] (03PS2) 10Fabfur: cache::text: enable unid, browser ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545) [15:31:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:33:19] (03CR) 10Marostegui: "Ideally we should change both. 
In any case, if this cookbook will be de facto cookbook, then we probably should just make changes here and" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [15:34:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:28] (03CR) 10Vgutierrez: [C:03+1] cache::text: enable unid, browser ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:37:32] (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1224622 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:40:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T413525)', diff saved to https://phabricator.wikimedia.org/P86854 and previous config saved to /var/cache/conftool/dbconfig/20260108-154013-marostegui.json [15:40:17] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:41:23] (03PS2) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) [15:41:24] (03CR) 10Filippo Giunchedi: [C:03+1] Rename puppetmaster::gitsync and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224689 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:41:51] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, we can restore this at any time" [puppet] - 10https://gerrit.wikimedia.org/r/1224696 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:42:34] (03CR) 10Fabfur: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:43:00] (03CR) 10Muehlenhoff: [C:03+2] pontoon: Cleanup dead projects [puppet] - 10https://gerrit.wikimedia.org/r/1224696 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [15:43:29] (03CR) 10Elukey: [C:03+1] deploy: Add redioscope kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224643 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [15:43:47] (03CR) 10Elukey: [C:03+1] aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [15:44:22] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie [15:44:23] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie [15:48:04] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [15:49:32] 06SRE, 10SRE-Access-Requests, 06Release-Engineering-Team, 13Patch-For-Review: Add yubikey ssh key for dancy - https://phabricator.wikimedia.org/T414032#11504220 (10dancy) [15:50:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P86855 and previous config saved to /var/cache/conftool/dbconfig/20260108-155021-marostegui.json [15:50:51] (03PS1) 10Muehlenhoff: durum: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224704 (https://phabricator.wikimedia.org/T413740) [15:51:20] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [15:54:05] (03CR) 10Jakob: [C:03+1] "LGTM, thanks!" 
[dumps] - 10https://gerrit.wikimedia.org/r/1224110 (https://phabricator.wikimedia.org/T413869) (owner: 10Silvan Heintze) [15:54:22] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [15:55:10] (03CR) 10Hnowlan: [C:03+1] aux-k8s: Add redioscope namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224644 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [15:55:25] (03CR) 10Hnowlan: [C:03+1] deploy: Add redioscope kubeconfig to aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224643 (https://phabricator.wikimedia.org/T413999) (owner: 10Clément Goubert) [15:55:59] (03PS10) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [15:57:50] (03CR) 10CI reject: [V:04-1] docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [16:00:05] dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Train log triage . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1600). 
[16:00:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P86856 and previous config saved to /var/cache/conftool/dbconfig/20260108-160029-marostegui.json [16:00:44] (03PS11) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [16:02:31] (03CR) 10CI reject: [V:04-1] docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [16:03:30] (03PS1) 10Muehlenhoff: wikidough: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224708 (https://phabricator.wikimedia.org/T413740) [16:03:51] (03PS12) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [16:05:47] (03CR) 10Vgutierrez: [C:04-1] "you need to split this... 
first set enable it in esams and in a following commit unify it" [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:06:21] (03PS1) 10Muehlenhoff: hcaptcha proxy: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224709 (https://phabricator.wikimedia.org/T413740) [16:08:40] (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1224704/7864/durum1001.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/1224704 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [16:08:58] (03CR) 10Ssingh: [C:03+1] wikidough: Enable Bird 2.18 for all servers [puppet] - 10https://gerrit.wikimedia.org/r/1224708 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [16:10:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T413525)', diff saved to https://phabricator.wikimedia.org/P86857 and previous config saved to /var/cache/conftool/dbconfig/20260108-161038-marostegui.json [16:10:42] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:11:48] (03PS2) 10Gehel: chore(elasticsearch): cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) [16:11:48] (03PS1) 10Gehel: chore(elasticsearch): remove references to elasticsearch for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/1224712 (https://phabricator.wikimedia.org/T388607) [16:11:49] (03PS1) 10Gehel: chore(elasticsearch): cloudelastic1001-1004 have been decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1224713 (https://phabricator.wikimedia.org/T388607) [16:11:51] (03PS1) 10Gehel: chore(elasticsearch): remove references to elasticsearch for cluster config [puppet] - 10https://gerrit.wikimedia.org/r/1224714 (https://phabricator.wikimedia.org/T388607) [16:12:30] (03CR) 10CI reject: [V:04-1] 
chore(elasticsearch): cleanup unused hiera regexes [puppet] - 10https://gerrit.wikimedia.org/r/1224697 (https://phabricator.wikimedia.org/T388607) (owner: 10Gehel) [16:12:45] (03PS3) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) [16:13:24] 06SRE, 10Data Pipelines, 06Data-Engineering: Unrecognised file under /srv/deployment-charts - https://phabricator.wikimedia.org/T413433#11504313 (10JMeybohm) 05Open→03Resolved a:03JMeybohm I've moved the file out of the way to `/root/See_T413433` in case someone lost a session. [16:14:12] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie [16:14:37] (03PS4) 10Fabfur: cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) [16:14:47] (03PS1) 10Muehlenhoff: Record LDAP access for aghirelli [puppet] - 10https://gerrit.wikimedia.org/r/1224715 [16:15:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:18:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1159.eqiad.wmnet with reason: Maintenance [16:18:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T413525)', diff saved to https://phabricator.wikimedia.org/P86858 and previous config saved to /var/cache/conftool/dbconfig/20260108-161848-marostegui.json [16:18:52] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:21:58] (03CR) 10Elukey: "I think we are in a good place, let's wait for some other review from ServiceOps. 
We could tentatively deploy this on Monday :)" [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [16:23:29] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for aghirelli [puppet] - 10https://gerrit.wikimedia.org/r/1224715 (owner: 10Muehlenhoff) [16:24:28] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for aghirelli - https://phabricator.wikimedia.org/T414102#11504358 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Access was granted via Wikimedia IDM. [16:29:18] (03CR) 10Muehlenhoff: sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [16:29:53] (03PS1) 10C. Scott Ananian: Increase PRV percentage on fawiki/kowiki/azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108) [16:30:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11504408 (10JMeybohm) >>! In T413364#11483999, @cmooney wrote: >> Do you currently have shell access (Yes/No): Not sure - how can I check? > > Looking at our... 
[16:31:26] (03CR) 10Vgutierrez: [C:03+1] cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:31:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T413525)', diff saved to https://phabricator.wikimedia.org/P86860 and previous config saved to /var/cache/conftool/dbconfig/20260108-163131-marostegui.json [16:31:35] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:31:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108) (owner: 10C. Scott Ananian) [16:32:00] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11504421 (10JMeybohm) @KFrancis could you please confirm NDA status? [16:32:25] (03CR) 10Fabfur: [C:03+2] cache::text: enable auth, known, bot ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224624 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:32:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224169 (https://phabricator.wikimedia.org/T414019) (owner: 10C. 
Scott Ananian) [16:32:50] (03PS1) 10AikoChou: ml-services: Update image for revise-tone-task-generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224721 (https://phabricator.wikimedia.org/T412210) [16:34:43] (03CR) 10Muehlenhoff: [C:03+1] sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [16:35:01] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for aghirelli - https://phabricator.wikimedia.org/T414102#11504512 (10JMeybohm) Access to the wmf group needs to be requested [[ https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access#Using_the_Wikimedia_Identity_Management_System | Using_the_W... [16:35:38] (03CR) 10Ladsgroup: [C:04-1] "Annoyingly it'll break MediaSearch in Commons. https://codesearch.wmcloud.org/search/?q=sdmsThumbRenderMap&files=&excludeFiles=&repos= It " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon) [16:36:42] (03CR) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [16:37:02] andrew@cumin2002 reimage (PID 2547254) is awaiting input [16:37:40] (03CR) 10BryanDavis: "Cause of T414111 in Beta Cluster where the /usr/share/GeoIP/proxy.mmdb file does not exist." [puppet] - 10https://gerrit.wikimedia.org/r/1219882 (owner: 10Slyngshede) [16:37:52] andrew@cumin2002 reimage (PID 2547204) is awaiting input [16:38:15] (03PS1) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) [16:39:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11504536 (10Clement_Goubert) >>!
In T408752#11502255, @Jclark-ctr wrote: > @Clement_Goubert Before I start racking these, do you want to verify that they’re correct by row, s... [16:39:09] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:12] (03CR) 10Muehlenhoff: [C:03+1] sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [16:41:11] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11504543 (10JMeybohm) #release-engineering-team: Could you help with removing +2 ? [16:41:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P86861 and previous config saved to /var/cache/conftool/dbconfig/20260108-164140-marostegui.json [16:41:42] (03Abandoned) 10Fabfur: cache::text: enable unid, browser ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224626 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:42:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11504547 (10JMeybohm) @KFrancis could you please confirm NDA status? 
[16:43:53] (03CR) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [16:44:09] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:30] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11504555 (10JMeybohm) [16:45:06] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:45:58] (03PS1) 10Fabfur: cache::text: enable unid, browser ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) [16:49:09] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:50:00] (03PS1) 10Fabfur: cache::text: cleanup rate_limiting_flags [puppet] - 10https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) [16:51:30] (03CR) 10Fabfur: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:51:46] !incidents [16:51:47] 7297 (UNACKED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [16:51:47] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [16:51:47] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [16:51:48] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [16:51:48] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [16:51:48] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [16:51:48] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [16:51:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P86862 and previous config saved to /var/cache/conftool/dbconfig/20260108-165148-marostegui.json [16:51:49] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [16:51:49] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [16:51:50] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [16:51:50] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [16:51:57] !ack 7297 [16:51:58] 7260 (RESOLVED) payments2006/check_mysql [16:52:22] !incidents [16:52:23] 7297 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [16:52:23] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [16:52:23] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [16:52:23] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [16:52:24] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet 
eqiad) [16:52:24] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [16:52:24] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [16:52:24] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [16:52:25] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [16:52:25] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [16:52:26] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [16:54:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:55:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86863 and previous config saved to /var/cache/conftool/dbconfig/20260108-165501-marostegui.json [16:55:06] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:55:07] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [16:55:40] <_joe_> !ack [16:55:40] no value provided for parameter incident and no default available [16:55:41] Incident id must be an integer [16:55:54] <_joe_> uhm rzl ^^ not working apparently [16:55:58] <_joe_> !incidents [16:55:58] 7297 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [16:55:59] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [16:55:59] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [16:55:59] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [16:55:59] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload 
sre (swift.discovery.wmnet eqiad) [16:55:59] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [16:56:00] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [16:56:00] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [16:56:00] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [16:56:01] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [16:56:01] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [16:57:25] _joe_: I think that's what rzl's not in corto from last night refers to [16:57:31] s/not/note/ [16:59:09] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:06] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:01:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T413525)', diff saved to https://phabricator.wikimedia.org/P86864 and previous config saved to /var/cache/conftool/dbconfig/20260108-170156-marostegui.json [17:02:00] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:02:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [17:02:17] !incidents [17:02:18] 7297 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [17:02:18] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [17:02:18] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:02:18] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [17:02:18] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:02:19] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [17:02:19] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [17:02:19] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [17:02:19] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [17:02:20] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [17:02:20] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [17:02:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:02:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T413525)', diff saved to https://phabricator.wikimedia.org/P86865 and previous config saved to /var/cache/conftool/dbconfig/20260108-170241-marostegui.json [17:05:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P86866 and previous config saved to /var/cache/conftool/dbconfig/20260108-170509-marostegui.json [17:10:49] _joe_: yeah sorry, two output issues -- one is because I kept a pointer to the loop variable (:facepalm:) and the other is there should be a better error message when everything is already acked [17:10:59] looking at both this morning [17:11:03] <_joe_> <3 [17:12:42] what's not obvious until you look at the timeline is that the new page at 16:54 was another alert that went into 7297, so it was already acked -- which is why that's not an uncommon situation [17:15:06] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:15:06] (03PS1) 10Bking: DO NOT MERGE: test blackbox integration for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1224738 [17:15:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking) [17:15:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T413525)', diff saved to https://phabricator.wikimedia.org/P86868 and previous config saved to /var/cache/conftool/dbconfig/20260108-171517-marostegui.json [17:15:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved 
to https://phabricator.wikimedia.org/P86867 and previous config saved to /var/cache/conftool/dbconfig/20260108-171517-marostegui.json [17:15:22] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:17:47] !incidents [17:17:48] 7297 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [17:17:48] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [17:17:48] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:17:48] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [17:17:48] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:17:49] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [17:17:49] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [17:17:49] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [17:17:50] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [17:17:50] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [17:17:51] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [17:20:06] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:20:19] (03CR) 10JHathaway: [C:03+1] git::clone: Get default branch name a different way [puppet] - 10https://gerrit.wikimedia.org/r/1219907 (https://phabricator.wikimedia.org/T413193) (owner: 10Ahmon Dancy) [17:21:20] (03PS2) 10Clément Goubert: wmnet: Add redioscope CNAMES [dns] - 10https://gerrit.wikimedia.org/r/1224652 
(https://phabricator.wikimedia.org/T413999) [17:21:29] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [17:21:45] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [17:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:24:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:25:06] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:25:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P86869 and previous config saved to /var/cache/conftool/dbconfig/20260108-172526-marostegui.json [17:25:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86870 and previous config saved to /var/cache/conftool/dbconfig/20260108-172526-marostegui.json [17:25:33] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [17:25:33] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:25:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2155.codfw.wmnet 
with reason: Maintenance [17:25:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86871 and previous config saved to /var/cache/conftool/dbconfig/20260108-172551-marostegui.json [17:26:26] hey folks! I have a somewhat urgent request. Wiki Education Dashboard is suddenly getting 429 errors for OAuth login. I created an issue for it here: https://phabricator.wikimedia.org/T414114 [17:26:57] can we get rate-limits lifted for those IPs? [17:27:34] @fabfur @_joe_ ^ [17:27:56] ragesoss: we are responding to an incident right now, but we will take a look shortly [17:27:59] thanks [17:28:18] thanks! [17:29:35] jouncebot: nowandnext [17:29:35] For the next 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1700) [17:29:35] In 0 hour(s) and 30 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1800) [17:29:35] In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1800) [17:29:59] (03PS2) 10MVernon: Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) [17:31:09] !log restarted wmf_auto_restart_prometheus-mysqld-exporter.service @ db2231 [17:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:11] (03CR) 10CI reject: [V:04-1] Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon) [17:31:15] (03PS3) 10Ladsgroup: Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon) 
[17:31:19] (03CR) 10Ladsgroup: [C:03+2] Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon) [17:31:28] (03CR) 10MVernon: "ACK, here's that change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon) [17:31:43] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup2003.codfw.wmnet with OS trixie [17:32:15] (03Merged) 10jenkins-bot: Only generate 120,250 thumbs (temporary) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224071 (https://phabricator.wikimedia.org/T408062) (owner: 10MVernon) [17:32:33] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie [17:32:42] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11504733 (10ssingh) [17:32:59] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1224071|Only generate 120,250 thumbs (temporary) (T408062 T412971)]] [17:33:03] T408062: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062 [17:33:03] T412971: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971 [17:34:09] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:35:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P86872 and previous config saved to /var/cache/conftool/dbconfig/20260108-173534-marostegui.json [17:35:41] !log ladsgroup@deploy2002 mvernon, 
ladsgroup: Backport for [[gerrit:1224071|Only generate 120,250 thumbs (temporary) (T408062 T412971)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:37:42] Is anyone from SRE around for the puppet request window? I have a simple patch but didn't get round to adding it before the window [17:38:39] Tchanders: we're dealing with an incident rn, I'd postpone this unless is super urgent [17:38:49] np - thanks [17:39:05] Tchanders: I am about to leave for the day, but if you add me as reviewer I can have a look at it tomorrow [17:39:34] Tchanders: Jcrespo [17:39:36] jynus: Thank you - done [17:39:59] (assuming it is a trivial generic SRE one, if not I will direct you to the expert) [17:41:46] yeah, thats something that I will be able to take care [17:42:17] let me know on a comment if it is something that can be deployed any time or you want to be around [17:44:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:45:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T413525)', diff saved to https://phabricator.wikimedia.org/P86873 and previous config saved to /var/cache/conftool/dbconfig/20260108-174542-marostegui.json [17:45:46] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:45:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [17:46:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T413525)', diff saved to https://phabricator.wikimedia.org/P86874 and previous config saved to /var/cache/conftool/dbconfig/20260108-174606-marostegui.json [17:46:16] !log ladsgroup@deploy2002 mvernon, ladsgroup: 
Continuing with sync [17:49:50] !incidents [17:49:51] 7297 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [17:49:51] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [17:49:51] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:49:51] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [17:49:52] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:49:52] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [17:49:52] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [17:49:52] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [17:49:53] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [17:49:53] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [17:49:54] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [17:50:32] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224071|Only generate 120,250 thumbs (temporary) (T408062 T412971)]] (duration: 17m 34s) [17:50:37] T408062: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062 [17:50:37] T412971: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971 [17:50:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:51:08] 06SRE: add avishua stein to acl*procurement-review - https://phabricator.wikimedia.org/T414115#11504796 (10Zabe) [17:54:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in 
eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:56:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T413525)', diff saved to https://phabricator.wikimedia.org/P86875 and previous config saved to /var/cache/conftool/dbconfig/20260108-175602-marostegui.json [17:56:06] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:00:05] bd808: May I have your attention please! Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1800) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1800) [18:00:35] o/ I will be updating developer-portal today. [18:02:26] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-12-29-122831-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224764 [18:04:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:06:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission pay-lvs1003.frack.eqiad.wmnet and pay-lvs1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T413986#11504837 (10Jclark-ctr) a:03Jclark-ctr [18:06:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P86876 and previous config saved to /var/cache/conftool/dbconfig/20260108-180611-marostegui.json 
[18:06:12] (03CR) 10BryanDavis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224764 (owner: 10BryanDavis) [18:08:46] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11504853 (10Joe) @Ragesoss what is the User-Agent you use when making those requests? [18:08:58] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-12-29-122831-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224764 (owner: 10BryanDavis) [18:09:11] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:10:47] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-12-29-122831-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224764 (owner: 10BryanDavis) [18:11:30] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:11:53] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:12:02] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:12:39] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:14:27] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:15:03] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:15:42] <_joe_> ragesoss: replied on-task [18:15:44] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11504886 (10Joe) @Ragesoss as far as I can tell, the problem is you are not honoring the wikimedia User-Agent policy, and we have recently started to enforce stricter rat... 
[18:16:14] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11504890 (10ssingh) Hi @Ragesoss: We looked through the logs and it seems like requests originating from your end are not respecting our UA policy, documented at https://... [18:16:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P86877 and previous config saved to /var/cache/conftool/dbconfig/20260108-181619-marostegui.json [18:16:21] I'm done with my deploy window now. [18:22:18] (03CR) 10Ssingh: [C:03+1] "Reviewing only from the point of view of the commit and the changed hiera, since I don't have the full context :)" [puppet] - 10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:24:06] andrew@cumin2002 reimage (PID 2602474) is awaiting input [18:24:09] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:25:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:25:58] (03CR) 10Ssingh: [C:03+1] "(Basing this off the previous commit, with the same caveat.)" [puppet] - 10https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:26:08] PROBLEM - Host cp7014 is DOWN: CRITICAL - Time to live exceeded (10.140.1.10) [18:26:14] wow ok, that's new [18:26:28] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T413525)', diff saved to https://phabricator.wikimedia.org/P86878 and previous config saved to /var/cache/conftool/dbconfig/20260108-182627-marostegui.json [18:26:29] <_joe_> sigh [18:26:30] RECOVERY - Host cp7014 is UP: PING OK - Packet loss = 0%, RTA = 137.56 ms [18:26:31] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:26:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [18:26:41] no it's not actually, it's a monitoring thing [18:26:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T413525)', diff saved to https://phabricator.wikimedia.org/P86879 and previous config saved to /var/cache/conftool/dbconfig/20260108-182641-marostegui.json [18:27:22] <_joe_> !ack [18:27:23] 7260 (RESOLVED) payments2006/check_mysql [18:27:29] <_joe_> uhh [18:27:31] !incidents [18:27:32] 7298 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [18:27:32] 7297 (RESOLVED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [18:27:32] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [18:27:32] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [18:27:32] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [18:27:33] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [18:27:33] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:27:33] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [18:27:34] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [18:27:34] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:27:35] 7288 
(RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [18:27:35] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [18:27:40] already ACKed [18:27:44] <_joe_> ah already acked [18:33:49] (03CR) 10Fabfur: [C:03+2] cache::text: enable unid, browser ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:34:22] PROBLEM - Confd vcl based reload on cp2042 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:26] PROBLEM - Confd vcl based reload on cp6006 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:26] PROBLEM - Confd vcl based reload on cp6008 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:26] PROBLEM - Confd vcl based reload on cp6004 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:26] PROBLEM - Confd vcl based reload on cp6002 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:26] PROBLEM - Confd vcl based reload on cp6007 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:26] PROBLEM - Confd vcl based reload on cp6005 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:28] PROBLEM - Confd vcl based reload on cp7012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:28] PROBLEM - Confd vcl based reload on cp7009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:28] PROBLEM - Confd vcl based reload on cp7013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:34:28] PROBLEM - Confd vcl based reload on cp1105 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:28] PROBLEM - Confd vcl based reload on cp1101 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:30] PROBLEM - Confd vcl based reload on cp7011 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:32] PROBLEM - Confd vcl based reload on cp7010 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:32] PROBLEM - Confd vcl based reload on cp7016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:32] PROBLEM - Confd vcl based reload on cp7015 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:32] PROBLEM - Confd vcl based reload on cp7014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:35] oh boy [18:34:40] PROBLEM - Confd vcl based reload on cp5029 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:40] PROBLEM - Confd vcl based reload on cp5032 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:40] PROBLEM - Confd vcl based reload on cp5028 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:40] PROBLEM - Confd vcl based reload on cp5025 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:40] PROBLEM - Confd vcl based reload on cp1113 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:34:40] PROBLEM - Confd vcl based reload on cp1115 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:40] PROBLEM - Confd vcl based reload on cp1107 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:41] PROBLEM - Confd vcl based reload on cp1111 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:41] PROBLEM - Confd vcl based reload on cp1103 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:42] PROBLEM - Confd vcl based reload on cp1109 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:42] PROBLEM - Confd vcl based reload on cp5027 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:43] PROBLEM - Confd vcl based reload on cp5026 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:43] PROBLEM - Confd vcl based reload on cp5031 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:44] I am guessing this is the classic reload race condition at play [18:34:44] PROBLEM - Confd vcl based reload on cp5030 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:44] PROBLEM - Confd vcl based reload on cp2028 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:45] PROBLEM - Confd vcl based reload on cp2030 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:45] PROBLEM - Confd vcl based reload on cp2034 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:34:46] PROBLEM - Confd vcl based reload on cp2032 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:48] PROBLEM - Confd vcl based reload on cp2036 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:50] PROBLEM - Confd vcl based reload on cp2038 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:50] PROBLEM - Confd vcl based reload on cp4051 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:50] PROBLEM - Confd vcl based reload on cp4047 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:52] PROBLEM - Confd vcl based reload on cp4052 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:52] PROBLEM - Confd vcl based reload on cp4048 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:52] PROBLEM - Confd vcl based reload on cp4046 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:52] PROBLEM - Confd vcl based reload on cp4045 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:54] PROBLEM - Confd vcl based reload on cp3076 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:54] PROBLEM - Confd vcl based reload on cp3078 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:54] PROBLEM - Confd vcl based reload on cp3081 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:34:54] PROBLEM - Confd vcl based reload on cp3075 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:54] PROBLEM - Confd vcl based reload on cp3074 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:54] PROBLEM - Confd vcl based reload on cp3077 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:54] PROBLEM - Confd vcl based reload on cp4050 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:55] PROBLEM - Confd vcl based reload on cp4049 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:55] PROBLEM - Confd vcl based reload on cp3080 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:56] PROBLEM - Confd vcl based reload on cp3079 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:56] PROBLEM - Confd vcl based reload on cp6001 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:57] PROBLEM - Confd vcl based reload on cp6003 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [18:34:57] PROBLEM - Confd vcl based reload on cp2040 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [18:36:16] I've just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224725 but don't think it's the cause as it really just happened [18:36:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T413525)', diff saved to https://phabricator.wikimedia.org/P86880 and previous config saved to /var/cache/conftool/dbconfig/20260108-183637-marostegui.json [18:36:41] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:40:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:41:22] RECOVERY - Confd vcl based reload on cp2042 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:26] RECOVERY - Confd vcl based reload on cp6008 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:26] RECOVERY - Confd vcl based reload on cp6004 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:26] RECOVERY - Confd vcl based reload on cp6002 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:26] RECOVERY - Confd vcl based reload on cp6006 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:26] RECOVERY - Confd vcl based reload on cp6007 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:41:26] RECOVERY - Confd vcl based reload on cp6005 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:28] RECOVERY - Confd vcl based reload on cp7012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:28] RECOVERY - Confd vcl based reload on cp7009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:28] RECOVERY - Confd vcl based reload on cp7013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:28] RECOVERY - Confd vcl based reload on cp1105 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:28] RECOVERY - Confd vcl based reload on cp1101 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:30] RECOVERY - Confd vcl based reload on cp7011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:32] RECOVERY - Confd vcl based reload on cp7016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:32] RECOVERY - Confd vcl based reload on cp7010 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:32] RECOVERY - Confd vcl based reload on cp7015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:32] RECOVERY - Confd vcl based reload on cp7014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:40] RECOVERY - Confd vcl based reload on cp5025 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:40] RECOVERY - Confd vcl based reload on cp5028 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:41:40] RECOVERY - Confd vcl based reload on cp5032 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:40] RECOVERY - Confd vcl based reload on cp5029 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:40] RECOVERY - Confd vcl based reload on cp1113 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:40] RECOVERY - Confd vcl based reload on cp1115 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:40] RECOVERY - Confd vcl based reload on cp1111 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:41] RECOVERY - Confd vcl based reload on cp1103 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:41] RECOVERY - Confd vcl based reload on cp1107 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:42] RECOVERY - Confd vcl based reload on cp5026 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:42] RECOVERY - Confd vcl based reload on cp1109 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:43] RECOVERY - Confd vcl based reload on cp5031 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:43] RECOVERY - Confd vcl based reload on cp5030 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:44] RECOVERY - Confd vcl based reload on cp5027 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:44] RECOVERY - Confd vcl based reload on cp2030 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:41:45] RECOVERY - Confd vcl based reload on cp2028 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:45] RECOVERY - Confd vcl based reload on cp2034 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:46] RECOVERY - Confd vcl based reload on cp2032 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:48] RECOVERY - Confd vcl based reload on cp2036 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:50] RECOVERY - Confd vcl based reload on cp2038 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:50] RECOVERY - Confd vcl based reload on cp4047 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:50] RECOVERY - Confd vcl based reload on cp4051 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:52] RECOVERY - Confd vcl based reload on cp4052 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:54] RECOVERY - Confd vcl based reload on cp4045 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:54] RECOVERY - Confd vcl based reload on cp4046 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:54] RECOVERY - Confd vcl based reload on cp4048 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:54] RECOVERY - Confd vcl based reload on cp3074 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:54] RECOVERY - Confd vcl based reload on cp3081 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:41:54] RECOVERY - Confd vcl based reload on cp3078 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:54] RECOVERY - Confd vcl based reload on cp3077 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:55] RECOVERY - Confd vcl based reload on cp3076 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:55] RECOVERY - Confd vcl based reload on cp3075 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:56] RECOVERY - Confd vcl based reload on cp4049 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:56] RECOVERY - Confd vcl based reload on cp4050 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:57] RECOVERY - Confd vcl based reload on cp3080 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:57] RECOVERY - Confd vcl based reload on cp3079 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:58] RECOVERY - Confd vcl based reload on cp6001 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:58] RECOVERY - Confd vcl based reload on cp6003 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:41:59] RECOVERY - Confd vcl based reload on cp2040 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:45:14] (03CR) 10Fabfur: [C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:46:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P86881 and previous config saved to /var/cache/conftool/dbconfig/20260108-184645-marostegui.json [18:46:59] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224725 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:47:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:53:42] (03CR) 10Fabfur: [C:03+2] cache::text: cleanup rate_limiting_flags [puppet] - 10https://gerrit.wikimedia.org/r/1224727 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:56:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P86882 and previous config saved to /var/cache/conftool/dbconfig/20260108-185654-marostegui.json [18:58:49] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11505043 (10Ragesoss) Thanks! Unfortunately, the OAuth library we use doesn't support setting the User Agent, so I'm going to have to figure out how to monkey patch it. :-( [19:00:04] dduvall and dancy: That opportune time for a MediaWiki train - Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T1900). 
[19:03:41] (03CR) 10Scott French: [C:03+1] haproxy: proxy mmdb: all 🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1224168 (owner: 10CDanis) [19:06:17] o/ just zeroing my brain on the current error logs and then rolling train [19:07:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T413525)', diff saved to https://phabricator.wikimedia.org/P86883 and previous config saved to /var/cache/conftool/dbconfig/20260108-190702-marostegui.json [19:07:06] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:07:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance [19:07:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86884 and previous config saved to /var/cache/conftool/dbconfig/20260108-190727-marostegui.json [19:07:52] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224777 (https://phabricator.wikimedia.org/T408280) [19:07:54] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224777 (https://phabricator.wikimedia.org/T408280) (owner: 10TrainBranchBot) [19:08:42] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224777 (https://phabricator.wikimedia.org/T408280) (owner: 10TrainBranchBot) [19:16:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86885 and previous config saved to /var/cache/conftool/dbconfig/20260108-191624-marostegui.json [19:16:28] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:20:09] (03PS1) 10Ebenezer Rao: fixed typo of word initial in the 
WmfConfig.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224781 (https://phabricator.wikimedia.org/T201491) [19:24:37] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.10 refs T408280 [19:24:41] T408280: 1.46.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T408280 [19:26:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P86886 and previous config saved to /var/cache/conftool/dbconfig/20260108-192633-marostegui.json [19:28:33] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224781 (https://phabricator.wikimedia.org/T201491) (owner: 10Ebenezer Rao) [19:30:33] (03PS1) 10Ebenezer Rao: fixed typo of the word initial in the srllogin file [puppet] - 10https://gerrit.wikimedia.org/r/1224784 (https://phabricator.wikimedia.org/T201491) [19:33:17] (03CR) 10CDanis: [C:03+2] haproxy: proxy mmdb: all 🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1224168 (owner: 10CDanis) [19:34:38] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:36:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P86887 and previous config saved to /var/cache/conftool/dbconfig/20260108-193641-marostegui.json [19:37:07] cccccbbneubnidfrkfhfdlejlcrunfbjfhbldtdrbbbl [19:38:04] * dduvall is good at yubikey [19:39:00] dduvall: may I deploy the fix for https://phabricator.wikimedia.org/T414077? [19:39:03] (03PS1) 10Ebenezer Rao: fixed typo of the word initial in swiftcleanermanager [software] - 10https://gerrit.wikimedia.org/r/1224788 (https://phabricator.wikimedia.org/T201491) [19:39:30] zabe: yes, please do. 
train looks ok [19:40:11] (03CR) 10Zabe: [C:03+2] MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224679 (https://phabricator.wikimedia.org/T414077) (owner: 10Zabe) [19:40:15] i missed that as a new blocker. sorry about that [19:40:22] Alright, no worries [19:40:36] It's Commons which is mostly affected by this anyway [19:40:41] right [19:41:11] (03CR) 10Zabe: [C:03+2] fixed typo of word initial in the WmfConfig.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224781 (https://phabricator.wikimedia.org/T201491) (owner: 10Ebenezer Rao) [19:41:13] (03CR) 10Zabe: [C:03+2] Enable phan on more php files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223289 (owner: 10Zabe) [19:41:57] (03CR) 10Pppery: "This is still spelled wrong." [software] - 10https://gerrit.wikimedia.org/r/1224788 (https://phabricator.wikimedia.org/T201491) (owner: 10Ebenezer Rao) [19:41:58] (03Merged) 10jenkins-bot: fixed typo of word initial in the WmfConfig.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224781 (https://phabricator.wikimedia.org/T201491) (owner: 10Ebenezer Rao) [19:42:00] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup2003.codfw.wmnet with OS trixie [19:42:04] (03Merged) 10jenkins-bot: Enable phan on more php files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223289 (owner: 10Zabe) [19:42:04] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1003.eqiad.wmnet with OS trixie [19:42:47] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1224781|fixed typo of word initial in the WmfConfig.php file (T201491)]], [[gerrit:1223289|Enable phan on more php files]] [19:42:50] T201491: Fix common typos in code - https://phabricator.wikimedia.org/T201491 [19:44:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 08 UTC late
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223283 (https://phabricator.wikimedia.org/T413108) (owner: 10Arlolra) [19:44:45] !log zabe@deploy2002 zabe, ebenezerrao: Backport for [[gerrit:1224781|fixed typo of word initial in the WmfConfig.php file (T201491)]], [[gerrit:1223289|Enable phan on more php files]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:46:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86888 and previous config saved to /var/cache/conftool/dbconfig/20260108-194649-marostegui.json [19:46:51] !log zabe@deploy2002 zabe, ebenezerrao: Continuing with sync [19:46:53] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:47:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [19:47:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1230.eqiad.wmnet with reason: Maintenance [19:47:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1230 (T413525)', diff saved to https://phabricator.wikimedia.org/P86889 and previous config saved to /var/cache/conftool/dbconfig/20260108-194731-marostegui.json [19:50:56] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224781|fixed typo of word initial in the WmfConfig.php file (T201491)]], [[gerrit:1223289|Enable phan on more php files]] (duration: 08m 09s) [19:51:00] T201491: Fix common typos in code - https://phabricator.wikimedia.org/T201491 [19:52:45] (03Merged) 10jenkins-bot: MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array [core] (wmf/1.46.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1224679 
(https://phabricator.wikimedia.org/T414077) (owner: 10Zabe) [19:53:12] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1224679|MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array (T414077)]] [19:53:15] T414077: WAV files being uploaded with wrong MIME type - https://phabricator.wikimedia.org/T414077 [19:55:07] !log zabe@deploy2002 zabe: Backport for [[gerrit:1224679|MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array (T414077)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:55:57] (03PS1) 10Tbodt: Add MultiTitle to extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224793 (https://phabricator.wikimedia.org/T404461) [19:55:59] (03PS1) 10Tbodt: Add config variable for MultiTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224794 [19:55:59] (03PS1) 10Tbodt: Enable MultiTitle on beta cluster testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224795 (https://phabricator.wikimedia.org/T404461) [19:56:01] (03PS1) 10Tbodt: Load MultiTitle on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224796 (https://phabricator.wikimedia.org/T404461) [19:56:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T413525)', diff saved to https://phabricator.wikimedia.org/P86891 and previous config saved to /var/cache/conftool/dbconfig/20260108-195627-marostegui.json [19:56:31] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:56:45] (03PS2) 10Tbodt: Add config variable for MultiTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224794 (https://phabricator.wikimedia.org/T404461) [19:56:47] (03PS2) 10Tbodt: Enable MultiTitle on beta cluster testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224795 (https://phabricator.wikimedia.org/T404461) [19:56:47] (03PS2) 10Tbodt: Load MultiTitle on beta cluster 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224796 (https://phabricator.wikimedia.org/T404461) [20:04:41] !log zabe@deploy2002 zabe: Continuing with sync [20:06:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P86892 and previous config saved to /var/cache/conftool/dbconfig/20260108-200635-marostegui.json [20:08:43] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224679|MimeAnalyzer: Fix syntax error in MAJOR_MIME_TYPES array (T414077)]] (duration: 15m 31s) [20:08:46] T414077: WAV files being uploaded with wrong MIME type - https://phabricator.wikimedia.org/T414077 [20:08:57] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11505355 (10KFrancis) Hi @JMeybohm, it doesn't look like we have an NDA on file for Martyn Ranyard. Would you please provide their email address? [20:09:14] (03PS1) 10Andrew Bogott: cloudbackups: update partman recipes. [puppet] - 10https://gerrit.wikimedia.org/r/1224798 (https://phabricator.wikimedia.org/T375217) [20:11:37] (03CR) 10Andrew Bogott: [C:03+2] cloudbackups: update partman recipes. 
[puppet] - 10https://gerrit.wikimedia.org/r/1224798 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott) [20:14:38] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:15:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:16:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P86893 and previous config saved to /var/cache/conftool/dbconfig/20260108-201643-marostegui.json [20:16:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:17:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie [20:17:16] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie [20:17:38] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:18:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:19:58] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad Pdu tripped breaker on ps1-c3-eqiad no automated allerts - 
https://phabricator.wikimedia.org/T414134#11505404 (10Jclark-ctr) [20:21:01] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11505413 (10jhathaway) >>! In T367399#11502051, @hashar wrote: > Someth... [20:24:59] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup2003.codfw.wmnet with OS trixie [20:25:03] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup1003.eqiad.wmnet with OS trixie [20:26:32] !log zabe@deploy2002:~$ mwscript refreshImageMetadata.php commonswiki --mediatype AUDIO --mime unknown/wav --force # T414077 [20:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:36] T414077: WAV files being uploaded with wrong MIME type - https://phabricator.wikimedia.org/T414077 [20:26:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T413525)', diff saved to https://phabricator.wikimedia.org/P86894 and previous config saved to /var/cache/conftool/dbconfig/20260108-202652-marostegui.json [20:26:55] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:27:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance [20:27:11] (03PS1) 10Pppery: Igwiki: add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) [20:27:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [20:27:54] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie [20:27:55] !log andrew@cumin2002 START 
- Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie [20:28:00] (03CR) 10CI reject: [V:04-1] Igwiki: add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) (owner: 10Pppery) [20:30:37] (03PS1) 10CDanis: ip_reputation: tweak interval [puppet] - 10https://gerrit.wikimedia.org/r/1224800 [20:31:01] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224800 (owner: 10CDanis) [20:32:12] (03PS1) 10Ebenezer Rao: fixed typo of the word initial in the test_init.py [software/cumin] - 10https://gerrit.wikimedia.org/r/1224801 (https://phabricator.wikimedia.org/T201491) [20:33:10] !log zabe@deploy2002:~$ foreachwiki refreshImageMetadata.php --mediatype AUDIO --mime unknown/wav --force # T414077 [20:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:13] T414077: WAV files being uploaded with wrong MIME type - https://phabricator.wikimedia.org/T414077 [20:34:41] (03CR) 10Scott French: [C:03+1] ip_reputation: tweak interval [puppet] - 10https://gerrit.wikimedia.org/r/1224800 (owner: 10CDanis) [20:35:02] !log zabe@deploy2002:~$ foreachwiki refreshImageMetadata.php --mediatype AUDIO --mime unknown/wav --force --oldimage # T414077 [20:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:10] (03CR) 10CDanis: [C:03+2] ip_reputation: tweak interval [puppet] - 10https://gerrit.wikimedia.org/r/1224800 (owner: 10CDanis) [20:35:38] (03PS2) 10Pppery: Igwiki: add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) [20:42:12] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage [20:43:31] (03CR) 10JHathaway: [C:03+1] puppet: Remove the force_puppet7 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1224605 (https://phabricator.wikimedia.org/T365798) 
(owner: 10Muehlenhoff) [20:46:37] (03PS1) 10Ebenezer Rao: fixed typo of the word initial in the zerrors_windows.go file [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1224802 (https://phabricator.wikimedia.org/T201491) [20:47:05] (03PS1) 10Andrew Bogott: cloudbackup partman: second attempt [puppet] - 10https://gerrit.wikimedia.org/r/1224803 [20:48:36] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage [20:48:48] (03PS1) 10RLazarus: httpbb: Update the expected error message for auth.wm.o page views [puppet] - 10https://gerrit.wikimedia.org/r/1224804 (https://phabricator.wikimedia.org/T414132) [20:49:36] (03CR) 10RLazarus: "Fails without, passes with:" [puppet] - 10https://gerrit.wikimedia.org/r/1224804 (https://phabricator.wikimedia.org/T414132) (owner: 10RLazarus) [20:49:45] (03CR) 10Andrew Bogott: [C:03+2] cloudbackup partman: second attempt [puppet] - 10https://gerrit.wikimedia.org/r/1224803 (owner: 10Andrew Bogott) [20:50:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) (owner: 10Pppery) [20:51:40] (03CR) 10MVernon: [C:03+1] httpbb: Update the expected error message for auth.wm.o page views [puppet] - 10https://gerrit.wikimedia.org/r/1224804 (https://phabricator.wikimedia.org/T414132) (owner: 10RLazarus) [20:52:03] (03CR) 10RLazarus: [C:03+2] httpbb: Update the expected error message for auth.wm.o page views [puppet] - 10https://gerrit.wikimedia.org/r/1224804 (https://phabricator.wikimedia.org/T414132) (owner: 10RLazarus) [20:52:26] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1003.eqiad.wmnet with OS trixie [20:52:26] !log andrew@cumin2002 END (ERROR) - Cookbook 
sre.hosts.reimage (exit_code=97) for host cloudbackup2003.codfw.wmnet with OS trixie [20:53:39] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie [21:00:04] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup1003.eqiad.wmnet with OS trixie [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T2100). [21:00:05] sbassett, JSherman, arlolra, and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] Here [21:00:12] here [21:00:16] o/ [21:00:26] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie [21:00:26] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie [21:01:23] o/ [21:01:40] lmk if anyone needs a deployer - otherwise please self-deploy at will [21:02:08] I'm not a deployer [21:02:11] ok. my config change might blow up logstash. not sure. [21:02:19] * sbassett can also deploy for anyone [21:02:32] just fixed an issue that would have caused scap to give you a spurious httpbb test failure, but please ping me if you see anything unexpected and httpbb-flavored :) [21:02:55] tx rzl [21:03:15] I’m ready to deploy my cfg change if there are no objections... 
[21:03:21] (unless it's actually an httpbb failure caused by your change, in which case, I guess do the usual thing about that) [21:03:27] (03CR) 10SBassett: [C:03+1] Set CSP Report Only mode for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224144 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett) [21:04:53] sbassett: thanks - all you [21:05:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224144 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett) [21:05:55] (03Merged) 10jenkins-bot: Set CSP Report Only mode for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224144 (https://phabricator.wikimedia.org/T291867) (owner: 10SBassett) [21:06:15] !log sbassett@deploy2002 Started scap sync-world: Backport for [[gerrit:1224144|Set CSP Report Only mode for all wikis (T291867)]] [21:07:47] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad Pdu tripped breaker on ps1-c3-eqiad no automated allerts - https://phabricator.wikimedia.org/T414134#11505510 (10Reedy) [21:07:56] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad Pdu tripped breaker on ps1-c3-eqiad no automated alerts - https://phabricator.wikimedia.org/T414134#11505511 (10Reedy) [21:08:23] !log sbassett@deploy2002 sbassett: Backport for [[gerrit:1224144|Set CSP Report Only mode for all wikis (T291867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:08:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:08:42] PROBLEM - ganeti-noded running on ganeti1024 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [21:09:42] RECOVERY - ganeti-noded running on ganeti1024 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [21:10:32] !log sbassett@deploy2002 sbassett: Continuing with sync [21:12:18] (03PS1) 10Kgraessle: When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) [21:14:36] !log sbassett@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224144|Set CSP Report Only mode for all wikis (T291867)]] (duration: 08m 20s) [21:14:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:15:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:16:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:17:00] Done with my cfg patch. That… is definitely introducing a lot more traffic to logstash but is maybe ok for now. 
[21:17:31] (03PS1) 10Kgraessle: When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) [21:17:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:17:52] sbassett: am I good to proceed? [21:18:10] yes [21:18:18] thanks! [21:18:24] JSherman: Ping me when you're done, I'll go next [21:18:37] arlolra: wilco [21:18:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217786 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [21:19:51] (03Merged) 10jenkins-bot: extension-list: Add PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217786 (https://phabricator.wikimedia.org/T412528) (owner: 10Jsn.sherman) [21:20:10] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1217786|extension-list: Add PersonalDashboard (T412528)]] [21:20:13] T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528 [21:21:19] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup1003.eqiad.wmnet with OS trixie [21:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:25:12] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup2003.codfw.wmnet with OS trixie [21:25:27] !log andrew@cumin2002 START 
- Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2003'] [21:27:32] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup2003'] [21:27:58] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudbackup2003'] [21:28:08] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup1003'] [21:34:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:35:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudbackup2003'] [21:41:17] made it through the image registry push; I was starting to get antsy [21:42:34] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie [21:44:08] !log jsn@deploy2002 jsn: Backport for [[gerrit:1217786|extension-list: Add PersonalDashboard (T412528)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:44:11] T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528 [21:45:16] !log jsn@deploy2002 jsn: Continuing with sync [21:49:49] (03PS1) 10Eevans: WIP hoard chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224817 (https://phabricator.wikimedia.org/T414112) [21:50:40] I guess adding something to extension-list causes a full i18n rebuild, and syncing that takes quite a bit [21:50:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:51:17] (03CR) 10CI reject: [V:04-1] WIP hoard chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224817 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [21:51:28] ugh, I'm sorry, I should have gone last then [21:57:50] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217786|extension-list: Add PersonalDashboard (T412528)]] (duration: 37m 41s) [21:57:54] T412528: Deploy the PersonalDashboard extension to Beta Cluster - https://phabricator.wikimedia.org/T412528 [21:58:09] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage [21:58:12] arlolra: done [21:58:17] ty [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260108T2200) [22:01:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223283 (https://phabricator.wikimedia.org/T413108) (owner: 10Arlolra) [22:02:24] (03Merged) 10jenkins-bot: Deploy PRV to 27 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223283 (https://phabricator.wikimedia.org/T413108) (owner: 10Arlolra) [22:02:45] !log arlolra@deploy2002 Started scap
sync-world: Backport for [[gerrit:1223283|Deploy PRV to 27 wikis (T413108)]] [22:02:48] T413108: Parsoid Read Views to deploy ~2026-01-01 - https://phabricator.wikimedia.org/T413108 [22:03:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage [22:09:07] !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1223283|Deploy PRV to 27 wikis (T413108)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:09:10] T413108: Parsoid Read Views to deploy ~2026-01-01 - https://phabricator.wikimedia.org/T413108 [22:15:23] !log arlolra@deploy2002 arlolra: Continuing with sync [22:16:29] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11505648 (10bd808) >>! In T413634#11504543, @JMeybohm wrote: > #release-engineering-team: Could you help with removing +2 ? I [[https://gerrit... [22:20:28] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11505652 (10Ragesoss) @ssingh I've just deployed an update that should fix it. Now the user agent is `Wiki Education Dashboard/1.0 (dashboard.wikiedu.org; sage@wikiedu.or... 
[22:21:16] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223283|Deploy PRV to 27 wikis (T413108)]] (duration: 18m 32s) [22:21:19] T413108: Parsoid Read Views to deploy ~2026-01-01 - https://phabricator.wikimedia.org/T413108 [22:24:09] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:24:24] I'm done Pppery if you want to go [22:24:28] Not a deployer [22:24:44] Not the first time people thought I was one, though [22:25:02] Did you want me to do that for you? [22:25:27] If you mean "do I want you to deploy my patch", then sure [22:25:41] Alrighty [22:26:00] * cjming thanks arlolra [22:26:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) (owner: 10Pppery) [22:26:46] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11505670 (10bd808) @DannyS712 In addition to your MediaWiki +2 which I just revoked, do you want to give up other rights in Gerrit such as your...
[22:27:26] Someone should run namespaceDupes after it is deployed (although I checked and don't see any conflicts): https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes [22:27:30] (03Merged) 10jenkins-bot: Igwiki: add draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224799 (https://phabricator.wikimedia.org/T406178) (owner: 10Pppery) [22:27:48] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1224799|Igwiki: add draft namespace (T406178)]] [22:27:51] T406178: Add draft namespace to Igbo Wikipedia - https://phabricator.wikimedia.org/T406178 [22:29:51] !log arlolra@deploy2002 pppery, arlolra: Backport for [[gerrit:1224799|Igwiki: add draft namespace (T406178)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:29:55] Looking [22:30:28] Seems to work [22:30:35] Thanks [22:30:40] !log arlolra@deploy2002 pppery, arlolra: Continuing with sync [22:34:53] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1224799|Igwiki: add draft namespace (T406178)]] (duration: 07m 04s) [22:34:56] T406178: Add draft namespace to Igbo Wikipedia - https://phabricator.wikimedia.org/T406178 [22:35:14] All done [22:35:25] What about namespaceDupes? [22:35:31] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes [22:35:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:36:04] arlolra: do you have access to run that script? 
otherwise i can do it [22:36:14] !incidents [22:36:15] 7299 (UNACKED) [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [22:36:15] 7298 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [22:36:15] 7297 (RESOLVED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [22:36:15] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [22:36:16] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [22:36:16] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [22:36:16] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [22:36:16] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [22:36:17] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [22:36:17] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [22:36:18] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [22:36:18] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [22:36:19] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [22:36:25] !ack 7299 [22:36:26] 7272 (RESOLVED) fransw2001/check_memory [22:36:47] (known bug, sorry -- it acked the correct incident and then gave the wrong reply) [22:37:15] (03PS2) 10Ryan Kemper: Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking) [22:38:20] (03CR) 10Bking: [C:03+1] Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking) [22:38:50] (03PS3) 10Ryan Kemper: Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking) [22:38:59] cjming: oh, can you take care of that?
[22:39:05] sure np [22:39:22] Thanks [22:39:27] (03PS4) 10Ryan Kemper: Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking) [22:40:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:41:12] I do have access, just less experienced with running maintenance scripts [22:41:44] !log cjming@deploy2002 mwscript-k8s job started: namespaceDupes igwiki --fix # T406178 [22:41:47] T406178: Add draft namespace to Igbo Wikipedia - https://phabricator.wikimedia.org/T406178 [22:41:49] Anyone who can deploy patches has access to run maintenance scripts AFAIK [22:42:00] Pppery: done! [22:42:04] Thanks [22:43:10] (03CR) 10Ryan Kemper: [C:03+2] Move opensearch-ipoid to production state [puppet] - 10https://gerrit.wikimedia.org/r/1224738 (owner: 10Bking) [22:45:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:46:02] !incidents [22:46:02] 7299 (ACKED) [2x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [22:46:02] 7298 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [22:46:02] 7297 (RESOLVED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [22:46:03] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [22:46:03] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [22:46:03] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [22:46:03] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [22:46:04] 7292 (RESOLVED) HaproxyUnavailable cache_upload 
global sre (thanos-rule@main) [22:46:04] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [22:46:05] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [22:46:05] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [22:46:06] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [22:46:06] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [22:49:09] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:54:09] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:58:48] (03PS1) 10Ryan Kemper: Revert "Move opensearch-ipoid to production state" [puppet] - 10https://gerrit.wikimedia.org/r/1224827 [23:00:23] (03PS2) 10Ryan Kemper: Revert "Move opensearch-ipoid to production state" [puppet] - 10https://gerrit.wikimedia.org/r/1224827 (https://phabricator.wikimedia.org/T414037) [23:00:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:01:16] (03CR) 10Ryan Kemper: [C:03+2] Revert "Move opensearch-ipoid to production state" [puppet] - 10https://gerrit.wikimedia.org/r/1224827 (https://phabricator.wikimedia.org/T414037) (owner: 10Ryan Kemper) [23:06:52] !incidents [23:06:52] 7299 (ACKED) [2x] ATSBackendErrorsHigh 
cache_upload sre (swift.discovery.wmnet) [23:06:53] 7298 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [23:06:53] 7297 (RESOLVED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [23:06:53] 7296 (RESOLVED) [2x] ProbeDown sre (text-https:443 probes/service eqsin) [23:06:53] 7294 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [23:06:53] 7295 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [23:06:54] 7290 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [23:06:54] 7292 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [23:06:54] 7291 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [23:06:55] 7293 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [23:06:55] 7289 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [23:06:56] 7288 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [23:06:56] 7287 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [23:09:09] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:14:09] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:15:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:17:50] 10ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413005#11505824 (10phaultfinder)