[00:17:58] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224833 (https://phabricator.wikimedia.org/T219903) [00:18:14] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [00:18:17] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:18:19] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [00:18:22] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [00:18:23] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [00:18:25] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [00:20:16] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224833 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [00:22:20] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224833 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [00:22:43] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [00:22:46] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:22:48] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [00:22:51] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [00:22:52] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [00:22:54] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [00:23:51] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [00:24:09] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:24:11] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [00:24:27] !log dani@deploy2002 helmfile 
[eqiad] DONE helmfile.d/services/miscweb: apply [00:24:29] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [00:24:45] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [00:25:33] (03CR) 10Arlolra: [C:03+1] Increase PRV percentage on fawiki/kowiki/azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108) (owner: 10C. Scott Ananian) [00:40:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1224836 [00:40:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1224836 (owner: 10TrainBranchBot) [00:43:52] (03PS1) 10Arlolra: Support incremental roll out of Parsoid Read Views [extensions/ParserMigration] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1224837 (https://phabricator.wikimedia.org/T391881) [00:52:14] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1224836 (owner: 10TrainBranchBot) [01:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:07:49] (03PS1) 10Aaron Schulz: rest-gateway: changed REST sandbox rerouting to redirection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224838 (https://phabricator.wikimedia.org/T396807) [01:10:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1224839 [01:10:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1224839 (owner: 10TrainBranchBot) [01:18:59] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 18m 17s) [01:23:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223261 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [01:23:41] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11506021 (10Papaul) ` Hi Papaul, During testing in our lab we noticed that STP (Spanning Tree Protocol) packets are being counted as “in-error” packe... [01:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:33:08] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1224839 (owner: 10TrainBranchBot) [01:50:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:17:11] (03PS1) 10Andrew Bogott: cloudbackup: use default postgres dir for data files [puppet] - 10https://gerrit.wikimedia.org/r/1224840 [02:22:14] (03PS2) 10Andrew Bogott: cloudbackup: use default postgres dir for data files [puppet] - 10https://gerrit.wikimedia.org/r/1224840 [02:24:09] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:25:17] (03CR) 10Andrew Bogott: [C:03+2] cloudbackup: use default postgres dir for data files [puppet] - 
10https://gerrit.wikimedia.org/r/1224840 (owner: 10Andrew Bogott) [02:25:59] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup1003'] [02:27:28] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie [02:32:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86895 and previous config saved to /var/cache/conftool/dbconfig/20260109-023246-marostegui.json [02:32:51] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:32:51] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:42:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P86896 and previous config saved to /var/cache/conftool/dbconfig/20260109-024254-marostegui.json [02:53:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P86897 and previous config saved to /var/cache/conftool/dbconfig/20260109-025303-marostegui.json [02:53:55] andrew@cumin2002 reimage (PID 2869160) is awaiting input [03:03:06] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudbackup2003.codfw.wmnet with OS trixie [03:03:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86898 and previous config saved to /var/cache/conftool/dbconfig/20260109-030311-marostegui.json [03:03:17] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [03:03:17] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 
[03:03:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [03:03:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86899 and previous config saved to /var/cache/conftool/dbconfig/20260109-030336-marostegui.json [03:03:42] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup2003.codfw.wmnet with OS trixie [03:21:40] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage [03:28:47] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup2003.codfw.wmnet with reason: host reimage [03:45:30] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup2003.codfw.wmnet with OS trixie [04:19:08] PROBLEM - Host 195.200.68.37 is DOWN: CRITICAL - Time to live exceeded (195.200.68.37) [04:19:30] RECOVERY - Host 195.200.68.37 is UP: PING OK - Packet loss = 0%, RTA = 137.53 ms [04:28:44] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [04:28:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86900 and previous config saved to /var/cache/conftool/dbconfig/20260109-042845-marostegui.json [04:28:51] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:28:51] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:30:06] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:31:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [04:34:10] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:38:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P86901 and previous config saved to /var/cache/conftool/dbconfig/20260109-043854-marostegui.json [04:49:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P86902 and previous config saved to /var/cache/conftool/dbconfig/20260109-044902-marostegui.json [04:59:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86903 and previous config saved to /var/cache/conftool/dbconfig/20260109-045910-marostegui.json [04:59:16] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:59:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:59:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [04:59:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86904 and previous config saved to /var/cache/conftool/dbconfig/20260109-045935-marostegui.json [05:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:24:09] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:50:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:50:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:51:50] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:09:09] FIRING: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:18:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance [06:23:15] 
06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11506171 (10Joe) @Ragesoss I see you still get blocked from time to time; I will add an exception, per https://wikitech.wikimedia.org/wiki/Robot_policy#What_to_do_if_thes... [06:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:31:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [06:31:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T413525)', diff saved to https://phabricator.wikimedia.org/P86905 and previous config saved to /var/cache/conftool/dbconfig/20260109-063154-marostegui.json [06:31:58] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:36:24] (03PS1) 10Marostegui: installserver: Do not format db2249 [puppet] - 10https://gerrit.wikimedia.org/r/1224855 (https://phabricator.wikimedia.org/T411570) [06:37:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1022:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:38:15] (03CR) 10Marostegui: [C:03+2] installserver: Do not format db2249 [puppet] - 10https://gerrit.wikimedia.org/r/1224855 (https://phabricator.wikimedia.org/T411570) (owner: 10Marostegui) [06:40:47] (03PS1) 10Marostegui: installserver: Fix duplicate reuse-db-efi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1224856 
(https://phabricator.wikimedia.org/T411570) [06:42:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:45:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T413525)', diff saved to https://phabricator.wikimedia.org/P86906 and previous config saved to /var/cache/conftool/dbconfig/20260109-064541-marostegui.json [06:45:45] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [06:47:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:52:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:52:44] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157 (10kimpham) 03NEW [06:55:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P86907 and previous config saved to /var/cache/conftool/dbconfig/20260109-065549-marostegui.json [07:00:05] Deploy window MediaWiki infrastructure (UTC 
early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260109T0700) [07:01:24] (03PS1) 10Dzahn: eventgate-analytics-external: add wikipedia25.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224858 (https://phabricator.wikimedia.org/T408592) [07:03:24] 06SRE, 06Traffic: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11506217 (10Joe) 05Open→03Resolved p:05Triage→03High a:03Joe Exception added. I allowed a generous amount of requests; please let us know if you still run i... [07:05:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P86908 and previous config saved to /var/cache/conftool/dbconfig/20260109-070558-marostegui.json [07:07:41] FIRING: [20x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:12:00] (03CR) 10Dzahn: [C:04-1] "as a reminder for later - once it's ready - need to define a useful string" [puppet] - 10https://gerrit.wikimedia.org/r/1224575 (owner: 10Dzahn) [07:12:41] FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [07:16:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T413525)', diff saved to https://phabricator.wikimedia.org/P86909 and previous config saved to /var/cache/conftool/dbconfig/20260109-071606-marostegui.json [07:16:12] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:16:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [07:16:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T413525)', diff saved to https://phabricator.wikimedia.org/P86910 and previous config saved to /var/cache/conftool/dbconfig/20260109-071621-marostegui.json [07:16:22] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11506239 (10Dzahn) Hello @kimpham please send an email from your WMDE address to [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Francis ]] of WMF Legal and let her know you would like to start... [07:17:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:17:16] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11506242 (10Dzahn) @WMDE-leszek Could you please approve? Thank you. [07:20:07] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11506244 (10Dzahn) Hello @Martyn.ranyard Please send an email from your WMDE address to [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Francis ]] of WMF Legal and let her know you... 
[07:24:53] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11506246 (10Dzahn) a:03DannyS712 [07:30:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T413525)', diff saved to https://phabricator.wikimedia.org/P86911 and previous config saved to /var/cache/conftool/dbconfig/20260109-073008-marostegui.json [07:30:12] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [07:30:31] (03CR) 10Slyngshede: [C:03+2] Account linking: hide message box when linked [software/bitu] - 10https://gerrit.wikimedia.org/r/1224660 (owner: 10Slyngshede) [07:30:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11506251 (10Dzahn) Since the request is described as "access and update dashboards", Hadoop is not mentioned and per: https://wikitech.wikimedia.org/wiki/Data_... 
[07:33:03] (03Merged) 10jenkins-bot: Account linking: hide message box when linked [software/bitu] - 10https://gerrit.wikimedia.org/r/1224660 (owner: 10Slyngshede) [07:40:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P86912 and previous config saved to /var/cache/conftool/dbconfig/20260109-074017-marostegui.json [07:42:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:44:10] FIRING: [16x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:50:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P86913 and previous config saved to /var/cache/conftool/dbconfig/20260109-075025-marostegui.json [07:52:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:54:10] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:55:28] 
FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [07:56:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1224856 (https://phabricator.wikimedia.org/T411570) (owner: 10Marostegui) [07:57:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:59:10] FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260109T0800) [08:00:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T413525)', diff saved to https://phabricator.wikimedia.org/P86914 and previous config saved to /var/cache/conftool/dbconfig/20260109-080033-marostegui.json [08:00:37] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:00:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [08:00:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T413525)', diff saved to https://phabricator.wikimedia.org/P86915 and previous config saved to /var/cache/conftool/dbconfig/20260109-080058-marostegui.json [08:02:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:06:00] (03PS2) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) [08:08:06] (03CR) 10Muehlenhoff: [C:03+1] sre.hosts.reimage: remove puppet 5 support and default to 7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [08:10:06] (03PS2) 10Muehlenhoff: Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) [08:10:33] (03PS1) 10Dzahn: admin: upgrade tgritschacher to analytics-privatedata without shell [puppet] - 10https://gerrit.wikimedia.org/r/1224862 
(https://phabricator.wikimedia.org/T414061) [08:12:31] (03PS2) 10Dzahn: admin: upgrade tgritschacher to analytics-privatedata without shell [puppet] - 10https://gerrit.wikimedia.org/r/1224862 (https://phabricator.wikimedia.org/T414061) [08:13:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:13:57] (03PS3) 10Muehlenhoff: Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) [08:15:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:17:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T413525)', diff saved to https://phabricator.wikimedia.org/P86916 and previous config saved to /var/cache/conftool/dbconfig/20260109-081708-marostegui.json [08:17:12] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:18:18] (03CR) 10Elukey: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:22:02] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:24:15] (03CR) 10Elukey: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) 
(owner: 10Muehlenhoff) [08:25:20] (03CR) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:26:08] (03CR) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:27:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P86917 and previous config saved to /var/cache/conftool/dbconfig/20260109-082717-marostegui.json [08:29:44] (03CR) 10Krinkle: Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [08:30:18] Dreamy_Jazz: in case you know ^ :D [08:30:23] (03CR) 10Elukey: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:30:39] !log restarting blazegraph on wdqs-main@eqiad - high thread count [08:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:46] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1018 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:32:46] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1018 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:33:34] (03CR) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:33:50] RECOVERY - PyBal 
backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:34:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:34:10] FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:34:10] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:34:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:35:06] FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:35:28] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:37:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:37:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P86918 and previous config saved to 
/var/cache/conftool/dbconfig/20260109-083725-marostegui.json [08:38:50] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:44:10] FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:45:06] FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:47:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T413525)', diff saved to https://phabricator.wikimedia.org/P86919 and previous config saved to /var/cache/conftool/dbconfig/20260109-084733-marostegui.json [08:47:37] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:47:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2201.codfw.wmnet with reason: Maintenance [08:48:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2211.codfw.wmnet with reason: Maintenance [08:48:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T413525)', diff saved to https://phabricator.wikimedia.org/P86920 and previous config saved to /var/cache/conftool/dbconfig/20260109-084834-marostegui.json [08:49:10] FIRING: [18x] ProbeDown: Service wdqs1011:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:54] (03PS3) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) [08:52:02] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:52:50] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:55:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:57:02] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:59:31] (03CR) 10Jcrespo: [C:03+1] "Looks fine, let me know when you want to deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) (owner: 10Tchanders) [09:00:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T413525)', diff saved to https://phabricator.wikimedia.org/P86921 and previous config saved to /var/cache/conftool/dbconfig/20260109-090050-marostegui.json [09:00:54] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:05:50] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1013.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:06:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:06:07] (03PS1) 10Muehlenhoff: Remove dead Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1224889 [09:07:02] FIRING: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:07:18] !log [urbanecm@deploy2002 ~]$ kubectl delete job/growthexperiments-updatementeedata-s1-29460615 # T414167 [09:07:20] (03CR) 10Filippo Giunchedi: [C:03+1] Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 
(https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:21] T414167: Do not alert about a failed cron job when logs are already discarded - https://phabricator.wikimedia.org/T414167 [09:07:54] (03PS1) 10Muehlenhoff: Remove dead Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1224891 [09:09:11] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: Update image for revise-tone-task-generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224721 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou) [09:10:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P86922 and previous config saved to /var/cache/conftool/dbconfig/20260109-091058-marostegui.json [09:12:52] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506446 (10elukey) Tried to replicate Alex's test with the following: On registry1004 (not serving live traffic): - `sudo iptables -A INPUT -p tcp -s 10.192.32.7... 
[09:14:10] FIRING: [14x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:15:51] (03CR) 10Marostegui: [C:03+2] installserver: Fix duplicate reuse-db-efi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1224856 (https://phabricator.wikimedia.org/T411570) (owner: 10Marostegui) [09:19:10] FIRING: [11x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:21:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P86923 and previous config saved to /var/cache/conftool/dbconfig/20260109-092105-marostegui.json [09:22:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:24:00] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11506477 (10WMDE-leszek) [09:24:10] FIRING: [8x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:24:16] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11506478 (10WMDE-leszek) I approve this request on WMDE behalf. Thank you [09:27:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:31:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T413525)', diff saved to https://phabricator.wikimedia.org/P86924 and previous config saved to /var/cache/conftool/dbconfig/20260109-093114-marostegui.json [09:31:17] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:31:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2213.codfw.wmnet with reason: Maintenance [09:31:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2213 (T413525)', diff saved to https://phabricator.wikimedia.org/P86925 and previous config saved to /var/cache/conftool/dbconfig/20260109-093138-marostegui.json [09:33:47] (03CR) 10Elukey: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:35:31] (03PS4) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) [09:37:20] (03PS1) 10Gmodena: sup: register rdf updater with wdp [alerts] - 
10https://gerrit.wikimedia.org/r/1224893 (https://phabricator.wikimedia.org/T414169) [09:37:27] (03PS1) 10DCausse: airflow-search: add enterprise extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) [09:37:37] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11506526 (10JMeybohm) [09:39:11] (03CR) 10CI reject: [V:04-1] airflow-search: add enterprise extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse) [09:39:23] (03CR) 10DCausse: "@Ben/Balthazar: this is mainly up for discussion and I'm not yet clear on how to make use of this secret from the airflow side" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) (owner: 10DCausse) [09:41:40] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 5470.19 ms [09:44:10] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:44:17] (03PS12) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [09:45:06] FIRING: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:45:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:45:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T413525)', diff saved to 
https://phabricator.wikimedia.org/P86926 and previous config saved to /var/cache/conftool/dbconfig/20260109-094558-marostegui.json [09:46:02] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:46:18] (03CR) 10Elukey: [C:03+1] Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:46:20] (03CR) 10Federico Ceratto: "Ok, I updated parsercache logging, see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1215575/12/tests/unit/sre/mysql/parsercache" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [09:46:23] (03CR) 10Trueg: [C:03+2] sup: register rdf updater with wdp [alerts] - 10https://gerrit.wikimedia.org/r/1224893 (https://phabricator.wikimedia.org/T414169) (owner: 10Gmodena) [09:47:05] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11506544 (10ABran-WMF) [09:47:54] (03CR) 10Trueg: [V:03+2 C:03+2] sup: register rdf updater with wdp [alerts] - 10https://gerrit.wikimedia.org/r/1224893 (https://phabricator.wikimedia.org/T414169) (owner: 10Gmodena) [09:48:00] (03Merged) 10jenkins-bot: sup: register rdf updater with wdp [alerts] - 10https://gerrit.wikimedia.org/r/1224893 (https://phabricator.wikimedia.org/T414169) (owner: 10Gmodena) [09:49:10] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:49:13] (03CR) 10Muehlenhoff: Remove Puppet 5 settings from late_command.sh (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1224722 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:50:07] (03CR) 10Marostegui: "Thanks - I am wondering why do we force people to run a depool inside of a screen? Isn't it a bit overkill?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [09:50:27] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool es1049: test [09:50:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool es1049: test [09:51:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool es1049: test [09:51:21] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11506551 (10cmooney) As you would expect the unmanaged mgmt switches do send STP frames ` A:cmooney@lswtest-d8-eqiad# bash network-instance mgmt tcpdu... [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:41] (03CR) 10Marostegui: "Ah I guess it is because it is required for the repool." [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [09:52:10] (03CR) 10Marostegui: "Is it easy to require it only for the repooling and NOT for the depooling?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [09:52:52] (03CR) 10Dreamy Jazz: [C:03+1] Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [09:53:30] Krinkle: Replied to your comment, let me know if you want to discuss about it on IRC to avoid the slower replies that tend to happen on Gerrit :D [09:56:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P86929 and previous config saved to /var/cache/conftool/dbconfig/20260109-095606-marostegui.json [09:56:23] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506561 (10elukey) The problem seems an exact replica of https://github.com/distribution/distribution/issues/2225, so I tried to add the following snippet to the r... 
[09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:57:39] (03CR) 10Majavah: [C:03+1] Rename stale_certs_exporter and move under puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1224687 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:02:19] (03CR) 10Elukey: [C:03+1] Remove dead Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1224889 (owner: 10Muehlenhoff) [10:06:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P86930 and previous config saved to /var/cache/conftool/dbconfig/20260109-100614-marostegui.json [10:08:01] (03PS1) 10Slyngshede: P:cache::haproxy: check existance of mmdb files [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111) [10:09:47] (03CR) 10CI reject: [V:04-1] P:cache::haproxy: check existance of mmdb files [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111) (owner: 10Slyngshede) [10:10:09] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506589 (10elukey) The only thing that I found on the docker distribution logs is was: ` Jan 09 09:52:25 
registry1004 docker-registry[676]: time="2026-01-09T09:52... [10:12:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:16:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T413525)', diff saved to https://phabricator.wikimedia.org/P86932 and previous config saved to /var/cache/conftool/dbconfig/20260109-101622-marostegui.json [10:16:26] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:16:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2223.codfw.wmnet with reason: Maintenance [10:16:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T413525)', diff saved to https://phabricator.wikimedia.org/P86933 and previous config saved to /var/cache/conftool/dbconfig/20260109-101648-marostegui.json [10:17:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:19:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T413525)', diff saved to https://phabricator.wikimedia.org/P86934 and previous config saved to /var/cache/conftool/dbconfig/20260109-101917-marostegui.json [10:19:20] (03CR) 10Muehlenhoff: [C:03+2] Remove dead Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/1224889 (owner: 10Muehlenhoff) [10:19:50] (03PS1) 10Jelto: 
varnish: add wikipedia25 frontend vcl. [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592) [10:21:32] (03CR) 10Dzahn: [C:03+1] "ooooh! great find:) that looks like it's needed indeed. another special case because it's not just a wikimedia.org sub" [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [10:24:10] FIRING: [6x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:14] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:24:50] (03CR) 10Filippo Giunchedi: [C:03+2] Remove spurious 'diff' file [alerts] - 10https://gerrit.wikimedia.org/r/1224585 (owner: 10Filippo Giunchedi) [10:29:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P86936 and previous config saved to /var/cache/conftool/dbconfig/20260109-102925-marostegui.json [10:30:02] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:32:02] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:32:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:33:16] (03PS1) 10Filippo Giunchedi: README.md: mention Trixie and 
standalone promtool package [alerts] - 10https://gerrit.wikimedia.org/r/1224907 [10:35:49] (03PS1) 10Dzahn: zookeeper: add ssl.keyStore.passwordPath is TLS is enabled (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) [10:36:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool es1049: test [10:36:59] (03CR) 10Vgutierrez: [C:03+1] varnish: add wikipedia25 frontend vcl. [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [10:38:36] (03CR) 10Dzahn: [C:03+2] "especially the part that the else has "return (synth(400, ""));" is convincing that this is the cause:)" [puppet] - 10https://gerrit.wikimedia.org/r/1224901 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [10:38:46] !log depooling / repooling wdqs-main@eqiad servers one by one to allow time to recover and catch up on updates. [10:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:10] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:39:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P86938 and previous config saved to /var/cache/conftool/dbconfig/20260109-103934-marostegui.json [10:42:02] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:47:41] RESOLVED: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:49:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T413525)', diff saved to https://phabricator.wikimedia.org/P86939 and previous config saved to /var/cache/conftool/dbconfig/20260109-104942-marostegui.json [10:49:46] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:49:56] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:50:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2228.codfw.wmnet with reason: Maintenance [10:50:02] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 2266.53 ms [10:50:03] (03CR) 10Federico Ceratto: "Various cookbooks seem to require it by default but it would be easy to do the check only when pooling. 
The only issue is that even if dep" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [10:50:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T413525)', diff saved to https://phabricator.wikimedia.org/P86940 and previous config saved to /var/cache/conftool/dbconfig/20260109-105008-marostegui.json [10:50:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [10:51:18] 06SRE, 10MediaWiki-Action-API, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506731 (10Xqt) [10:51:41] (03CR) 10Muehlenhoff: "Looks good, suggestion inline" [alerts] - 10https://gerrit.wikimedia.org/r/1224907 (owner: 10Filippo Giunchedi) [10:52:30] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506742 (10elukey) The very interesting thing is that after a few tries I got: ` elukey@build2001:~$ sudo docker push docker-registry.svc.eqiad.wmnet/test/istio/b... [10:52:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T413525)', diff saved to https://phabricator.wikimedia.org/P86941 and previous config saved to /var/cache/conftool/dbconfig/20260109-105237-marostegui.json [10:54:56] 10SRE-swift-storage, 10Ceph, 06serviceops, 06Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11506760 (10elukey) To keep archives happy - I am working in T394476 to properly onboard ceph apu... 
[10:55:30] (03PS2) 10Filippo Giunchedi: README.md: mention Trixie and standalone promtool package [alerts] - 10https://gerrit.wikimedia.org/r/1224907 [10:55:34] (03CR) 10Filippo Giunchedi: README.md: mention Trixie and standalone promtool package (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1224907 (owner: 10Filippo Giunchedi) [10:55:55] (03CR) 10Filippo Giunchedi: [C:03+2] "Thank you! Added your suggestion" [alerts] - 10https://gerrit.wikimedia.org/r/1224907 (owner: 10Filippo Giunchedi) [10:55:57] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] README.md: mention Trixie and standalone promtool package [alerts] - 10https://gerrit.wikimedia.org/r/1224907 (owner: 10Filippo Giunchedi) [10:59:56] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:01:07] (03PS1) 10Jelto: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224913 (https://phabricator.wikimedia.org/T408592) [11:02:02] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:02:30] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [11:02:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P86942 and previous config saved to /var/cache/conftool/dbconfig/20260109-110245-marostegui.json [11:04:10] FIRING: [8x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:04:28] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:07:38] PROBLEM - Memcached on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [11:08:28] RECOVERY - Memcached on titan1002 is OK: TCP OK - 0.010 second response time on 10.64.48.167 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [11:09:10] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:09:25] !log revoked legacy config-master discovery cert T365798 [11:09:26] (03CR) 10Dzahn: [C:03+1] miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224913 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [11:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:28] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1018:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:09:29] T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 [11:09:50] (03CR) 10Jelto: [C:03+2] miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224913 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [11:10:06] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:05] (03Merged) 10jenkins-bot: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224913 (https://phabricator.wikimedia.org/T408592) (owner: 10Jelto) [11:12:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P86943 and previous config saved to /var/cache/conftool/dbconfig/20260109-111254-marostegui.json [11:13:33] 06SRE, 10MediaWiki-Action-API, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506781 (10Xqt) [11:13:51] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [11:14:11] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [11:14:18] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [11:14:38] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [11:14:47] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [11:15:06] FIRING: [10x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:07] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:56] 10SRE-swift-storage, 10Ceph, 06serviceops, 
Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11506803 (elukey) [11:19:09] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506805 (taavi) [11:19:10] FIRING: [14x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:06] FIRING: [15x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:03] (PS1) Dzahn: point wikipedia25.org to ncredir [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) [11:22:33] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11506822 (Jelto) We were able to solve the loadbalancer issues and the site is reachable and returns 200 and the correct content. We will d...
[11:22:47] (03CR) 10CI reject: [V:04-1] point wikipedia25.org to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [11:23:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T413525)', diff saved to https://phabricator.wikimedia.org/P86944 and previous config saved to /var/cache/conftool/dbconfig/20260109-112302-marostegui.json [11:23:06] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [11:24:10] FIRING: [19x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:25:06] FIRING: [21x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:02] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:29:10] FIRING: [22x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:30:48] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11506858 (10elukey) I had a chat with 
Matthew about apus, and they confirmed that there is no explicit rate/bw limit in place for the docker-registry account. I obs... [11:30:56] SRE, Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179 (MoritzMuehlenhoff) NEW [11:31:26] SRE, Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11506869 (MoritzMuehlenhoff) p:Triage→Medium [11:32:19] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506876 (Tgr) What user agent are you using? [11:33:45] (CR) Dzahn: "recheck" [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [11:33:59] (CR) Dzahn: "this is to be reverted next week" [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [11:34:28] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:44:33] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506948 (Joe) I'm not sure having your CI depend on external resources is a good policy; I encourage you to change that long-term, but anyways, we don't want to...
[11:49:28] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:54:28] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:57:54] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11506993 (10Xqt) >>! In T414173#11506876, @Tgr wrote: > What user agent are you using? `pwb.py version` for a sample test taks gives ` Pywikibot: [https] wikimedi... [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260109T0800) [12:00:06] jelto, arnoldokoth, mutante, and arnaudb: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260109T1200). [12:01:55] (03CR) 10AikoChou: [C:03+2] ml-services: Update image for revise-tone-task-generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224721 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou) [12:02:42] 10SRE-Access-Requests: Grafana and Logstash access for trueg - https://phabricator.wikimedia.org/T414187 (10trueg) 03NEW [12:03:14] that gitlab version upgrade already happened. nothing now. 
[12:03:41] (03Merged) 10jenkins-bot: ml-services: Update image for revise-tone-task-generator on prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224721 (https://phabricator.wikimedia.org/T412210) (owner: 10AikoChou) [12:04:28] FIRING: [4x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:08:32] 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507065 (10trueg) [12:08:48] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507067 (10Xqt) I also get that blocker for my normal bot during running redirect.py script: ` >>> Talk:Licentiate (Pontifical Degree) <<< Links to: [[en:Talk... [12:11:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance [12:13:25] 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507110 (10trueg) [12:14:00] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [12:16:24] !log Deploy schema change on s7 primary master T414178 [12:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:27] T414178: Remove default value from gb_by_wiki in globalblocks table on WMF wikis - https://phabricator.wikimedia.org/T414178 [12:16:45] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[12:19:02] !log Deploy schema change on s6 primary master T414183 [12:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:05] T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183 [12:19:44] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507159 (10elukey) Since having nginx is not really needed for this test, I went back to testing with a direct push to registry1004.eqiad.wmnet:5002: ` elukey@bui... [12:21:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:21:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:21:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [12:21:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T413525)', diff saved to https://phabricator.wikimedia.org/P86946 and previous config saved to /var/cache/conftool/dbconfig/20260109-122145-marostegui.json [12:21:49] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [12:21:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T413525)', diff saved to https://phabricator.wikimedia.org/P86947 and previous config saved to /var/cache/conftool/dbconfig/20260109-122157-marostegui.json [12:22:06] 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507167 (10trueg) [12:23:47] !log Deploy schema change on s2 primary master T414183 [12:23:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:50] SRE, SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192 (trueg) NEW [12:27:02] SRE, SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507189 (trueg) [12:30:49] (PS1) Daniel Kinzler: rest-gateway: generate retry-after header for rate-limited requests [deployment-charts] - https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) [12:30:58] (CR) AikoChou: [C:+1] revert-risk: Deploy on prod and staging new model version for both language-agnosting and multingual. [deployment-charts] - https://gerrit.wikimedia.org/r/1224604 (https://phabricator.wikimedia.org/T411786) (owner: Gkyziridis) [12:32:32] !log Deploy schema change on s7 primary master T414183 [12:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:35] T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183 [12:34:13] (CR) Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: Pmiazga) [12:36:11] (CR) Daniel Kinzler: "is this for access fromt he outside or from within our network? We don't need access to redioscope from the outside..."
[dns] - https://gerrit.wikimedia.org/r/1224652 (https://phabricator.wikimedia.org/T413999) (owner: Clément Goubert) [12:37:20] (CR) JMeybohm: [C:+2] admin: upgrade tgritschacher to analytics-privatedata without shell [puppet] - https://gerrit.wikimedia.org/r/1224862 (https://phabricator.wikimedia.org/T414061) (owner: Dzahn) [12:37:58] !log Deploy schema change on s5 primary master T414183 [12:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:01] T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183 [12:39:10] FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:39:12] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11507254 (JMeybohm) Open→Resolved a:JMeybohm Merged the patch prepared by @Dzahn (thanks). [12:39:43] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for tgritschacher - https://phabricator.wikimedia.org/T414061#11507258 (JMeybohm) [12:40:10] (CR) Marostegui: "Yeah maybe, ok, let's leave that for now" [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto) [12:40:59] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507270 (Joe) This user agent is not compliat with our user-agent policy: https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy...
[12:48:12] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11507289 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Welcome! Grafana access is granted by having an LDAP account. Please request access to logstash via Wikimedia IDM at http... [12:49:29] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11507293 (10JMeybohm) [12:52:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T413525)', diff saved to https://phabricator.wikimedia.org/P86949 and previous config saved to /var/cache/conftool/dbconfig/20260109-125250-marostegui.json [12:52:54] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [12:53:19] (03Abandoned) 10Blake: service: add excluded_services helper function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1224041 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [12:56:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T413525)', diff saved to https://phabricator.wikimedia.org/P86950 and previous config saved to /var/cache/conftool/dbconfig/20260109-125611-marostegui.json [12:59:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:01:53] (03PS2) 10Slyngshede: P:cache::haproxy: check existance of mmdb files [puppet] - 10https://gerrit.wikimedia.org/r/1224897 (https://phabricator.wikimedia.org/T414111) [13:02:13] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11507333 (10JMeybohm) [13:02:48] !log Deploy schema change on s1 primary master T414183 [13:02:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:51] T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183 [13:02:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P86951 and previous config saved to /var/cache/conftool/dbconfig/20260109-130258-marostegui.json [13:03:23] !log Deploy schema change on s8 primary master T414183 [13:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:07] !log Deploy schema change on s4 primary master T414183 [13:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:05:12] (03CR) 10Tchanders: "This can be deployed any time, from my perspective. Do we need to wait for a puppet window?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) (owner: 10Tchanders) [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P86952 and previous config saved to /var/cache/conftool/dbconfig/20260109-130619-marostegui.json [13:10:45] !log Deploy schema change on s3 primary master (this will take a few hours) T414183 [13:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:48] T414183: Remove default value from gbw_by in global_block_whitelist table on WMF wikis - https://phabricator.wikimedia.org/T414183 [13:13:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P86953 and previous config saved to /var/cache/conftool/dbconfig/20260109-131306-marostegui.json [13:16:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P86954 and previous config saved to /var/cache/conftool/dbconfig/20260109-131628-marostegui.json [13:17:04] (03CR) 10Gkyziridis: [C:03+2] revert-risk: Deploy on prod and staging new model version for both language-agnosting and multingual. 
[deployment-charts] - https://gerrit.wikimedia.org/r/1224604 (https://phabricator.wikimedia.org/T411786) (owner: Gkyziridis) [13:18:05] (PS1) Federico Ceratto: sre.mysql.clone: More uniform logging syntax [cookbooks] - https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) [13:18:22] (CR) Federico Ceratto: sre.mysql.clone: More uniform logging syntax (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: Federico Ceratto) [13:18:59] (Merged) jenkins-bot: revert-risk: Deploy on prod and staging new model version for both language-agnosting and multingual. [deployment-charts] - https://gerrit.wikimedia.org/r/1224604 (https://phabricator.wikimedia.org/T411786) (owner: Gkyziridis) [13:19:53] SRE, SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11507408 (JMeybohm) @trueg could you please specify what access level you're requesting/what you need access to (see https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#What_access_...
[13:20:05] (CR) Marostegui: sre.mysql.clone: More uniform logging syntax (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: Federico Ceratto) [13:21:19] (PS2) Federico Ceratto: sre.mysql.clone: More uniform logging syntax [cookbooks] - https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) [13:22:33] (CR) Marostegui: [C:+1] sre.mysql.clone: More uniform logging syntax [cookbooks] - https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: Federico Ceratto) [13:23:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T413525)', diff saved to https://phabricator.wikimedia.org/P86955 and previous config saved to /var/cache/conftool/dbconfig/20260109-132316-marostegui.json [13:23:20] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:23:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:23:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T413525)', diff saved to https://phabricator.wikimedia.org/P86956 and previous config saved to /var/cache/conftool/dbconfig/20260109-132340-marostegui.json [13:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:26:03] (CR) Jelto: [C:+1] "lgtm as soon as CI is happy" [dns] - https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [13:26:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T413525)',
diff saved to https://phabricator.wikimedia.org/P86957 and previous config saved to /var/cache/conftool/dbconfig/20260109-132636-marostegui.json [13:26:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [13:26:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T413525)', diff saved to https://phabricator.wikimedia.org/P86958 and previous config saved to /var/cache/conftool/dbconfig/20260109-132651-marostegui.json [13:32:06] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507457 (10Xqt) >>! In T414173#11507270, @Joe wrote: > This user agent is not compliat with our user-agent policy: > > https://foundation.wikimedia.org/wiki/Poli... [13:32:25] (03PS1) 10Blake: sre.discovery.datacenter: use service registry for exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/1224945 (https://phabricator.wikimedia.org/T412211) [13:34:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:35:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T413525)', diff saved to https://phabricator.wikimedia.org/P86959 and previous config saved to /var/cache/conftool/dbconfig/20260109-133514-marostegui.json [13:35:18] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:39:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:42:41] 06SRE, 06Traffic: All github action tests of 
Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507482 (10Fabfur) >>! In T414173#11507457, @Xqt wrote: >>>! In T414173#11507270, @Joe wrote: >> This user agent is not compliat with our user-agent policy: >> >... [13:42:46] (03PS2) 10Tchanders: Don't collect CheckUser-specific temp account patrolling metrics on labs [puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) [13:44:06] (03CR) 10Jcrespo: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) (owner: 10Tchanders) [13:45:05] 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 05Goal, 06Release-Engineering-Team (Seen): Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759#11507488 (10WMDE-leszek) Bumping this ticket as I maybe might have an interest in seeing this progressing. I... [13:45:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P86960 and previous config saved to /var/cache/conftool/dbconfig/20260109-134522-marostegui.json [13:48:06] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507521 (10Xqt) [13:52:13] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.clone: More uniform logging syntax [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto) [13:53:56] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.clone: More uniform logging syntax (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 (https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto) [13:53:58] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sre.mysql.clone: More uniform logging syntax [cookbooks] - 10https://gerrit.wikimedia.org/r/1224940 
(https://phabricator.wikimedia.org/T414052) (owner: 10Federico Ceratto) [13:55:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P86961 and previous config saved to /var/cache/conftool/dbconfig/20260109-135531-marostegui.json [13:55:53] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507526 (10Xqt) @Fabfur: Currently we have this UA: `version (wikipedia:de; User:Xqtest) Pywikibot/11.0.0.dev10 (g20136) requests/2.32.5 Python/3.13.0.final.0` W... [13:57:18] (03PS1) 10Dzahn: trafficserver: disable wikipedia25 [puppet] - 10https://gerrit.wikimedia.org/r/1224957 (https://phabricator.wikimedia.org/T408592) [14:00:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T413525)', diff saved to https://phabricator.wikimedia.org/P86962 and previous config saved to /var/cache/conftool/dbconfig/20260109-140016-marostegui.json [14:00:20] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:05:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T413525)', diff saved to https://phabricator.wikimedia.org/P86963 and previous config saved to /var/cache/conftool/dbconfig/20260109-140539-marostegui.json [14:05:44] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:05:54] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1224957 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [14:05:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:06:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T413525)', diff saved to https://phabricator.wikimedia.org/P86964 and 
previous config saved to /var/cache/conftool/dbconfig/20260109-140604-marostegui.json [14:08:05] (03CR) 10Dzahn: "we are using this instead: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224957" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [14:08:15] (03CR) 10Dzahn: [C:03+2] trafficserver: disable wikipedia25 [puppet] - 10https://gerrit.wikimedia.org/r/1224957 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [14:08:57] (03PS1) 10Dzahn: Revert "trafficserver: disable wikipedia25" [puppet] - 10https://gerrit.wikimedia.org/r/1224959 [14:09:10] FIRING: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [14:09:28] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs1014:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:10:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P86965 and previous config saved to /var/cache/conftool/dbconfig/20260109-141024-marostegui.json [14:10:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86966 and previous config saved to /var/cache/conftool/dbconfig/20260109-141052-marostegui.json [14:10:57] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:10:57] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:11:10] (03Abandoned) 10Dzahn: point wikipedia25.org to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [14:11:16] 06SRE, 
Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507562 (Fabfur) >>! In T414173#11507526, @Xqt wrote: > @Fabfur: Currently we have this UA: > `version (wikipedia:de; User:Xqtest) Pywikibot/11.0.0.dev10 (g201... [14:16:15] (PS2) C. Scott Ananian: Increase PRV percentage on fawiki/kowiki/azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1224719 (https://phabricator.wikimedia.org/T413108) [14:17:54] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507588 (JAnD) >>! In T414173#11507562, @Fabfur wrote: > I think this is better, if you want to add an email as contact you can do it right after the URL, separ... [14:19:10] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag eqiad - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [14:20:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P86967 and previous config saved to /var/cache/conftool/dbconfig/20260109-142033-marostegui.json [14:21:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P86968 and previous config saved to /var/cache/conftool/dbconfig/20260109-142100-marostegui.json [14:21:41] SRE, Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507597 (Fabfur) >>! In T414173#11507588, @JAnD wrote: >>>! In T414173#11507562, @Fabfur wrote: >> I think this is better, if you want to add an email as contac...
[14:24:10] FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:14] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:28:47] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1003.eqiad.wmnet with OS trixie [14:29:10] FIRING: [22x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:58] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup1003'] [14:30:06] FIRING: [22x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T413525)', diff saved to https://phabricator.wikimedia.org/P86970 and previous config saved to /var/cache/conftool/dbconfig/20260109-143040-marostegui.json [14:30:44] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:30:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2189.codfw.wmnet with reason: Maintenance [14:31:03] !log andrew@cumin2002 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudbackup1003'] [14:31:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T413525)', diff saved to https://phabricator.wikimedia.org/P86971 and previous config saved to /var/cache/conftool/dbconfig/20260109-143105-marostegui.json [14:33:28] (03PS6) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) [14:33:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission pay-lvs1003.frack.eqiad.wmnet and pay-lvs1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T413986#11507633 (10Jclark-ctr) [14:34:10] FIRING: [16x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:52] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudbackup1003'] [14:34:58] (03CR) 10CI reject: [V:04-1] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [14:35:06] FIRING: [14x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:08] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507634 (10Xqt) @Joe, @Tgr: Could you please consider postponing the newly introduced restriction until the Pywikibot User-Agent has been updated? As far as I kno... 
[14:35:21] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudbackup1003'] [14:35:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS trixie [14:39:10] FIRING: [10x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:39:30] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507647 (10elukey) Retried the same op that led to the HTTP 500 after lunch: ` elukey@build2001:~$ sudo docker push registry1004.eqiad.wmnet:5002/test/cert-manage... [14:39:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T413525)', diff saved to https://phabricator.wikimedia.org/P86972 and previous config saved to /var/cache/conftool/dbconfig/20260109-143940-marostegui.json [14:39:44] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:41:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86973 and previous config saved to /var/cache/conftool/dbconfig/20260109-144116-marostegui.json [14:41:21] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:41:22] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:41:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [14:41:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 
days, 0:00:00 on 6 hosts with reason: Maintenance [14:41:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86974 and previous config saved to /var/cache/conftool/dbconfig/20260109-144151-marostegui.json [14:43:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission pay-lvs1003.frack.eqiad.wmnet and pay-lvs1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T413986#11507665 (10Jclark-ctr) 05Open→03Resolved [14:44:57] (03CR) 10Pmiazga: [C:03+1] "LGTM, couple nitpicks, temporarily not being able to test it locally." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [14:45:53] !log gkyziridis@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:46:09] !log gkyziridis@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:47:16] (03PS7) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) [14:48:55] (03CR) 10Btullis: [C:03+1] "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1224891 (owner: 10Muehlenhoff) [14:49:37] (03PS1) 10Btullis: Revert "Failover the hive server2 and metastore services to the standby" [dns] - 10https://gerrit.wikimedia.org/r/1224967 [14:49:44] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage [14:49:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P86975 and previous config saved to /var/cache/conftool/dbconfig/20260109-144948-marostegui.json [14:50:19] (03CR) 10Marostegui: "let me know when pushed, so we can test it" [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [14:50:35] (03CR) 10CI reject: [V:04-1] Revert "Failover the hive server2 and metastore services to the standby" [dns] - 10https://gerrit.wikimedia.org/r/1224967 (owner: 10Btullis) [14:53:01] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T413559#11507693 (10Jclark-ctr) 05Open→03Resolved [14:54:57] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1003.eqiad.wmnet with reason: host reimage [14:55:50] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:56:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:56:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11507696 (10Jgreen) * NIC.Embedded.1-1-1 pxe disabled * NIC.Integrated.1-1-1 pxe enabled * boot method changed from UEFI to BIOS [14:56:40] (03PS1) 10Gkyziridis: revert-risk: Roll back mulkilingual model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1224969 (https://phabricator.wikimedia.org/T411786) [14:58:56] (03CR) 10Btullis: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1224967 (owner: 10Btullis) [14:59:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T413525)', diff saved to https://phabricator.wikimedia.org/P86976 and previous config saved to /var/cache/conftool/dbconfig/20260109-145910-marostegui.json [14:59:14] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:59:38] (03CR) 10Gkyziridis: [C:03+2] revert-risk: Roll back mulkilingual model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224969 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis) [14:59:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P86977 and previous config saved to /var/cache/conftool/dbconfig/20260109-145956-marostegui.json [15:00:06] FIRING: [4x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:38] (03Merged) 10jenkins-bot: revert-risk: Roll back mulkilingual model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1224969 (https://phabricator.wikimedia.org/T411786) (owner: 10Gkyziridis) [15:02:02] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:02:44] !log gkyziridis@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:02:50] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:03:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:03:08] !log gkyziridis@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:05:49] (03CR) 10Federico Ceratto: [C:03+2] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [15:06:18] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [15:07:24] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507718 (10revi) >>! In T414173#11507526, @Xqt wrote: > @Fabfur: Currently we have this UA: > `version (wikipedia:de; User:Xqtest) Pywikibot/11.0.0.dev10 (g20136... [15:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P86978 and previous config saved to /var/cache/conftool/dbconfig/20260109-150918-marostegui.json [15:10:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T413525)', diff saved to https://phabricator.wikimedia.org/P86979 and previous config saved to /var/cache/conftool/dbconfig/20260109-151005-marostegui.json [15:10:09] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:10:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [15:10:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T413525)', diff saved to https://phabricator.wikimedia.org/P86980 and previous config saved to 
/var/cache/conftool/dbconfig/20260109-151029-marostegui.json [15:10:50] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:11:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:11:40] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1168 - https://phabricator.wikimedia.org/T413704#11507727 (10BTullis) 05Open→03Resolved I checked the physical disks. ` btullis@an-worker1168:~$ sudo perccli64 /c0 show all PD LIST : ======= --------------------------------------------... [15:14:15] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205 (10MoritzMuehlenhoff) 03NEW [15:14:18] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507752 (10Joe) >>! In T414173#11507634, @Xqt wrote: > @Joe, @Tgr: Could you please consider postponing the newly introduced restriction until the Pywikibot User-... 
[15:14:20] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11507753 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:17:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:17:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1003.eqiad.wmnet with OS trixie [15:18:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11507764 (10BTullis) 05Open→03Resolved I checked the physical disks with: ` sudo perccli64 /c0 show all PD LIST : ======= -----------------... [15:18:58] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11507766 (10Papaul) ` Thanks for confirming. We recommend blocking those packets, if possible, using dst mac 01:80:c2:00:00:00 on the mgmt switch, whi... 
[15:19:06] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1198 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [15:19:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P86981 and previous config saved to /var/cache/conftool/dbconfig/20260109-151927-marostegui.json [15:19:36] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11507767 (10Joe) To clarify: - Users on toolsforge or cloud VPS are exempt from the limit - I only see about 5% of all requests with UA containing `User:XXX`... [15:21:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T413525)', diff saved to https://phabricator.wikimedia.org/P86982 and previous config saved to /var/cache/conftool/dbconfig/20260109-152120-marostegui.json [15:21:24] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:21:55] (03PS1) 10Andrew Bogott: wmcs cinder backups: move all backups to 2003 so 2004 can be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/1224974 (https://phabricator.wikimedia.org/T375217) [15:21:57] (03PS1) 10Andrew Bogott: cloudbackup: flip all backups from cloudbackup1004 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1224975 (https://phabricator.wikimedia.org/T375217) [15:22:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:23:28] (03CR) 
10Andrew Bogott: [C:03+2] wmcs cinder backups: move all backups to 2003 so 2004 can be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/1224974 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott) [15:25:08] (03CR) 10Andrew Bogott: [C:03+2] cloudbackup: flip all backups from cloudbackup1004 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/1224975 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott) [15:27:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:29:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T413525)', diff saved to https://phabricator.wikimedia.org/P86983 and previous config saved to /var/cache/conftool/dbconfig/20260109-152935-marostegui.json [15:29:39] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:29:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [15:31:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P86984 and previous config saved to /var/cache/conftool/dbconfig/20260109-153128-marostegui.json [15:31:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11507818 (10BTullis) 05Open→03Resolved Checked the current state of the disks. ` btullis@an-worker1191:~$ sudo perccli64 /c0 show all PD L... 
[15:32:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:34:04] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1191 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [15:34:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:22] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11507843 (10elukey) Tried another test, this time on build2002 (bookworm, with a more up-to-date version of dockerd). I tried to push the calico typha's image (less... 
[15:37:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:41:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P86985 and previous config saved to /var/cache/conftool/dbconfig/20260109-154136-marostegui.json [15:42:02] RESOLVED: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:42:12] (03PS1) 10Fabfur: cache:haproxy: add new contact type [puppet] - 10https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173) [15:42:51] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11507887 (10BTullis) 05Open→03Resolved Created the new VD. ` btullis@an-worker1148:~$ sudo megacli -CfgLdAdd -r0 [32:1] -a0 Adap... 
[15:43:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86986 and previous config saved to /var/cache/conftool/dbconfig/20260109-154301-marostegui.json [15:43:06] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:43:06] 06SRE, 10SRE-Access-Requests: Requesting access to DataPlatform for trueg - https://phabricator.wikimedia.org/T414192#11507890 (10trueg) @gmodena , could you please help here. I assume that I need full access but I really do not know. [15:43:07] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:45:37] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11507899 (10Dzahn) Due to an unrelated temp issue with the DNS repo we changed the plan slightly and disabled the site at trafficserver level... 
[15:48:00] (03PS1) 10Giuseppe Lavagetto: cache::text: add eqiad and codfw WMCS public addresses to extra_trust [puppet] - 10https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) [15:49:37] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:50:44] (03CR) 10CDanis: [C:03+1] cache::text: add eqiad and codfw WMCS public addresses to extra_trust [puppet] - 10https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:51:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T413525)', diff saved to https://phabricator.wikimedia.org/P86987 and previous config saved to /var/cache/conftool/dbconfig/20260109-155143-marostegui.json [15:51:48] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:52:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [15:52:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T413525)', diff saved to https://phabricator.wikimedia.org/P86988 and previous config saved to /var/cache/conftool/dbconfig/20260109-155207-marostegui.json [15:53:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P86989 and previous config saved to /var/cache/conftool/dbconfig/20260109-155309-marostegui.json [15:54:13] (03CR) 10Scott French: [C:03+1] cache::text: add eqiad and codfw WMCS public addresses to extra_trust [puppet] - 10https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) 
[15:55:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:57:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2207.codfw.wmnet with reason: Maintenance [15:57:39] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] cache::text: add eqiad and codfw WMCS public addresses to extra_trust [puppet] - 10https://gerrit.wikimedia.org/r/1224978 (https://phabricator.wikimedia.org/T406545) (owner: 10Giuseppe Lavagetto) [15:57:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86990 and previous config saved to /var/cache/conftool/dbconfig/20260109-155743-marostegui.json [15:57:50] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [15:58:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11507990 (10Jgreen) 05Open→03Resolved Done! 
[16:01:49] (03Restored) 10Dzahn: point wikipedia25.org to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:01:57] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:02:20] (03CR) 10Dzahn: "testing CI after netbox was synced" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:02:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T413525)', diff saved to https://phabricator.wikimedia.org/P86991 and previous config saved to /var/cache/conftool/dbconfig/20260109-160255-marostegui.json [16:02:59] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:03:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P86992 and previous config saved to /var/cache/conftool/dbconfig/20260109-160318-marostegui.json [16:04:30] (03CR) 10Thcipriani: [C:03+1] "Should be safe to merge now. The last two train branches (1.46.0-wmf.7 and 1.46.0-wmf.10) both contain TestKitchen." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [16:07:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [16:09:02] PROBLEM - MegaRAID on an-worker1148 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:13:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P86993 and previous config saved to /var/cache/conftool/dbconfig/20260109-161304-marostegui.json [16:13:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86994 and previous config saved to /var/cache/conftool/dbconfig/20260109-161326-marostegui.json [16:13:31] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:13:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [16:13:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance [16:15:47] (03CR) 10Dzahn: [C:04-2] "do not merge - just here for testing CI" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:18:32] (03CR) 10Dillon: "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) (owner: 10Kgraessle) [16:18:38] (03CR) 10Dillon: [C:03+1] When filtering for edits with high Revert Risk, Recent Changes shouldn't display edits from non-main namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1224814 (https://phabricator.wikimedia.org/T404200) (owner: 10Kgraessle) [16:23:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P86995 and previous config saved to /var/cache/conftool/dbconfig/20260109-162312-marostegui.json [16:25:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T413525)', diff saved to https://phabricator.wikimedia.org/P86996 and previous config saved to /var/cache/conftool/dbconfig/20260109-162529-marostegui.json [16:25:34] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:27:35] (03PS2) 10DCausse: airflow-search: add enterprise extra_secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224894 (https://phabricator.wikimedia.org/T414066) [16:33:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T413525)', diff saved to https://phabricator.wikimedia.org/P86997 and previous config saved to /var/cache/conftool/dbconfig/20260109-163320-marostegui.json [16:33:25] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:33:33] (03PS2) 10Fabfur: cache:haproxy: add new contact type [puppet] - 10https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173) [16:33:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [16:34:35] (03PS1) 10Btullis: Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) [16:34:44] (03PS2) 10Btullis: Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) [16:34:47] (03CR) 10CI reject: [V:04-1] Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [16:35:02] (03CR) 10Btullis: [C:03+2] Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [16:35:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P86998 and previous config saved to /var/cache/conftool/dbconfig/20260109-163538-marostegui.json [16:36:49] (03Merged) 10jenkins-bot: Fix malformed networkpolicy for spark-support and kyuubi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224694 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [16:39:32] (03PS3) 10Btullis: Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) [16:40:24] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216 (10RobH) 03NEW [16:40:53] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216#11508121 (10RobH) a:03BTullis Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add... 
[16:41:10] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-22] - https://phabricator.wikimedia.org/T414216#11508129 (10RobH) [16:42:41] (03CR) 10Btullis: [C:03+2] Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [16:44:23] (03Merged) 10jenkins-bot: Configure the kyuubi-defaults.conf file with kerberos details [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224984 (https://phabricator.wikimedia.org/T413977) (owner: 10Btullis) [16:45:16] (03PS3) 10Fabfur: cache:haproxy: add new contact type [puppet] - 10https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173) [16:45:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P86999 and previous config saved to /var/cache/conftool/dbconfig/20260109-164546-marostegui.json [16:49:07] (03CR) 10Scott French: [C:03+1] "Thanks, Fabrizio!" [puppet] - 10https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173) (owner: 10Fabfur) [16:49:46] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:52:47] 06SRE, 10SRE-Access-Requests: Requesting access to Grafana and Logstash for trueg - https://phabricator.wikimedia.org/T414187#11508155 (10trueg) I am sorry, I do not know what this means: "Grafana access is granted by having an LDAP account." Is the LDAP account not my dev account? 
[16:54:50] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002" [16:54:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002" [16:54:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:55:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T413525)', diff saved to https://phabricator.wikimedia.org/P87000 and previous config saved to /var/cache/conftool/dbconfig/20260109-165554-marostegui.json [16:55:58] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [16:56:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2225.codfw.wmnet with reason: Maintenance [16:56:15] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:56:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2225 (T413525)', diff saved to https://phabricator.wikimedia.org/P87001 and previous config saved to /var/cache/conftool/dbconfig/20260109-165619-marostegui.json [16:56:25] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:57:51] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:00:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [17:00:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T413525)', diff saved to https://phabricator.wikimedia.org/P87002 and previous config saved to /var/cache/conftool/dbconfig/20260109-170033-marostegui.json [17:02:47] (03CR) 10Fabfur: [C:03+2] cache:haproxy: add new contact 
type [puppet] - 10https://gerrit.wikimedia.org/r/1224977 (https://phabricator.wikimedia.org/T414173) (owner: 10Fabfur) [17:04:24] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [17:04:32] pt1979@cumin2002 netbox (PID 3302416) is awaiting input [17:07:06] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002" [17:07:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002" [17:07:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:10:21] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:10:55] 06SRE, 06Traffic, 13Patch-For-Review: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11508207 (10Fabfur) We're now allowing this new type of contact information in User-Agent string, this change should be propagated shortly. P... 
[17:13:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002" [17:13:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS for IPV6 - pt1979@cumin2002" [17:13:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:16:29] (03CR) 10Dzahn: [C:04-2] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:17:07] (03CR) 10Papaul: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:19:55] (03Abandoned) 10Dzahn: point wikipedia25.org to ncredir [dns] - 10https://gerrit.wikimedia.org/r/1224917 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:23:00] !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [17:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:25:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T413525)', diff saved to https://phabricator.wikimedia.org/P87003 and previous config saved to /var/cache/conftool/dbconfig/20260109-172519-marostegui.json [17:25:23] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:29:37] (03PS1) 10Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 [17:29:45] !log jhathaway@cumin1003 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [17:30:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T413525)', diff saved to https://phabricator.wikimedia.org/P87004 and previous config saved to /var/cache/conftool/dbconfig/20260109-173023-marostegui.json [17:30:27] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:31:58] (03CR) 10CI reject: [V:04-1] helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 (owner: 10Jasmine) [17:32:40] (03PS2) 10Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224998 [17:33:23] (03PS1) 10Bking: opensearch-ipoid: move service to "production" status. [puppet] - 10https://gerrit.wikimedia.org/r/1224999 (https://phabricator.wikimedia.org/T412447) [17:35:06] (03CR) 10Scott French: [C:03+1] "Thanks, Blake!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1224945 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [17:35:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P87005 and previous config saved to /var/cache/conftool/dbconfig/20260109-173527-marostegui.json [17:38:06] (03CR) 10Blake: [C:03+2] sre.discovery.datacenter: use service registry for exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/1224945 (https://phabricator.wikimedia.org/T412211) (owner: 10Blake) [17:38:51] (03PS2) 10Bking: opensearch-ipoid: move service to "production" status. 
[puppet] - 10https://gerrit.wikimedia.org/r/1224999 (https://phabricator.wikimedia.org/T412447) [17:40:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P87006 and previous config saved to /var/cache/conftool/dbconfig/20260109-174031-marostegui.json [17:41:58] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11508332 (10elukey) The sequence of events before the blob unknown seems to be the following on the docker registry: 1) "PUT /v2/test/calico/node/blobs/uploads/...... [17:45:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P87007 and previous config saved to /var/cache/conftool/dbconfig/20260109-174536-marostegui.json [17:46:59] 06SRE: New SRE manager - Get emails sent to noc - https://phabricator.wikimedia.org/T414223 (10MLechvien-WMF) 03NEW [17:48:14] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie [17:48:46] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11508364 (10elukey) Looks like it worked! ` elukey@build2002:~$ sudo docker push registry1004.eqiad.wmnet:5002/test/restricted/mediawiki-webserver:2025-03-04-10595... 
[17:50:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P87008 and previous config saved to /var/cache/conftool/dbconfig/20260109-175039-marostegui.json [17:55:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T413525)', diff saved to https://phabricator.wikimedia.org/P87009 and previous config saved to /var/cache/conftool/dbconfig/20260109-175544-marostegui.json [17:55:48] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:56:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2226.codfw.wmnet with reason: Maintenance [17:56:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T413525)', diff saved to https://phabricator.wikimedia.org/P87010 and previous config saved to /var/cache/conftool/dbconfig/20260109-175609-marostegui.json [17:58:44] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:59:02] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:00:06] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T413525)', diff saved to https://phabricator.wikimedia.org/P87011 and previous config saved to /var/cache/conftool/dbconfig/20260109-180047-marostegui.json [18:00:51] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994#11508410 (10KFrancis) Hi all, the NDA is out for signatures. I'll confirm when it's complete. 
[18:00:51] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:01:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [18:01:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T413525)', diff saved to https://phabricator.wikimedia.org/P87012 and previous config saved to /var/cache/conftool/dbconfig/20260109-180112-marostegui.json [18:04:10] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:41] (03PS5) 10Clare Ming: Deploy TestKitchen to Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) [18:05:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [18:09:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T413525)', diff saved to https://phabricator.wikimedia.org/P87013 and previous config saved to /var/cache/conftool/dbconfig/20260109-180900-marostegui.json [18:09:04] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:09:13] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157#11508477 (10KFrancis) Hi all, the NDA is out for signatures. I'll confirm when it's complete. 
[18:12:22] (03PS1) 10Clare Ming: Deploy TestKitchen to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) [18:13:05] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [18:19:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P87014 and previous config saved to /var/cache/conftool/dbconfig/20260109-181908-marostegui.json [18:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:24:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [18:29:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P87015 and previous config saved to /var/cache/conftool/dbconfig/20260109-182917-marostegui.json [18:31:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T413525)', diff saved to https://phabricator.wikimedia.org/P87016 and previous config saved to /var/cache/conftool/dbconfig/20260109-183103-marostegui.json [18:31:07] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:31:58] !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [18:35:50] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 
- https://phabricator.wikimedia.org/T412733#11508591 (10cmooney) If we configured the mgmt switches we'd just have spanning-tree portfast or something on access interfaces, so they wouldn't be s... [18:39:08] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [18:39:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T413525)', diff saved to https://phabricator.wikimedia.org/P87017 and previous config saved to /var/cache/conftool/dbconfig/20260109-183926-marostegui.json [18:39:30] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:39:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2238.codfw.wmnet with reason: Maintenance [18:39:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87018 and previous config saved to /var/cache/conftool/dbconfig/20260109-183939-marostegui.json [18:41:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P87019 and previous config saved to /var/cache/conftool/dbconfig/20260109-184111-marostegui.json [18:41:18] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:42:44] (03PS1) 10Cathal Mooney: Add include statement for netbox snippet for 2620:0:860:137::/64 [dns] - 10https://gerrit.wikimedia.org/r/1225007 (https://phabricator.wikimedia.org/T410717) [18:43:27] (03CR) 10CI reject: [V:04-1] Add include statement for netbox snippet for 2620:0:860:137::/64 [dns] - 10https://gerrit.wikimedia.org/r/1225007 (https://phabricator.wikimedia.org/T410717) (owner: 10Cathal Mooney) [18:44:20] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add DNS for 
codfw lsw1-a4 to mr1-codfw IPv6 IPs - cmooney@cumin1003" [18:44:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add DNS for codfw lsw1-a4 to mr1-codfw IPv6 IPs - cmooney@cumin1003" [18:44:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:48] (03PS2) 10Cathal Mooney: Add include statement for netbox snippet for 2620:0:860:137::/64 [dns] - 10https://gerrit.wikimedia.org/r/1225007 (https://phabricator.wikimedia.org/T410717) [18:45:50] (03CR) 10Cathal Mooney: [C:03+2] Add include statement for netbox snippet for 2620:0:860:137::/64 [dns] - 10https://gerrit.wikimedia.org/r/1225007 (https://phabricator.wikimedia.org/T410717) (owner: 10Cathal Mooney) [18:46:02] !log cmooney@dns2005 START - running authdns-update [18:46:51] !log cmooney@dns2005 END - running authdns-update [18:47:06] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:49:51] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:51:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P87020 and previous config saved to /var/cache/conftool/dbconfig/20260109-185120-marostegui.json [18:57:18] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie [19:01:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T413525)', diff saved to https://phabricator.wikimedia.org/P87021 and previous config saved to /var/cache/conftool/dbconfig/20260109-190128-marostegui.json [19:01:32] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:01:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [19:04:10] FIRING: [2x] 
ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87022 and previous config saved to /var/cache/conftool/dbconfig/20260109-190936-marostegui.json [19:09:40] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:13:46] (03CR) 10Santiago Faci: [C:03+1] "That change as already merged so we can resolve this comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [19:14:36] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11508669 (10Papaul) I think the quick fix here is for us to go with your option (2) exclude any interface called "mgmt0" for the time being and when... 
[19:17:04] (03PS2) 10Santiago Faci: Deploy TestKitchen to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [19:19:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P87023 and previous config saved to /var/cache/conftool/dbconfig/20260109-191944-marostegui.json [19:25:36] (03CR) 10Santiago Faci: [C:03+1] Deploy TestKitchen to Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217360 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [19:25:50] (03CR) 10Santiago Faci: [C:03+1] Deploy TestKitchen to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225005 (https://phabricator.wikimedia.org/T407806) (owner: 10Clare Ming) [19:28:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1254.eqiad.wmnet with reason: Maintenance [19:28:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T413525)', diff saved to https://phabricator.wikimedia.org/P87024 and previous config saved to /var/cache/conftool/dbconfig/20260109-192844-marostegui.json [19:28:48] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:29:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P87025 and previous config saved to /var/cache/conftool/dbconfig/20260109-192953-marostegui.json [19:36:03] !log jhathaway@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [19:40:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87026 and previous config saved to /var/cache/conftool/dbconfig/20260109-194001-marostegui.json [19:40:05] T413525: Add 
il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:43:10] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [19:54:53] !log jhathaway@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [19:57:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T413525)', diff saved to https://phabricator.wikimedia.org/P87027 and previous config saved to /var/cache/conftool/dbconfig/20260109-195731-marostegui.json [19:57:35] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:59:21] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage [20:00:08] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [20:02:47] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage [20:07:04] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11508763 (10SerDIDG) Maybe it's somehow related. I use [[https://github.com/siddharthvp/mwn|siddharthvp/mwn]] for deploying my gadget. But a couple of days ago my... 
[20:07:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P87028 and previous config saved to /var/cache/conftool/dbconfig/20260109-200739-marostegui.json [20:17:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P87029 and previous config saved to /var/cache/conftool/dbconfig/20260109-201748-marostegui.json [20:19:06] !log jhathaway@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie [20:19:54] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11508769 (10taavi) >>! In T414173#11508763, @SerDIDG wrote: > Maybe it's somehow related. I use [[https://github.com/siddharthvp/mwn|siddharthvp/mwn]] for deployin... [20:21:04] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [20:21:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:23:41] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on wdqs1029.eqiad.wmnet with reason: T412451 [20:23:44] T412451: 4 failed reimages on wdqs1029, 1030, 1031, 1032 - https://phabricator.wikimedia.org/T412451 [20:26:09] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:27:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:27:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T413525)', diff saved to https://phabricator.wikimedia.org/P87030 and previous config saved to /var/cache/conftool/dbconfig/20260109-202756-marostegui.json [20:28:00] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:28:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1259.eqiad.wmnet with reason: Maintenance [20:28:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T413525)', diff saved to https://phabricator.wikimedia.org/P87031 and previous config saved to /var/cache/conftool/dbconfig/20260109-202821-marostegui.json [20:29:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:32:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:33:09] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:33:31] (03PS1) 10JHathaway: debian installer: format EFI partions [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) [20:33:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) (owner: 10JHathaway) [20:37:17] (03PS2) 10JHathaway: debian installer: format EFI partions [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) [20:37:18] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1225021 (https://phabricator.wikimedia.org/T412451) (owner: 10JHathaway) [20:47:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:58:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T413525)', diff saved to https://phabricator.wikimedia.org/P87032 and previous config saved to /var/cache/conftool/dbconfig/20260109-205805-marostegui.json [20:58:09] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [20:58:54] (03CR) 10Bking: [C:03+2] "self-merging now so I'll have a couple of hours to make sure it doesn't set off alerts before we head out for the weekend." 
[puppet] - 10https://gerrit.wikimedia.org/r/1224999 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [20:59:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:02:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:02:56] (03PS1) 10BryanDavis: extension-list: add a bogus extension to test l10n-update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225023 (https://phabricator.wikimedia.org/T411516) [21:05:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:07:31] (03CR) 10BryanDavis: "Patch to merge as part of the 1.46-wmf.11 train process. See the commit message for an explanation of what is being tested and why." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1225023 (https://phabricator.wikimedia.org/T411516) (owner: BryanDavis)
[21:08:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P87033 and previous config saved to /var/cache/conftool/dbconfig/20260109-210813-marostegui.json
[21:16:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:18:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P87034 and previous config saved to /var/cache/conftool/dbconfig/20260109-211822-marostegui.json
[21:19:10] RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:24:10] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:27:15] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:27:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:28:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T413525)', diff saved to https://phabricator.wikimedia.org/P87035 and previous config saved to
/var/cache/conftool/dbconfig/20260109-212830-marostegui.json
[21:28:34] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[21:28:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[21:33:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:11] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55565 bytes in 6.740 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:37:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 6.884 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:53:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:59:58] (PS1) Bking: [DO NOT MERGE] opensearch-ipoid: Add a path and timeout to blackbox check [puppet] - https://gerrit.wikimedia.org/r/1225029 (https://phabricator.wikimedia.org/T414037)
[22:00:20] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1225029 (https://phabricator.wikimedia.org/T414037) (owner: Bking)
[22:02:11] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:11:22] (Abandoned) Bking: [DO NOT MERGE] opensearch-ipoid: Add a path and timeout to blackbox check [puppet] - https://gerrit.wikimedia.org/r/1225029 (https://phabricator.wikimedia.org/T414037) (owner: Bking)
[22:11:47] (CR) Ryan Kemper: "Checking grafana explore for `probe_ssl_earliest_cert_expiry{module=~'.*ipoid.*'}`, this had the intended effect" [puppet] - https://gerrit.wikimedia.org/r/1224999 (https://phabricator.wikimedia.org/T412447) (owner: Bking)
[22:12:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:24:10] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:27:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:29:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:33:13] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:51:19] (PS1) Zabe: Close kywikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1225034 (https://phabricator.wikimedia.org/T413845)
[23:09:45] (PS1) Gerrit maintenance bot: Add kai to langlist helper [dns] - https://gerrit.wikimedia.org/r/1225036 (https://phabricator.wikimedia.org/T414234)
[23:31:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:32:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:33:26] (PS1) Jasmine: charts: add sophroid deployment chart [deployment-charts] - https://gerrit.wikimedia.org/r/1225042
[23:35:01] (CR) CI reject: [V:-1] charts: add sophroid deployment chart [deployment-charts] - https://gerrit.wikimedia.org/r/1225042 (owner: Jasmine)
[23:35:19] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:36:55] (PS2) Jasmine: charts: add sophroid deployment chart [deployment-charts] - https://gerrit.wikimedia.org/r/1225042
[23:38:09] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:47:53] (Abandoned) Jasmine: charts: add Sophroid deployment chart [deployment-charts] - https://gerrit.wikimedia.org/r/1217570 (owner: Jasmine)
[23:51:03] (PS3) Jasmine: helmfile.d: add sophroid helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1224998