[00:01:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T407997)', diff saved to https://phabricator.wikimedia.org/P84810 and previous config saved to /var/cache/conftool/dbconfig/20251105-000151-marostegui.json
[00:01:55] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[00:31:21] (PS1) Samwilson: mediawiki tables-catalog: Add watchlist labels tables [puppet] - https://gerrit.wikimedia.org/r/1201842 (https://phabricator.wikimedia.org/T406843)
[00:38:34] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1201843
[00:38:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1201843 (owner: TrainBranchBot)
[00:45:35] (PS1) Scott French: P:cache::haproxy: ensure x-requestctl is updated [puppet] - https://gerrit.wikimedia.org/r/1201844 (https://phabricator.wikimedia.org/T403220)
[00:51:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1201843 (owner: TrainBranchBot)
[00:56:08] (PS1) BryanDavis: toolforge: Handle interwiki redirects in front proxy [puppet] - https://gerrit.wikimedia.org/r/1201847 (https://phabricator.wikimedia.org/T247432)
[00:58:21] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:00:01] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1201319 (owner: TrainBranchBot)
[01:06:27] (CR) BryanDavis: "I tested a cut-and-paste version of this in toolsbeta which seemed to work as hoped." [puppet] - https://gerrit.wikimedia.org/r/1201847 (https://phabricator.wikimedia.org/T247432) (owner: BryanDavis)
[01:08:43] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1201849
[01:08:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1201849 (owner: TrainBranchBot)
[01:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:29:48] (PS1) Tim Starling: admin: Add FIDO key for tstarling [puppet] - https://gerrit.wikimedia.org/r/1201850
[01:33:53] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1201849 (owner: TrainBranchBot)
[01:40:16] (CR) CDanis: [C:+1] P:cache::haproxy: ensure x-requestctl is updated [puppet] - https://gerrit.wikimedia.org/r/1201844 (https://phabricator.wikimedia.org/T403220) (owner: Scott French)
[01:43:50] (PS1) CDanis: Add discovery-conftool-state to ignored stale texts [alerts] - https://gerrit.wikimedia.org/r/1201852
[01:50:35] (PS1) Tim Starling: recentchanges: Fix watchlistactivity=all, i.e. seen/unseen conflict [core] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1201853 (https://phabricator.wikimedia.org/T408167)
[02:09:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:13:21] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:42:33] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1201853 (https://phabricator.wikimedia.org/T408167) (owner: Tim Starling)
[02:46:34] (Merged) jenkins-bot: recentchanges: Fix watchlistactivity=all, i.e. seen/unseen conflict [core] (wmf/1.46.0-wmf.1) - https://gerrit.wikimedia.org/r/1201853 (https://phabricator.wikimedia.org/T408167) (owner: Tim Starling)
[02:47:32] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1201853|recentchanges: Fix watchlistactivity=all, i.e. seen/unseen conflict (T408167)]]
[02:47:35] T408167: Selecting both "Unseen changes" and "Seen changes" filters shows nothing at all - https://phabricator.wikimedia.org/T408167
[02:50:02] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1201853|recentchanges: Fix watchlistactivity=all, i.e. seen/unseen conflict (T408167)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:53:54] !log tstarling@deploy2002 tstarling: Continuing with sync
[02:58:11] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201853|recentchanges: Fix watchlistactivity=all, i.e. seen/unseen conflict (T408167)]] (duration: 10m 39s)
[02:58:14] T408167: Selecting both "Unseen changes" and "Seen changes" filters shows nothing at all - https://phabricator.wikimedia.org/T408167
[03:04:05] FIRING: [8x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:45:12] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11343263 (VRiley-WMF) Resubmitted SR with TSR attached.
[04:53:21] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:21] FIRING: [10x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:33:21] FIRING: [10x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:43:39] SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11343310 (Krd) My personal opinion is that we should disable notifications completely, but this perhaps isn't consensu...
[06:09:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[06:17:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:17:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T407997)', diff saved to https://phabricator.wikimedia.org/P84814 and previous config saved to /var/cache/conftool/dbconfig/20251105-061737-marostegui.json
[06:17:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[06:19:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T407997)', diff saved to https://phabricator.wikimedia.org/P84815 and previous config saved to /var/cache/conftool/dbconfig/20251105-061950-marostegui.json
[06:22:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2191 with weight 0 T409168', diff saved to https://phabricator.wikimedia.org/P84816 and previous config saved to /var/cache/conftool/dbconfig/20251105-062230-marostegui.json
[06:22:34] T409168: Switchover x1 master (db2215 -> db2191) - https://phabricator.wikimedia.org/T409168
[06:22:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Primary switchover x1 T409168
[06:23:28] (CR) Marostegui: [C:+2] mariadb: Promote db2191 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1201593 (https://phabricator.wikimedia.org/T409168) (owner: Gerrit maintenance bot)
[06:27:08] a!log Starting x1 codfw failover from db2215 to db2191 - T409168
[06:27:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set x1 codfw as read-only for maintenance - T409168', diff saved to https://phabricator.wikimedia.org/P84817 and previous config saved to /var/cache/conftool/dbconfig/20251105-062723-marostegui.json
[06:27:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2191 to x1 primary and set section read-write T409168', diff saved to https://phabricator.wikimedia.org/P84818 and previous config saved to /var/cache/conftool/dbconfig/20251105-062745-marostegui.json
[06:27:49] T409168: Switchover x1 master (db2215 -> db2191) - https://phabricator.wikimedia.org/T409168
[06:28:15] (CR) Marostegui: [C:+2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1201594 (https://phabricator.wikimedia.org/T409168) (owner: Gerrit maintenance bot)
[06:28:21] !log marostegui@dns1006 START - running authdns-update
[06:29:10] !log marostegui@dns1006 END - running authdns-update
[06:29:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2215 T409168', diff saved to https://phabricator.wikimedia.org/P84819 and previous config saved to /var/cache/conftool/dbconfig/20251105-062920-marostegui.json
[06:29:58] SRE, collaboration-services, Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11343349 (Krd) All double checked now and appears correct, user reports that it still doesn't work. Please see: https://vrt-wiki.wikimedia.org/w/index.php?diff=134462
[06:30:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2215 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P84820 and previous config saved to /var/cache/conftool/dbconfig/20251105-063009-root.json
[06:32:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Primary switchover x1 T409168
[06:32:29] !log marostegui@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on 14 hosts with reason: Primary switchover x1 T409168
[06:35:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P84821 and previous config saved to /var/cache/conftool/dbconfig/20251105-063458-marostegui.json
[06:38:46] (PS1) Marostegui: db1203: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1201868
[06:39:30] (CR) Marostegui: [C:+2] db1203: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1201868 (owner: Marostegui)
[06:40:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[06:40:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1203 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84822 and previous config saved to /var/cache/conftool/dbconfig/20251105-064028-marostegui.json
[06:41:31] (PS1) Marostegui: mariadb: Decommission es1034 [puppet] - https://gerrit.wikimedia.org/r/1201870 (https://phabricator.wikimedia.org/T409025)
[06:42:10] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1034.eqiad.wmnet
[06:42:53] (CR) Marostegui: [C:+2] mariadb: Decommission es1034 [puppet] - https://gerrit.wikimedia.org/r/1201870 (https://phabricator.wikimedia.org/T409025) (owner: Marostegui)
[06:45:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2215 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P84823 and previous config saved to /var/cache/conftool/dbconfig/20251105-064515-root.json
[06:48:22] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[06:48:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84824 and previous config saved to /var/cache/conftool/dbconfig/20251105-064829-root.json
[06:49:18] SRE, SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11343365 (Volker_E) Hi @Ladsgroup et al, I've re-read the L3 and like to hereby formalize my agreement with it. Please note, that I can't resign the document. {F69903469} >>! In T406243#1127...
[06:50:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P84825 and previous config saved to /var/cache/conftool/dbconfig/20251105-065008-marostegui.json
[06:51:48] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1034.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[06:52:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1034.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[06:52:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:52:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1034.eqiad.wmnet
[06:52:36] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1034.eqiad.wmnet - https://phabricator.wikimedia.org/T409025#11343367 (Marostegui) a:Marostegui→None
[06:52:47] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1034.eqiad.wmnet - https://phabricator.wikimedia.org/T409025#11343372 (Marostegui) This is ready for #dc-ops
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T0700)
[07:00:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2215 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P84826 and previous config saved to /var/cache/conftool/dbconfig/20251105-070021-root.json
[07:03:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1203 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84827 and previous config saved to /var/cache/conftool/dbconfig/20251105-070335-root.json
[07:05:12] (PS1) Marostegui: db2212: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1201898 (https://phabricator.wikimedia.org/T407463)
[07:05:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T407997)', diff saved to https://phabricator.wikimedia.org/P84828 and previous config saved to /var/cache/conftool/dbconfig/20251105-070516-marostegui.json
[07:05:20] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[07:05:30] (PS1) Gerrit maintenance bot: mariadb: Promote db2212 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1201900 (https://phabricator.wikimedia.org/T409255)
[07:05:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[07:05:35] (PS1) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1201901 (https://phabricator.wikimedia.org/T409255)
[07:05:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T407997)', diff saved to https://phabricator.wikimedia.org/P84830 and previous config saved to /var/cache/conftool/dbconfig/20251105-070540-marostegui.json
[07:06:04] (CR) Marostegui: [C:+2] db2212: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1201898 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
[07:07:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2212.codfw.wmnet with reason: Maintenance
[07:07:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2212 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84831 and previous config saved to /var/cache/conftool/dbconfig/20251105-070707-marostegui.json
[07:12:50] (PS1) Marostegui: instances.yaml: Add es1033 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1201996 (https://phabricator.wikimedia.org/T409257)
[07:13:26] (CR) Marostegui: [C:+2] instances.yaml: Add es1033 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1201996 (https://phabricator.wikimedia.org/T409257) (owner: Marostegui)
[07:15:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2212 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84832 and previous config saved to /var/cache/conftool/dbconfig/20251105-071510-root.json
[07:15:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2215 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P84833 and previous config saved to /var/cache/conftool/dbconfig/20251105-071527-root.json
[07:16:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add es1033 to es2 depooled T409257 T407472', diff saved to https://phabricator.wikimedia.org/P84834 and previous config saved to /var/cache/conftool/dbconfig/20251105-071605-marostegui.json
[07:16:11] T409257: Move es1033 (es2 Debian Trixie) to es7 - https://phabricator.wikimedia.org/T409257
[07:16:11] T407472: Install a testing db with Debian Trixie - https://phabricator.wikimedia.org/T407472
[07:16:59] (PS1) Marostegui: es1033: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1202002 (https://phabricator.wikimedia.org/T407472)
[07:18:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84835 and previous config saved to /var/cache/conftool/dbconfig/20251105-071841-root.json
[07:18:53] (CR) Marostegui: [C:+2] es1033: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1202002 (https://phabricator.wikimedia.org/T407472) (owner: Marostegui)
[07:21:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 1%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84836 and previous config saved to /var/cache/conftool/dbconfig/20251105-072145-root.json
[07:23:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T407997)', diff saved to https://phabricator.wikimedia.org/P84837 and previous config saved to /var/cache/conftool/dbconfig/20251105-072326-marostegui.json
[07:23:30] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[07:30:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2212 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84838 and previous config saved to /var/cache/conftool/dbconfig/20251105-073016-root.json
[07:30:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2215 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P84839 and previous config saved to /var/cache/conftool/dbconfig/20251105-073033-root.json
[07:33:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84840 and previous config saved to /var/cache/conftool/dbconfig/20251105-073347-root.json
[07:36:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 2%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84841 and previous config saved to /var/cache/conftool/dbconfig/20251105-073651-root.json
[07:38:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P84842 and previous config saved to /var/cache/conftool/dbconfig/20251105-073833-marostegui.json
[07:45:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2212 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84843 and previous config saved to /var/cache/conftool/dbconfig/20251105-074521-root.json
[07:45:33] (CR) Stevemunene: "Ack, thanks Brian." [puppet] - https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: Stevemunene)
[07:51:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 3%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84844 and previous config saved to /var/cache/conftool/dbconfig/20251105-075156-root.json
[07:53:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P84845 and previous config saved to /var/cache/conftool/dbconfig/20251105-075341-marostegui.json
[08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:00:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2212 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84846 and previous config saved to /var/cache/conftool/dbconfig/20251105-080027-root.json
[08:07:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 4%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84847 and previous config saved to /var/cache/conftool/dbconfig/20251105-080702-root.json
[08:08:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T407997)', diff saved to https://phabricator.wikimedia.org/P84848 and previous config saved to /var/cache/conftool/dbconfig/20251105-080849-marostegui.json
[08:08:53] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[08:08:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:11:54] (PS1) Marostegui: installserver: Remove es2028 [puppet] - https://gerrit.wikimedia.org/r/1202047 (https://phabricator.wikimedia.org/T408777)
[08:13:05] (PS1) Ryan Kemper: (wip) wdqs: add availability sli recording rules [puppet] - https://gerrit.wikimedia.org/r/1202049
[08:13:16] !log brouberol@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on an-launcher1002.eqiad.wmnet with reason: host is being decommissioned
[08:13:47] (PS2) Ryan Kemper: (wip) wdqs: add availability sli recording rules [puppet] - https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966)
[08:14:21] (CR) Marostegui: [C:+2] installserver: Remove es2028 [puppet] - https://gerrit.wikimedia.org/r/1202047 (https://phabricator.wikimedia.org/T408777) (owner: Marostegui)
[08:15:52] (PS2) Brouberol: Define the growthbook-backend domain [dns] - https://gerrit.wikimedia.org/r/1201075 (https://phabricator.wikimedia.org/T408903)
[08:16:02] (CR) Elukey: (wip) wdqs: add availability sli recording rules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[08:16:30] (PS2) Brouberol: trafficserver: rediredct growthbook-backend from public to private domains [puppet] - https://gerrit.wikimedia.org/r/1201074 (https://phabricator.wikimedia.org/T408903)
[08:17:12] (PS2) Brouberol: growthbook: set the APP_ORIGIN and API_HOST env vars to the public domains [deployment-charts] - https://gerrit.wikimedia.org/r/1201081 (https://phabricator.wikimedia.org/T408903)
[08:17:24] (CR) Brouberol: growthbook: set the APP_ORIGIN and API_HOST env vars to the public domains (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1201081 (https://phabricator.wikimedia.org/T408903) (owner: Brouberol)
[08:17:31] (CR) Brouberol: dse-k8s-eqiad: add the backend domain to the certificate SANs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1201080 (https://phabricator.wikimedia.org/T408903) (owner: Brouberol)
[08:17:47] (CR) Brouberol: [C:+1] trafficserver: rediredct growthbook-backend from public to private domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1201074 (https://phabricator.wikimedia.org/T408903) (owner: Brouberol)
[08:18:00] (CR) Brouberol: Define the growthbook-backend domain (2 comments) [dns] - https://gerrit.wikimedia.org/r/1201075 (https://phabricator.wikimedia.org/T408903) (owner: Brouberol)
[08:20:41] (PS1) Stevemunene: stat: Remove the airflow package from stat hosts [puppet] - https://gerrit.wikimedia.org/r/1202050 (https://phabricator.wikimedia.org/T409262)
[08:21:28] !log run gitlab-package-puller by hand on apt-staging2001
[08:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84849 and previous config saved to /var/cache/conftool/dbconfig/20251105-082209-root.json
[08:25:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[08:25:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T407997)', diff saved to https://phabricator.wikimedia.org/P84850 and previous config saved to /var/cache/conftool/dbconfig/20251105-082533-marostegui.json
[08:25:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[08:26:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T407997)', diff saved to https://phabricator.wikimedia.org/P84851 and previous config saved to /var/cache/conftool/dbconfig/20251105-082642-marostegui.json
[08:29:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P84852 and previous config saved to /var/cache/conftool/dbconfig/20251105-082920-root.json
[08:29:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[08:31:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2218.codfw.wmnet with reason: Maintenance
[08:35:29] (CR) Brouberol: growthbook: define public configuration for s3 file uploads (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1201086 (https://phabricator.wikimedia.org/T408415) (owner: Brouberol)
[08:36:21] (CR) Stevemunene: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1202050 (https://phabricator.wikimedia.org/T409262) (owner: Stevemunene)
[08:40:42] (CR) Elukey: [C:+2] data.yaml: change wiki replica to mediawiki replica [puppet] - https://gerrit.wikimedia.org/r/1183247 (owner: Novem Linguae)
[08:41:29] (CR) Novem Linguae: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1183247 (owner: Novem Linguae)
[08:44:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84853 and previous config saved to /var/cache/conftool/dbconfig/20251105-084407-root.json
[08:44:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P84854 and previous config saved to /var/cache/conftool/dbconfig/20251105-084426-root.json
[08:45:58] (CR) Brouberol: [C:+2] postgresql-growthbook: add additional PG parameters [deployment-charts] - https://gerrit.wikimedia.org/r/1201082 (https://phabricator.wikimedia.org/T406578) (owner: Brouberol)
[08:47:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:47:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply
[08:47:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply
[08:51:09] (PS1) Marostegui: db2249: Make a note about 1P testing host [puppet] - https://gerrit.wikimedia.org/r/1202051 (https://phabricator.wikimedia.org/T407991)
[08:53:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[08:53:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T407997)', diff saved to https://phabricator.wikimedia.org/P84855 and previous config saved to /var/cache/conftool/dbconfig/20251105-085347-marostegui.json
[08:53:50] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[08:54:49] (CR) Brouberol: [C:+1] "You'll need to uninstall the deb via cumin as well" [puppet] - https://gerrit.wikimedia.org/r/1202050 (https://phabricator.wikimedia.org/T409262) (owner: Stevemunene)
[08:56:25] (CR) Marostegui: "This is a NOOP" [puppet] - https://gerrit.wikimedia.org/r/1202051 (https://phabricator.wikimedia.org/T407991) (owner: Marostegui)
[08:56:29] (CR) Marostegui: [C:+2] db2249: Make a note about 1P testing host [puppet] - https://gerrit.wikimedia.org/r/1202051 (https://phabricator.wikimedia.org/T407991) (owner: Marostegui)
[08:58:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T407997)', diff saved to https://phabricator.wikimedia.org/P84856 and previous config saved to /var/cache/conftool/dbconfig/20251105-085844-marostegui.json
[08:59:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 15%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84857 and previous config saved to /var/cache/conftool/dbconfig/20251105-085913-root.json
[08:59:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P84858 and previous config saved to /var/cache/conftool/dbconfig/20251105-085932-root.json
[09:07:03] SRE, collaboration-services, Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11343711 (Krd) It appears the password recovery is sent from "otrs-admins@lists.wikimedia.org", which is entire nonsense. Why is that, and how can it be changed? The issue regarding the m...
[09:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:13:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P84859 and previous config saved to /var/cache/conftool/dbconfig/20251105-091352-marostegui.json
[09:14:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 20%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84860 and previous config saved to /var/cache/conftool/dbconfig/20251105-091419-root.json
[09:14:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P84861 and previous config saved to /var/cache/conftool/dbconfig/20251105-091438-root.json
[09:22:29] (PS1) Bartosz Wójtowicz: inference-services: Add revise-tone-task-generator deployment and namespace. [deployment-charts] - https://gerrit.wikimedia.org/r/1202057 (https://phabricator.wikimedia.org/T408538)
[09:29:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P84862 and previous config saved to /var/cache/conftool/dbconfig/20251105-092859-marostegui.json
[09:29:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84863 and previous config saved to /var/cache/conftool/dbconfig/20251105-092925-root.json
[09:31:12] (PS1) Muehlenhoff: Don't configure an updates file for the staging repo [puppet] - https://gerrit.wikimedia.org/r/1202064
[09:31:41] (CR) CI reject: [V:-1] Don't configure an updates file for the staging repo [puppet] - https://gerrit.wikimedia.org/r/1202064 (owner: Muehlenhoff)
[09:33:21] FIRING: [8x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:36:39] (PS2) Muehlenhoff: Don't configure an updates file for the staging repo [puppet] - https://gerrit.wikimedia.org/r/1202064
[09:41:36] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1202064 (owner: Muehlenhoff)
[09:44:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T407997)', diff saved to https://phabricator.wikimedia.org/P84864 and previous config saved to /var/cache/conftool/dbconfig/20251105-094408-marostegui.json
[09:44:10] (PS1) JavierMonton: mediawiki-event-enrichment: Deploy new version 1.43.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1202068 (https://phabricator.wikimedia.org/T408850)
[09:44:12] T407997: Drop the afl_ip column and the afl_ip_timestamp
index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:44:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:44:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 30%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84865 and previous config saved to /var/cache/conftool/dbconfig/20251105-094431-root.json [09:44:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T407997)', diff saved to https://phabricator.wikimedia.org/P84866 and previous config saved to /var/cache/conftool/dbconfig/20251105-094431-marostegui.json [09:48:06] (03CR) 10Muehlenhoff: "Thanks, one comment inline, but this looks good to me" [alerts] - 10https://gerrit.wikimedia.org/r/1201852 (owner: 10CDanis) [09:48:38] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198934 (https://phabricator.wikimedia.org/T408223) [09:49:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T407997)', diff saved to https://phabricator.wikimedia.org/P84867 and previous config saved to /var/cache/conftool/dbconfig/20251105-094926-marostegui.json [09:49:29] (03PS3) 10Muehlenhoff: Don't configure a repo sync for the staging repo [puppet] - 10https://gerrit.wikimedia.org/r/1202064 (https://phabricator.wikimedia.org/T409253) [09:49:31] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:52:45] (03CR) 10Muehlenhoff: "(PCC for P5 is broken for unrelated reasons)" [puppet] - 10https://gerrit.wikimedia.org/r/1202064 (https://phabricator.wikimedia.org/T409253) (owner: 10Muehlenhoff) [09:57:12] (03PS1) 10Marostegui: repl_prepare_schema.sh: Remove afl_ip [software] - 10https://gerrit.wikimedia.org/r/1202071 
(https://phabricator.wikimedia.org/T408780) [09:59:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 40%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84868 and previous config saved to /var/cache/conftool/dbconfig/20251105-095936-root.json [10:01:02] (03CR) 10Marostegui: [C:03+2] repl_prepare_schema.sh: Remove afl_ip [software] - 10https://gerrit.wikimedia.org/r/1202071 (https://phabricator.wikimedia.org/T408780) (owner: 10Marostegui) [10:01:42] (03Merged) 10jenkins-bot: repl_prepare_schema.sh: Remove afl_ip [software] - 10https://gerrit.wikimedia.org/r/1202071 (https://phabricator.wikimedia.org/T408780) (owner: 10Marostegui) [10:04:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P84869 and previous config saved to /var/cache/conftool/dbconfig/20251105-100434-marostegui.json [10:05:45] (03CR) 10Pmiazga: api-gateway: Make x-ratelimit response header configurable. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [10:06:22] !log disabling Puppet on buster maps nodes for pending decom T381565 [10:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:25] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [10:09:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:12:25] FIRING: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:25] (03PS2) 10Bartosz Wójtowicz: inference-services: Add revise-tone-task-generator deployment and namespace. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1202057 (https://phabricator.wikimedia.org/T408538) [10:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84870 and previous config saved to /var/cache/conftool/dbconfig/20251105-101442-root.json [10:15:35] (03CR) 10Majavah: [C:03+2] toolforge: Handle interwiki redirects in front proxy [puppet] - 10https://gerrit.wikimedia.org/r/1201847 (https://phabricator.wikimedia.org/T247432) (owner: 10BryanDavis) [10:17:25] FIRING: [10x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:19:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P84871 and previous config saved to /var/cache/conftool/dbconfig/20251105-101942-marostegui.json [10:29:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 60%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84872 and previous config saved to /var/cache/conftool/dbconfig/20251105-102948-root.json [10:34:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T407997)', diff saved to https://phabricator.wikimedia.org/P84873 and previous config saved to /var/cache/conftool/dbconfig/20251105-103449-marostegui.json [10:34:54] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - 
https://phabricator.wikimedia.org/T407997 [10:35:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:35:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T407997)', diff saved to https://phabricator.wikimedia.org/P84874 and previous config saved to /var/cache/conftool/dbconfig/20251105-103513-marostegui.json [10:40:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T407997)', diff saved to https://phabricator.wikimedia.org/P84875 and previous config saved to /var/cache/conftool/dbconfig/20251105-104010-marostegui.json [10:40:14] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:40:35] !log btullis@deploy2002 Started deploy [analytics/refinery@39e92e9]: Updating the deployment on an-launcher1003 [10:41:31] !log btullis@deploy2002 Finished deploy [analytics/refinery@39e92e9]: Updating the deployment on an-launcher1003 (duration: 01m 06s) [10:44:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1201850 (owner: 10Tim Starling) [10:44:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84876 and previous config saved to /var/cache/conftool/dbconfig/20251105-104454-root.json [10:45:41] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-launcher1002.eqiad.wmnet [10:49:15] btullis@cumin1003 decommission (PID 3886853) is awaiting input [10:50:59] (03PS1) 10DCausse: cirrus: enable default_sort on en, fr and he wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202086 (https://phabricator.wikimedia.org/T404858) [10:52:20] (03PS1) 10Elukey: Add name_filters support to the k8s browser [docker-images/docker-report] - 
10https://gerrit.wikimedia.org/r/1202087 [10:55:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P84878 and previous config saved to /var/cache/conftool/dbconfig/20251105-105517-marostegui.json [10:59:12] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Wikimedia-Mailing-lists: lists.wikimedia.org subscription email rejected by DKIM - https://phabricator.wikimedia.org/T409137#11343973 (10LSobanski) cc @jhathaway [10:59:21] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198934 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:59:59] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [11:00:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Testing Debian Trixie in es2', diff saved to https://phabricator.wikimedia.org/P84879 and previous config saved to /var/cache/conftool/dbconfig/20251105-110000-root.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1100) [11:00:32] (03CR) 10Elukey: [C:03+1] Don't configure a repo sync for the staging repo [puppet] - 10https://gerrit.wikimedia.org/r/1202064 (https://phabricator.wikimedia.org/T409253) (owner: 10Muehlenhoff) [11:01:29] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11343983 (10cmooney) >>! In T408197#11340158, @Jclark-ctr wrote: > This might be the missing free link > > https://netbox.wikimedia.org/circuits/circuit-terminations/157/trace/ > >... 
[11:04:07] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-launcher1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [11:04:44] (03PS1) 10Elukey: Remove buster-based images that are currently failing to build [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1202089 [11:04:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-launcher1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [11:04:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:04:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-launcher1002.eqiad.wmnet [11:05:55] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Decommission an-launcher1002 - https://phabricator.wikimedia.org/T353786#11343990 (10BTullis) a:05BTullis→03None [11:07:36] (03PS1) 10Btullis: Remove last reference to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1202090 (https://phabricator.wikimedia.org/T353786) [11:10:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P84880 and previous config saved to /var/cache/conftool/dbconfig/20251105-111025-marostegui.json [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:14:30] (03PS1) 10DCausse: cirrus: enable alt index with default_sort on a 
set of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202094 (https://phabricator.wikimedia.org/T404858) [11:15:02] (03CR) 10Effie Mouzeli: [C:03+2] mw-experimental: remove update lock if older than 6hrs [puppet] - 10https://gerrit.wikimedia.org/r/1200381 (owner: 10Effie Mouzeli) [11:15:24] (03CR) 10CI reject: [V:04-1] cirrus: enable alt index with default_sort on a set of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202094 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:59] (03PS2) 10DCausse: cirrus: enable alt index with default_sort on a set of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202094 (https://phabricator.wikimedia.org/T404858) [11:18:21] FIRING: [8x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:20:00] (03CR) 10JMeybohm: [C:03+2] P:conftool::hiddenparma: enable ipblock and ipblock_source policies [puppet] - 10https://gerrit.wikimedia.org/r/1201574 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [11:20:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1202089 (owner: 10Elukey) [11:25:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T407997)', diff saved to https://phabricator.wikimedia.org/P84881 and previous config saved to 
/var/cache/conftool/dbconfig/20251105-112532-marostegui.json [11:25:36] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:25:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [11:25:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T407997)', diff saved to https://phabricator.wikimedia.org/P84882 and previous config saved to /var/cache/conftool/dbconfig/20251105-112556-marostegui.json [11:30:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T407997)', diff saved to https://phabricator.wikimedia.org/P84883 and previous config saved to /var/cache/conftool/dbconfig/20251105-113053-marostegui.json [11:30:57] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:33:42] (03PS1) 10Marostegui: db1167: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202096 [11:34:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Migration [11:34:33] (03CR) 10Marostegui: [C:03+2] db1167: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202096 (owner: 10Marostegui) [11:35:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:35:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1167 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84884 and previous config saved to /var/cache/conftool/dbconfig/20251105-113522-marostegui.json [11:37:20] 06SRE, 10SRE-Access-Requests: Update to FIDO backed production SSH key for btullis - 
https://phabricator.wikimedia.org/T409279 (10BTullis) 03NEW [11:42:21] 06SRE, 06Infrastructure-Foundations, 10netops: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217#11344196 (10cmooney) Yup that seemed to fix it: ` cmooney@netmon1003:/var/log/rancid$ tail -f core.20251105.112816 starting: Wed Nov 5 11:28:16 AM UTC 2025... [11:42:25] 06SRE, 06Infrastructure-Foundations, 10netops: Rancid network backups not being synced to git properly - https://phabricator.wikimedia.org/T409217#11344199 (10cmooney) 05Open→03Resolved [11:43:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84885 and previous config saved to /var/cache/conftool/dbconfig/20251105-114311-root.json [11:45:00] (03PS1) 10Btullis: Add a FIDO backed SSH key for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1202100 (https://phabricator.wikimedia.org/T409279) [11:46:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P84886 and previous config saved to /var/cache/conftool/dbconfig/20251105-114600-marostegui.json [11:46:55] (03CR) 10Btullis: [C:03+2] Remove last reference to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1202090 (https://phabricator.wikimedia.org/T353786) (owner: 10Btullis) [11:50:05] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1202089 (owner: 10Elukey) [11:51:07] (03PS1) 10Stevemunene: Upgrade the superset-production-memcached image to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202101 (https://phabricator.wikimedia.org/T409151) [11:52:11] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1220.eqiad.wmnet with reason: Maintenance [11:54:30] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2215.codfw.wmnet with reason: Maintenance [11:54:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2215 (T403362)', diff saved to https://phabricator.wikimedia.org/P84887 and previous config saved to /var/cache/conftool/dbconfig/20251105-115437-ladsgroup.json [11:54:41] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [11:56:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), and 2 others: Decommission an-launcher1002 - https://phabricator.wikimedia.org/T353786#11344214 (10BTullis) [11:58:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84888 and previous config saved to /var/cache/conftool/dbconfig/20251105-115817-root.json [11:58:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key has also been verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1202100 (https://phabricator.wikimedia.org/T409279) (owner: 10Btullis) [11:59:17] (03PS5) 10Btullis: Update the apt components used for elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) [11:59:38] (03CR) 10Btullis: [C:03+2] Add a FIDO backed SSH key for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1202100 (https://phabricator.wikimedia.org/T409279) (owner: 10Btullis) [12:00:05] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1200). 
[12:01:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P84889 and previous config saved to /var/cache/conftool/dbconfig/20251105-120108-marostegui.json [12:10:15] (03CR) 10Ladsgroup: "Since it's private, deploying it requires some work and it should be done before tables getting created in production." [puppet] - 10https://gerrit.wikimedia.org/r/1201842 (https://phabricator.wikimedia.org/T406843) (owner: 10Samwilson) [12:12:43] (03CR) 10Clément Goubert: api-gateway: Make x-ratelimit response header configurable. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [12:13:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84890 and previous config saved to /var/cache/conftool/dbconfig/20251105-121323-root.json [12:13:35] (03CR) 10Clément Goubert: "Daniel's proposed changes seem right, LGTM provided they're implemented." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [12:16:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T407997)', diff saved to https://phabricator.wikimedia.org/P84891 and previous config saved to /var/cache/conftool/dbconfig/20251105-121616-marostegui.json [12:16:19] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:16:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1212.eqiad.wmnet with reason: Maintenance [12:16:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:16:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T407997)', diff saved to https://phabricator.wikimedia.org/P84892 and previous config saved to /var/cache/conftool/dbconfig/20251105-121647-marostegui.json [12:21:30] (03PS1) 10Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202106 (https://phabricator.wikimedia.org/T408711) [12:21:38] (03CR) 10Muehlenhoff: [C:03+2] Don't configure a repo sync for the staging repo [puppet] - 10https://gerrit.wikimedia.org/r/1202064 (https://phabricator.wikimedia.org/T409253) (owner: 10Muehlenhoff) [12:22:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T407997)', diff saved to https://phabricator.wikimedia.org/P84893 and previous config saved to /var/cache/conftool/dbconfig/20251105-122203-marostegui.json [12:22:08] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:23:54] (03CR) 10Cparle: [C:03+1] mediawiki 
tables-catalog: Add watchlist labels tables [puppet] - 10https://gerrit.wikimedia.org/r/1201842 (https://phabricator.wikimedia.org/T406843) (owner: 10Samwilson) [12:28:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84894 and previous config saved to /var/cache/conftool/dbconfig/20251105-122828-root.json [12:30:48] (03PS2) 10Samwilson: mediawiki tables-catalog: Add watchlist labels tables [puppet] - 10https://gerrit.wikimedia.org/r/1201842 (https://phabricator.wikimedia.org/T406843) [12:30:51] (03CR) 10Ladsgroup: [C:03+2] mediawiki tables-catalog: Add watchlist labels tables [puppet] - 10https://gerrit.wikimedia.org/r/1201842 (https://phabricator.wikimedia.org/T406843) (owner: 10Samwilson) [12:30:54] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki tables-catalog: Add watchlist labels tables [puppet] - 10https://gerrit.wikimedia.org/r/1201842 (https://phabricator.wikimedia.org/T406843) (owner: 10Samwilson) [12:37:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P84895 and previous config saved to /var/cache/conftool/dbconfig/20251105-123711-marostegui.json [12:37:25] FIRING: [10x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:46] (03CR) 10Btullis: "This will update *all* deployments with the new image, as soon as they are re-deployed. 
The commit message says that this is a test, so ar" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202106 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [12:40:41] (03CR) 10Btullis: [C:03+2] Update the apt components used for elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1198287 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [12:42:25] FIRING: [10x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:53] (03PS1) 10Muehlenhoff: apt-staging: Relax the cleanup for the incoming queue [puppet] - 10https://gerrit.wikimedia.org/r/1202113 (https://phabricator.wikimedia.org/T409253) [12:52:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P84896 and previous config saved to /var/cache/conftool/dbconfig/20251105-125219-marostegui.json [12:55:40] !log Deploy schema change on s3 master for vewikimedia T409282 T396130 [12:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:44] T409282: vewikimedia.abuse_filter_log doesn't have afl_ip_hex - https://phabricator.wikimedia.org/T409282 [12:55:44] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:57:37] (03PS1) 10Muehlenhoff: Remove otto from ops group [puppet] - 10https://gerrit.wikimedia.org/r/1202114 [12:59:28] (03CR) 10Muehlenhoff: [C:03+2] ganeti-ca: Adapt to change of logged clustername for the expity metric [alerts] - 10https://gerrit.wikimedia.org/r/1201066 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [12:59:58] (03CR) 10Muehlenhoff: [C:03+2] ganeti-ca-exporter: Log the cluster name as part of the metric [puppet] - 
10https://gerrit.wikimedia.org/r/1201057 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [13:04:15] (03CR) 10Stevemunene: "This is just for the `airflow-test-k8s` instance for now. The first test is from a locally checked out version of deployment charts and if" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202106 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [13:05:56] !log brouberol@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 9 hosts with reason: rebalancing [13:07:11] (03PS3) 10Bartosz Wójtowicz: inference-services: Add revise-tone-task-generator experimental deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202057 (https://phabricator.wikimedia.org/T408538) [13:07:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T407997)', diff saved to https://phabricator.wikimedia.org/P84897 and previous config saved to /var/cache/conftool/dbconfig/20251105-130726-marostegui.json [13:07:30] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:07:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1223.eqiad.wmnet with reason: Maintenance [13:07:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1223 (T407997)', diff saved to https://phabricator.wikimedia.org/P84898 and previous config saved to /var/cache/conftool/dbconfig/20251105-130750-marostegui.json [13:08:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:08:21] FIRING: [8x] JobUnavailable: Reduced 
availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:08:26] (03PS3) 10Kamila Součková: deployment-server: generate clusterinfo for helm [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) [13:08:52] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [13:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:10:30] (03PS4) 10Bartosz Wójtowicz: inference-services: Add revise-tone-task-generator experimental deployment. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1202057 (https://phabricator.wikimedia.org/T408538) [13:13:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T407997)', diff saved to https://phabricator.wikimedia.org/P84899 and previous config saved to /var/cache/conftool/dbconfig/20251105-131308-marostegui.json [13:13:12] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:13:49] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286 (10cmooney) 03NEW p:05Triage→03Low [13:14:36] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11344447 (10cmooney) [13:14:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11344448 (10cmooney) [13:14:51] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11344450 (10cmooney) [13:14:54] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11344451 (10cmooney) [13:15:15] (03CR) 10Kamila Součková: "@jmeybohm@wikimedia.org I am not done with fixing the CI (tasty!), but aside from that, do you think this approach is reasonable?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [13:16:17] (03CR) 10Kamila Součková: "(I mean the deployment-charts CI)" [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [13:23:02] (03CR) 10TChin: [C:03+1] mediawiki-event-enrichment: Deploy new version 1.43.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202068 (https://phabricator.wikimedia.org/T408850) (owner: 10JavierMonton) [13:27:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288 (10cmooney) 03NEW p:05Triage→03Low [13:27:36] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11344517 (10cmooney) [13:27:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11344516 (10cmooney) [13:28:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P84900 and previous config saved to /var/cache/conftool/dbconfig/20251105-132816-marostegui.json [13:28:25] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps2010.codfw.wmnet [13:32:50] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:34:30] (03PS1) 10Muehlenhoff: Remove historic comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202127 [13:36:41] (03CR) 10Brouberol: [C:03+1] Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202106 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [13:37:06] (03CR) 10MVernon: [C:03+2] aptrepo: add conftool-trixie [puppet] - 10https://gerrit.wikimedia.org/r/1201557 
(https://phabricator.wikimedia.org/T407513) (owner: 10MVernon) [13:38:00] (03CR) 10Kamila Součková: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1184158 (owner: 10BryanDavis) [13:38:15] (03CR) 10Elukey: [C:03+1] Upgrade the superset-production-memcached image to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202101 (https://phabricator.wikimedia.org/T409151) (owner: 10Stevemunene) [13:38:33] jmm@cumin2002 decommission (PID 3895881) is awaiting input [13:38:44] (03CR) 10Btullis: [C:03+1] "Ah yes, silly me. I thought that this was the symlink target file, but I was wrong." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202106 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [13:39:06] (03CR) 10Btullis: [C:03+1] Upgrade the superset-production-memcached image to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202101 (https://phabricator.wikimedia.org/T409151) (owner: 10Stevemunene) [13:39:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2010.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:41:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2010.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:41:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:41:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2010.codfw.wmnet [13:42:05] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11344567 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps2010.codfw.wmnet` - maps2010.codfw.wmnet (**PASS**) - Downtimed host on Ic... 
[13:42:25] FIRING: [9x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:43:12] (03PS1) 10Clément Goubert: api-gateway: Fix ratelimiter_metrics statsd mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202131 [13:43:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P84901 and previous config saved to /var/cache/conftool/dbconfig/20251105-134323-marostegui.json [13:43:53] (03CR) 10Kamila Součková: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1184157 (owner: 10BryanDavis) [13:44:20] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps2005.codfw.wmnet [13:45:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11344589 (10Jclark-ctr) a:03Jclark-ctr [13:48:30] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:48:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11344599 (10cmooney) [13:49:36] (03CR) 10JavierMonton: [V:03+2] mediawiki-event-enrichment: Deploy new version 1.43.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202068 (https://phabricator.wikimedia.org/T408850) (owner: 10JavierMonton) [13:49:51] (03CR) 10JavierMonton: [V:03+2 C:03+2] mediawiki-event-enrichment: Deploy new version 1.43.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202068 (https://phabricator.wikimedia.org/T408850) (owner: 10JavierMonton) [13:50:11] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists katesdb; (T297297) [13:50:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:14] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [13:51:44] (03Merged) 10jenkins-bot: mediawiki-event-enrichment: Deploy new version 1.43.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202068 (https://phabricator.wikimedia.org/T408850) (owner: 10JavierMonton) [13:52:06] (03CR) 10AikoChou: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202057 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [13:53:22] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:53:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:53:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:53:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2005.codfw.wmnet [13:53:57] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11344617 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps2005.codfw.wmnet` - maps2005.codfw.wmnet (**PASS**) - Downtimed host on Ic... [13:55:27] 06SRE, 06collaboration-services, 10Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11344620 (10Arnoldokoth) @Krd From what I've gathered from the docs, that's configured via the `AdminEmail` in the System Configuration. I don't have the historical context as to why it's cu... 
[13:56:37] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11344641 (10MoritzMuehlenhoff) [13:56:41] 06SRE, 10decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11344644 (10MoritzMuehlenhoff) [13:58:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T407997)', diff saved to https://phabricator.wikimedia.org/P84902 and previous config saved to /var/cache/conftool/dbconfig/20251105-135831-marostegui.json [13:58:35] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:58:44] 06SRE, 10decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11344651 (10MoritzMuehlenhoff) [13:58:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1400). nyaa~ [14:00:05] codders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:15] it is! I'm here o/ [14:00:18] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps2006.codfw.wmnet [14:00:31] (03CR) 10JMeybohm: "So your plan is to include this new values file in all the helmfile's we have, right? That sounds like a plan (if it works :)). What I wou" [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:00:37] who's running the window? 
Should I try self-service with spiderpig? [14:01:16] o/ [14:01:25] I’m in a meeting, I can do it afterwards (ca. 14:30 UTC) [14:01:35] (03PS1) 10Jforrester: Enable embedded Wikifunctions calls on bnwiki and seven Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202143 (https://phabricator.wikimedia.org/T406342) [14:01:50] I would be interested to try doing it myself, but happy to wait until 14h30 [14:02:07] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1220* gradually with 4 steps - Work done [14:02:12] (03PS5) 10Bartosz Wójtowicz: inference-services: Add revise-tone-task-generator experimental deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202057 (https://phabricator.wikimedia.org/T408538) [14:04:38] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:05:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:05:15] (03PS1) 10Abijeet Patro: Remove SpecialContributeSkinsEnabled for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) [14:09:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2215 (T403362)', diff saved to https://phabricator.wikimedia.org/P84904 and previous config saved to /var/cache/conftool/dbconfig/20251105-140934-ladsgroup.json [14:09:38] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [14:10:26] jmm@cumin2002 decommission (PID 3902625) is awaiting input [14:11:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db1189.eqiad.wmnet with reason: Maintenance [14:12:25] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists jamestemp; (T297297) [14:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:28] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [14:12:45] (03CR) 10MVernon: "No technical qualms here, but people seem quite touchy about thumbnail sizes, and 260 was particularly requested in T215106; so should we " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup) [14:13:54] o/ now I’m free ^^ [14:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:25] codders: feel free to start the spiderpig [14:14:29] do you have shell access too? [14:14:44] i do. should I follow the logspam? 
[14:14:52] yup, exactly [14:14:58] (I've never done this before for-realz, so I'll just be following the instructions) [14:15:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance [14:15:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T407997)', diff saved to https://phabricator.wikimedia.org/P84905 and previous config saved to /var/cache/conftool/dbconfig/20251105-141507-marostegui.json [14:15:11] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:15:30] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable the MEX / wbui2025 beta feature on testwikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [14:15:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11344756 (10MatthewVernon) :( IME the iDRAC basically never notices a bad disk. The kernel log above (and the Media error reported by perccli64) are all the errors I have. If t... [14:16:39] okay. well that failed fast [14:17:00] *looks* [14:17:25] we need to backport the feature to wmf.24? 
[14:17:32] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [14:17:35] I guess that means https://phabricator.wikimedia.org/T397931 is working again, hooray [14:17:41] FIRING: [21x] ConfdResourceFailed: confd resource _etc_haproxy_ipblocks.d_all.map.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:17:56] that’s a pretty huge change to backport though [14:17:59] (03CR) 10Ladsgroup: "When it was requested, the default thumbnail size was 220px. Now it's 250px which is much closer to the community's wish. Plus we recently" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201584 (https://phabricator.wikimedia.org/T408715) (owner: 10Ladsgroup) [14:18:09] yeah. I thought the whole thing would have gone out last week [14:18:18] I think it would be reasonable to drop the Depends-On (or change it to some other trailer name that scap doesn’t check for) [14:18:35] and rely on the comment being present on testwikidatawiki (group0) [14:18:56] even if the train gets rolled back it shouldn’t be a problem, worst case the beta feature will confusingly do nothing on one test wiki [14:19:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2085.codfw.wmnet with OS bullseye [14:19:32] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1202146 [14:19:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11344762 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2085.codfw.wmnet with OS bullseye [14:19:43] ah. 
because scap wants this to reach group2 [14:19:48] got it [14:19:57] yeah, scap doesn’t know that it only affects one wiki [14:20:18] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable the MEX / wbui2025 beta feature on testwikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [14:20:55] why does the message say it's not present in wmf.24? seems like the wikis are all at wmf.25 or later [14:21:12] uh [14:21:21] oh, huh [14:21:28] I thought the change was merged *this* monday [14:21:29] not last monday [14:21:44] then that seems… wrong? I think? [14:22:13] why is wmf.24 “deployable” [14:22:19] because it hasn’t been wiped from /srv/mediawiki-staging yet? [14:22:41] FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_ipblocks.d_all.map.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:22:50] ^ we are looking at this [14:22:57] (03PS4) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) [14:24:13] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.sanitarium_restart [14:24:13] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.mysql.sanitarium_restart (exit_code=99) [14:24:21] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.sanitarium_restart [14:24:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2215', diff saved to https://phabricator.wikimedia.org/P84907 and previous config saved to /var/cache/conftool/dbconfig/20251105-142441-ladsgroup.json [14:25:09] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-10-28-150053 to 2025-11-05-063501 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202148 
(https://phabricator.wikimedia.org/T406625) [14:25:17] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-10-28-205854 to 2025-11-04-215809 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202149 (https://phabricator.wikimedia.org/T406625) [14:25:30] let me ask on the task [14:26:18] (03CR) 10Kamila Součková: "Potentially, yes. I'm focusing on mw-\* right now, but nothing is preventing including it everywhere. Then we could eventually also remove" [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:26:22] I've updated the patch to drop the 'Depends-On', so we're anyway unblocked [14:26:25] should I retry? [14:26:46] (03CR) 10Ottomata: "Thanks! Could I also get access to:" [puppet] - 10https://gerrit.wikimedia.org/r/1202114 (owner: 10Muehlenhoff) [14:26:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11344803 (10Jclark-ctr) @cmooney lswtest1 Racked / cabled /. Netbox has been updated with Temp cableid https://netbox.wikimedia.org/dcim... 
[14:27:06] (03PS1) 10Ladsgroup: mysql: Rename cookbooks to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 [14:27:10] * Lucas_WMDE looks [14:27:35] looks good to me, let’s just wait for the diffConfig [14:27:41] RESOLVED: [112x] ConfdResourceFailed: confd resource _etc_haproxy_ipblocks.d_all.map.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:27:43] ok https://integration.wikimedia.org/ci/job/operations-mw-config-php81-composer-diffConfig/2457/console looks nice and tidy [14:27:48] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [14:27:52] good to go [14:27:57] I'll try... [14:28:09] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1202146 (owner: 10Elukey) [14:28:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arthurtaylor@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [14:28:56] (03PS1) 10CDanis: tcpproxy: haproxy: make stats work [puppet] - 10https://gerrit.wikimedia.org/r/1202152 (https://phabricator.wikimedia.org/T408532) [14:29:17] (03CR) 10Mahmoud-abdelsattar: [C:03+1] Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [14:29:52] (03PS1) 10Elukey: Upstream release v12.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1202153 [14:29:57] (03Merged) 10jenkins-bot: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [14:30:01] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/6 UP : 7 v2 P2P interfaces vs. 6 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:30:02] (03CR) 10CDanis: [C:03+2] "tested by hand on tcpproxy1001, which prom scraped successfully" [puppet] - 10https://gerrit.wikimedia.org/r/1202152 (https://phabricator.wikimedia.org/T408532) (owner: 10CDanis) [14:30:06] (03CR) 10Muehlenhoff: "Let me clarify that, I'm not sure if approvals are needed, we never really had the case where priviliges were reduced from ops." [puppet] - 10https://gerrit.wikimedia.org/r/1202114 (owner: 10Muehlenhoff) [14:30:09] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1202153 (owner: 10Elukey) [14:30:09] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:30:46] !log arthurtaylor@deploy2002 Started scap sync-world: Backport for [[gerrit:1197613|Enable the MEX / wbui2025 beta feature on testwikidata (T407737)]] [14:30:48] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:30:49] T407737: [MEX] Add mobile editing for statments on Test Wikidata - https://phabricator.wikimedia.org/T407737 [14:30:55] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:31:31] (by the way, logspam-watch isn’t *usually* as busy as it currently is, the fix for T408052 should hopefully improve that soon) [14:31:31] T408052: PHP Warning: Trying to access array offset on value of type null (via GrowthExperiments listTaskCounts) - https://phabricator.wikimedia.org/T408052 [14:31:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 
2:00:00 on ms-be2085.codfw.wmnet with reason: host reimage [14:31:56] yeah. was wondering about that [14:32:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:32:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T407997)', diff saved to https://phabricator.wikimedia.org/P84908 and previous config saved to /var/cache/conftool/dbconfig/20251105-143215-marostegui.json [14:32:19] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:32:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:32:49] (03PS1) 10Marostegui: db1209: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202154 [14:33:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1034.eqiad.wmnet - https://phabricator.wikimedia.org/T409025#11344849 (10Jclark-ctr) [14:33:15] !log arthurtaylor@deploy2002 arthurtaylor: Backport for [[gerrit:1197613|Enable the MEX / wbui2025 beta feature on testwikidata (T407737)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:33:26] (03CR) 10Marostegui: [C:03+2] db1209: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202154 (owner: 10Marostegui) [14:33:27] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1034.eqiad.wmnet - https://phabricator.wikimedia.org/T409025#11344860 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:33:45] testing now... [14:34:03] (03CR) 10CI reject: [V:04-1] mysql: Rename cookbooks to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [14:34:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1209.eqiad.wmnet with reason: Maintenance [14:34:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1209 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84910 and previous config saved to /var/cache/conftool/dbconfig/20251105-143419-marostegui.json [14:34:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Decommission an-launcher1002 - https://phabricator.wikimedia.org/T353786#11344872 (10Jclark-ctr) [14:34:43] hmpf. I don't actually see the feature in beta features on mwdebug/test.wikidata.org [14:34:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Decommission an-launcher1002 - https://phabricator.wikimedia.org/T353786#11344876 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:34:51] @Lucas_WMDE can you confirm? 
[14:35:06] (03CR) 10Elukey: [C:03+1] Remove historic comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202127 (owner: 10Muehlenhoff) [14:35:11] looksing [14:35:12] *looking [14:35:29] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11344878 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon The necessary packages are now all available, and ms-be1088 managed to run puppet OK as a t... [14:35:31] (tricksy hobbitses) [14:35:41] hm, I only see Improved Syntax Highlighting [14:35:45] same [14:35:50] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0) [14:35:57] I wonder if it has a cache somewhere [14:36:13] * Lucas_WMDE looks at some docs [14:36:42] I tried logging out and in, but that didn't change anything [14:37:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:38:20] does test.wikidata also get a debug server? [14:38:21] FIRING: [8x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:35] yes, that should work there [14:38:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2085.codfw.wmnet with reason: host reimage [14:38:55] aha! https://test.wikidata.org/w/api.php?action=query&format=json&list=betafeatures&formatversion=2 returns it [14:39:14] yay? 
[14:39:29] so why is it missing visually… [14:39:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2215', diff saved to https://phabricator.wikimedia.org/P84911 and previous config saved to /var/cache/conftool/dbconfig/20251105-143949-ladsgroup.json [14:40:36] ok, on https://test.wikidata.org/wiki/Special:Preferences?uselang=qqx#mw-prefsection-betafeatures, the count in betafeatures-section-desc changes between 1 and 2 [14:40:40] so it’s still being counted [14:40:56] / Check if feature is in the allow list [14:41:02] .// Check if feature is in the allow list [14:41:05] an allow list, huh [14:41:09] there's an allow list? [14:41:13] til [14:41:27] yup wgBetaFeaturesAllowList [14:41:31] / DO NOT add entries here without OK from Greg Grossmeier or James Forrester. [14:41:35] happy times [14:41:39] okay. sooooo [14:41:50] rollback? Or deploy the harmless change and have another patch for the allow list? [14:41:53] !log uploaded spicerack_12.0.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [14:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1209 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84912 and previous config saved to /var/cache/conftool/dbconfig/20251105-144158-root.json [14:42:03] I would say rollback for now [14:42:17] I just click 'no' to 'continue with sync'? [14:42:29] no, you’ll also need to do other stuff after that [14:42:31] but start with that [14:42:34] !log arthurtaylor@deploy2002 Sync cancelled. 
[14:42:42] and then I need to find out if it has support for creating revert commits directly… [14:42:46] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1202157 (https://phabricator.wikimedia.org/T409299) [14:43:15] scap backport has a --revert flag but I don’t see an option for it in SpiderPig [14:43:17] doesn't prompt me, at least [14:43:17] * Lucas_WMDE searches phab [14:43:30] T396106 [14:43:31] T396106: spiderpig should give the revert procedure after a canceled deployment - https://phabricator.wikimedia.org/T396106 [14:44:02] so I think you can either run `scap backport --revert 1197613` on the deployment server manually [14:44:09] or submit a revert change on Gerrit and then deploy that as a normal Spiderpig [14:44:26] which would you recommend? [14:44:28] I think I’d go with the latter [14:44:31] (03CR) 10Scott French: [C:03+1] "+1 to dropping the php7.4 images. Thank you!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1202089 (owner: 10Elukey) [14:44:36] k. 
will try that [14:45:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:45:21] (PS1) Arthur taylor: Revert "Enable the MEX / wbui2025 beta feature on testwikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202160 [14:45:25] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:45:28] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1202160 [14:45:47] (CR) Lucas Werkmeister (WMDE): [C:+1] Revert "Enable the MEX / wbui2025 beta feature on testwikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202160 (owner: Arthur taylor) [14:45:48] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:45:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202160 (owner: Arthur taylor) [14:45:51] LGTM [14:45:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:46:04] k.
i'll deploy it [14:46:20] oh, wait [14:46:34] (PS2) Lucas Werkmeister (WMDE): Revert "Enable the MEX / wbui2025 beta feature on testwikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202160 (https://phabricator.wikimedia.org/T407737) (owner: Arthur taylor) [14:46:44] I just ninja-edited the bug ID in there so it’s attached [14:46:46] (CR) TrainBranchBot: [C:+2] "Approved by arthurtaylor@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202160 (https://phabricator.wikimedia.org/T407737) (owner: Arthur taylor) [14:46:50] thanks :) [14:47:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P84913 and previous config saved to /var/cache/conftool/dbconfig/20251105-144723-marostegui.json [14:47:34] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1220* gradually with 4 steps - Work done [14:47:38] (Merged) jenkins-bot: Revert "Enable the MEX / wbui2025 beta feature on testwikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202160 (https://phabricator.wikimedia.org/T407737) (owner: Arthur taylor) [14:48:12] !log arthurtaylor@deploy2002 Started scap sync-world: Backport for [[gerrit:1202160|Revert "Enable the MEX / wbui2025 beta feature on testwikidata" (T407737)]] [14:48:14] T407737: [MEX] Add mobile editing for statments on Test Wikidata - https://phabricator.wikimedia.org/T407737 [14:48:19] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:48:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:50:36] !log arthurtaylor@deploy2002 arthurtaylor: Backport for [[gerrit:1202160|Revert "Enable the MEX / wbui2025 beta feature on testwikidata" (T407737)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug).
Changes can now be verified there. [14:51:41] k. the beta feature seems to be gone again. Should I hit yes to sync? Or is it enough to hit no at this point [14:51:59] I’d still hit yes [14:52:08] shouldn’t make a difference but I think it looks less confusing in the spiderpig overview ^^ [14:52:13] fair. [14:52:18] !log arthurtaylor@deploy2002 arthurtaylor: Continuing with sync [14:53:27] SRE, collaboration-services, Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11344986 (Krd) Thank you. I will change it. [14:54:12] (CR) Lucas Werkmeister (WMDE): [C:+1] Enable the MEX / wbui2025 beta feature on testwikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: Arthur taylor) [14:54:51] SRE, collaboration-services, Infrastructure-Foundations, vrts, Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11345006 (Arnoldokoth) @Xaosflux Perhaps this is a bug... From what I can see, some counts are accurate while others aren't. Also, is it pos... [14:54:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2215 (T403362)', diff saved to https://phabricator.wikimedia.org/P84915 and previous config saved to /var/cache/conftool/dbconfig/20251105-145457-ladsgroup.json [14:55:01] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [14:55:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2085.codfw.wmnet with OS bullseye [14:55:31] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11345026 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2085.codfw.wmnet with OS bullseye complete...
[14:56:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2086.codfw.wmnet with OS bullseye [14:56:35] !log arthurtaylor@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202160|Revert "Enable the MEX / wbui2025 beta feature on testwikidata" (T407737)]] (duration: 08m 23s) [14:56:39] T407737: [MEX] Add mobile editing for statments on Test Wikidata - https://phabricator.wikimedia.org/T407737 [14:56:42] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11345027 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2086.codfw.wmnet with OS bullseye [14:57:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1209 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84916 and previous config saved to /var/cache/conftool/dbconfig/20251105-145704-root.json [14:57:18] SRE, collaboration-services, Infrastructure-Foundations, vrts, Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11345031 (Krd) Please provide actual examples which counts are not accurate. I cannot reproduce it. [14:59:08] okay. well, that was fun. I'm clocking off again, but I seem to have left everything unbroken. Thanks for the support Lucas_WMDE ! [14:59:41] …yay!
^^ [14:59:52] congrats on your first two deployments and hopefully better luck next time :D [14:59:56] (CR) Elukey: [V:+2 C:+2] Remove buster-based images that are currently failing to build [docker-images/production-images] - https://gerrit.wikimedia.org/r/1202089 (owner: Elukey) [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1500) [15:00:06] (CR) Elukey: [V:+2 C:+2] golang: fix golang1.24 warning while running docker-pkg [docker-images/production-images] - https://gerrit.wikimedia.org/r/1201714 (owner: Elukey) [15:00:49] (CR) Wangombe: "This patch seems to have an empty change log." [mediawiki-config] - https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: Abijeet Patro) [15:01:20] (PS1) CDanis: tcpproxy: haproxy: listen on v4+v6 for both ports [puppet] - https://gerrit.wikimedia.org/r/1202163 (https://phabricator.wikimedia.org/T408532) [15:01:43] Okie-dokie.
[15:02:01] (PS1) Lucas Werkmeister (WMDE): Enable the MEX / wbui2025 beta feature on testwikidata (v2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) [15:02:19] !log UTC afternoon backport+config window done [15:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:21] (btw) [15:02:31] (CR) Lucas Werkmeister (WMDE): [C:-2] "requires Foundation approval first" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) (owner: Lucas Werkmeister (WMDE)) [15:02:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P84917 and previous config saved to /var/cache/conftool/dbconfig/20251105-150230-marostegui.json [15:02:56] (PS1) Elukey: Drop python3-buster from images [docker-images/production-images] - https://gerrit.wikimedia.org/r/1202165 [15:03:13] (CR) Jforrester: [C:+2] wikifunctions: Upgrade evaluators from 2025-10-28-150053 to 2025-11-05-063501 [deployment-charts] - https://gerrit.wikimedia.org/r/1202148 (https://phabricator.wikimedia.org/T406625) (owner: Jforrester) [15:04:31] SRE, collaboration-services, Znuny: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11345061 (Krd) I think I have changed the setting and deployed it, but it still shows the old value, and now cannot be edited. Please check what is wrong. Please change it to: volunteers-v...
[15:04:34] (PS2) Abijeet Patro: Remove SpecialContributeSkinsEnabled for special wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) [15:04:39] SRE-SLO, Experimentation Lab (Experiment Platform Sprint 14), OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11345064 (elukey) Resolved→Open Let's keep it open to discuss the aforementioned metric issues :) [15:04:51] (CR) Abijeet Patro: "Oops, I forgot to publish. Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: Abijeet Patro) [15:05:06] SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to ops-limited for blake - https://phabricator.wikimedia.org/T409166#11345069 (Kappakayala) Approved! [15:05:12] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2025-10-28-150053 to 2025-11-05-063501 [deployment-charts] - https://gerrit.wikimedia.org/r/1202148 (https://phabricator.wikimedia.org/T406625) (owner: Jforrester) [15:05:18] (CR) CI reject: [V:-1] Remove SpecialContributeSkinsEnabled for special wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: Abijeet Patro) [15:06:35] (CR) Lucas Werkmeister (WMDE): [C:-2] Enable the MEX / wbui2025 beta feature on testwikidata (v2) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) (owner: Lucas Werkmeister (WMDE)) [15:06:58] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:07:39] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:07:52] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:08:21] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:37] (CR) Herron: [C:+1] prometheus: split targets into directories by source [puppet] - https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: Cwhite) [15:08:56] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:09:17] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:09:18] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2086.codfw.wmnet with reason: host reimage [15:09:51] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:10:04] (PS1) CDanis: turnilo: add x-i-b [puppet] - https://gerrit.wikimedia.org/r/1202170 [15:10:04] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:10:10] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:10:12] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:10:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:10:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:10:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2006.codfw.wmnet [15:10:27] (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2025-10-28-205854 to 2025-11-04-215809 [deployment-charts] -
https://gerrit.wikimedia.org/r/1202149 (https://phabricator.wikimedia.org/T406625) (owner: Jforrester) [15:10:30] SRE, decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11345107 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps2006.codfw.wmnet` - maps2006.codfw.wmnet (**PASS**) - Downti... [15:11:06] (CR) Vgutierrez: [C:+1] turnilo: add x-i-b [puppet] - https://gerrit.wikimedia.org/r/1202170 (owner: CDanis) [15:11:32] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps2007.codfw.wmnet [15:11:41] SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11345109 (herron) [15:11:44] (CR) CDanis: [C:+2] turnilo: add x-i-b [puppet] - https://gerrit.wikimedia.org/r/1202170 (owner: CDanis) [15:12:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1209 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84918 and previous config saved to /var/cache/conftool/dbconfig/20251105-151210-root.json [15:12:44] (CR) Muehlenhoff: [C:+1] "LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1202165 (owner: Elukey) [15:12:50] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-10-28-205854 to 2025-11-04-215809 [deployment-charts] - https://gerrit.wikimedia.org/r/1202149 (https://phabricator.wikimedia.org/T406625) (owner: Jforrester) [15:13:15] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:13:38] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:13:54] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:13:59] (PS1) JHathaway: vrts: update admin email [puppet] - https://gerrit.wikimedia.org/r/1202171
(https://phabricator.wikimedia.org/T408967) [15:14:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2086.codfw.wmnet with reason: host reimage [15:14:22] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:14:29] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:15:01] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:15:18] (CR) Elukey: [V:+2 C:+2] Drop python3-buster from images [docker-images/production-images] - https://gerrit.wikimedia.org/r/1202165 (owner: Elukey) [15:15:30] SRE, collaboration-services, Znuny, Patch-For-Review: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967#11345124 (jhathaway) >>! In T408967#11345061, @Krd wrote: > I think I have changed the setting and deployed it, but it still shows the old value, and now cannot be ed... [15:15:35] (CR) Jforrester: [C:+2] Enable embedded Wikifunctions calls on bnwiki and seven Wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1202143 (https://phabricator.wikimedia.org/T406342) (owner: Jforrester) [15:16:29] (Merged) jenkins-bot: Enable embedded Wikifunctions calls on bnwiki and seven Wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1202143 (https://phabricator.wikimedia.org/T406342) (owner: Jforrester) [15:16:39] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:17:16] SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310 (herron) NEW [15:17:33] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1202143|Enable embedded Wikifunctions calls on bnwiki and seven Wiktionaries (T406342)]] [15:17:35] T406342: [26Q2] Wikifunctions Rollout - https://phabricator.wikimedia.org/T406342 [15:17:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling
after maintenance db2149 (T407997)', diff saved to https://phabricator.wikimedia.org/P84919 and previous config saved to /var/cache/conftool/dbconfig/20251105-151738-marostegui.json [15:17:42] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:17:44] SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11345161 (herron) ` # tonecheck.yml version: "prometheus/v1" service: "tonecheck" labels: owner: "sre" slos: - name: "tone-check-availability" objective: 95.0 description: "Tone check pre save che... [15:17:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance [15:18:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84920 and previous config saved to /var/cache/conftool/dbconfig/20251105-151802-marostegui.json [15:18:06] SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11345164 (herron) ` # editcheck.yml version: "prometheus/v1" service: "edit-check" labels: owner: "sre" slos: - name: "edit-check-pre-save-checks-ratio" objective: 99.0 description: "Edit check pr... [15:19:28] (PS2) CDanis: tcpproxy: haproxy: listen on v4+v6 for both ports [puppet] - https://gerrit.wikimedia.org/r/1202163 (https://phabricator.wikimedia.org/T408532) [15:19:28] (PS1) CDanis: tcpproxy: haproxy: log level change to info [puppet] - https://gerrit.wikimedia.org/r/1202172 (https://phabricator.wikimedia.org/T408532) [15:19:57] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1202143|Enable embedded Wikifunctions calls on bnwiki and seven Wiktionaries (T406342)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:20:21] (CR) Alexandros Kosiaris: [C:+1] wikikube: Add wikikube-ctrl200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1195350 (https://phabricator.wikimedia.org/T390861) (owner: Jasmine) [15:20:52] !log jforrester@deploy2002 jforrester: Continuing with sync [15:21:14] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2007.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:22:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2007.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:22:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2007.codfw.wmnet [15:22:51] SRE, decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11345192 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps2007.codfw.wmnet` - maps2007.codfw.wmnet (**PASS**) - Downti...
[15:24:07] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps2008.codfw.wmnet [15:24:43] SRE-SLO: Sloth: adapt default month view to quarter view - https://phabricator.wikimedia.org/T409312 (herron) NEW [15:24:59] (PS7) Btullis: Change the component from where we install elasticsearch-curator [puppet] - https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) [15:25:04] (CR) Elukey: [C:+1] apt-staging: Relax the cleanup for the incoming queue [puppet] - https://gerrit.wikimedia.org/r/1202113 (https://phabricator.wikimedia.org/T409253) (owner: Muehlenhoff) [15:25:41] (CR) Vgutierrez: "maybe this isn't relevant here but you will see IPv4‑mapped IPv6 addresses for IPv4 clients like `::ffff:127.0.0.1`" [puppet] - https://gerrit.wikimedia.org/r/1202163 (https://phabricator.wikimedia.org/T408532) (owner: CDanis) [15:25:51] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:26:01] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:26:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:26:29] (PS4) Clément Goubert: rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) (owner: Daniel Kinzler) [15:26:29] (PS2) Clément Goubert: api-gateway: improve metrics mapping [deployment-charts] - https://gerrit.wikimedia.org/r/1201599 (https://phabricator.wikimedia.org/T409173) (owner: Daniel Kinzler) [15:26:31] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:27:07] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202143|Enable embedded Wikifunctions calls on bnwiki and seven Wiktionaries (T406342)]]
(duration: 09m 35s) [15:27:10] T406342: [26Q2] Wikifunctions Rollout - https://phabricator.wikimedia.org/T406342 [15:27:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1209 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84921 and previous config saved to /var/cache/conftool/dbconfig/20251105-152716-root.json [15:27:28] (PS1) Filippo Giunchedi: pontoon: improve UX during create-hosts errors [puppet] - https://gerrit.wikimedia.org/r/1202174 [15:27:46] (CR) Btullis: [C:+2] Change the component from where we install elasticsearch-curator [puppet] - https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: Btullis) [15:27:47] jmm@cumin2002 decommission (PID 3922775) is awaiting input [15:27:57] Deploy done. [15:29:16] (PS1) Muehlenhoff: Remove old maps nodes from site.pp and Hiera [puppet] - https://gerrit.wikimedia.org/r/1202176 (https://phabricator.wikimedia.org/T381565) [15:29:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:30:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:30:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1500) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1530) [15:30:39] (CR) AOkoth: [C:+1] vrts: update admin email [puppet] - https://gerrit.wikimedia.org/r/1202171 (https://phabricator.wikimedia.org/T408967) (owner: JHathaway) [15:30:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2086.codfw.wmnet with OS bullseye [15:30:56] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11345231
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2086.codfw.wmnet with OS bullseye complete... [15:31:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2087.codfw.wmnet with OS bullseye [15:31:51] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11345236 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2087.codfw.wmnet with OS bullseye [15:31:54] SRE-SLO: Sloth: adapt default month view to quarter view - https://phabricator.wikimedia.org/T409312#11345239 (herron) Off hand the sloth detail dashboards "month error budget burn chart" panel uses Grafana built-ins in the "relative time" and "time shift" to fix the panel on the current month. {F69907906... [15:32:31] (CR) Clément Goubert: [C:+2] rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) (owner: Daniel Kinzler) [15:32:46] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:32:48] (CR) Clément Goubert: [C:+2] "Merging as is, we'll rename the mapping when we rename the variable."
[deployment-charts] - https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) (owner: Daniel Kinzler) [15:33:21] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:45] (Merged) jenkins-bot: rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) (owner: Daniel Kinzler) [15:35:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84922 and previous config saved to /var/cache/conftool/dbconfig/20251105-153508-marostegui.json [15:35:10] (PS1) CDanis: turnilo: x-i-b: cast to NUMBER with Plywood [puppet] - https://gerrit.wikimedia.org/r/1202178 [15:35:12] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:36:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:38:16] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:38:28] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:38:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: maps2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:38:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:34] !log
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts maps2008.codfw.wmnet [15:38:45] SRE, decommission-hardware: decommission maps2005/maps2006/maps2007/maps2008/maps2009/map2010 - https://phabricator.wikimedia.org/T409291#11345253 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `maps2008.codfw.wmnet` - maps2008.codfw.wmnet (**PASS**) - Downti... [15:39:24] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [15:41:24] Lucas_WMDE: hah, I guess it's time to update that comment :) [15:42:04] if we should be pinging someone else, feel free to let me know ;) [15:42:13] (we’re currently trying to figure out who’ll reach out to whom on what platform :D) [15:42:56] or do you mean nobody’s approval is necessary anymore? ^^ [15:43:16] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:43:29] Lucas_WMDE: T352825 says James needs to approve [15:43:29] T352825: Transfer product ownership of production Beta Features to an actual product person - https://phabricator.wikimedia.org/T352825 [15:43:38] SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11345290 (herron) ` # xlab.yml version: "prometheus/v1" service: "xlab" labels: owner: "sre" slos: - name: "xlab-standalone-event-validation-success-rate" objective: 95 description: "xlab standalo... [15:43:42] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:43:42] Lucas_WMDE: Yes, you can still ping me until someone from Product takes over.
[15:43:55] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:44:02] ok, thanks [15:44:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2087.codfw.wmnet with reason: host reimage [15:44:25] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:44:47] jmm@cumin2002 decommission (PID 3927614) is awaiting input [15:44:52] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:45:06] cmooney@cumin1003 netbox (PID 3949948) is awaiting input [15:45:13] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:45:18] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:45:22] !log running racadm racreset on maps2009 [15:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:45:39] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:45:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:45:45] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ipv6 reverse dns for nl-ix port marseille - cmooney@cumin1003" [15:45:46] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:45:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ipv6 reverse dns for nl-ix port marseille - cmooney@cumin1003" [15:45:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:53] (PS1) Lucas Werkmeister (WMDE): Update BetaFeatures comments [mediawiki-config] -
https://gerrit.wikimedia.org/r/1202180 [15:46:08] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:46:25] (CR) Lucas Werkmeister (WMDE): "Does this look okay?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202180 (owner: Lucas Werkmeister (WMDE)) [15:47:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2087.codfw.wmnet with reason: host reimage [15:47:29] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pfw1-codfw with reason: pfw1a/b-codfw [15:48:32] (PS4) Pmiazga: api-geteway: rename symbols used in restgw ratelimiter [deployment-charts] - https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) [15:48:32] (CR) Pmiazga: api-geteway: rename symbols used in restgw ratelimiter (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: Pmiazga) [15:50:04] (CR) Hnowlan: [C:+1] api-gateway: improve metrics mapping [deployment-charts] - https://gerrit.wikimedia.org/r/1201599 (https://phabricator.wikimedia.org/T409173) (owner: Daniel Kinzler) [15:50:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P84924 and previous config saved to /var/cache/conftool/dbconfig/20251105-155015-marostegui.json [15:50:51] (PS1) Btullis: Update the wikimedia-opensearch apt repository signed-by for bullseye [puppet] - https://gerrit.wikimedia.org/r/1202183 (https://phabricator.wikimedia.org/T407199) [15:51:34] (PS3) Clément Goubert: api-gateway: improve metrics mapping [deployment-charts] - https://gerrit.wikimedia.org/r/1201599 (https://phabricator.wikimedia.org/T409173) (owner: Daniel Kinzler) [15:52:34] (CR) Cwhite: [C:+1] "LGTM!"
[puppet] - 10https://gerrit.wikimedia.org/r/1202183 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [15:53:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1202176 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:53:47] (03CR) 10Btullis: [C:03+2] Update the wikimedia-opensearch apt repository signed-by for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1202183 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [15:54:37] (03CR) 10Btullis: [V:03+1 C:03+2] "PCC SUCCESS (CORE_DIFF 4 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1202183 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [15:55:30] (03CR) 10Vgutierrez: [C:03+1] turnilo: x-i-b: cast to NUMBER with Plywood [puppet] - 10https://gerrit.wikimedia.org/r/1202178 (owner: 10CDanis) [15:57:30] (03CR) 10Clément Goubert: "Can you rebase on master, bump `Chart.yaml` version, and rename the metric mapping in `charts/api-gateway/config/ratelimiter_metrics.yaml`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [15:58:19] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:58:20] (03CR) 10Clément Goubert: [C:03+2] api-gateway: improve metrics mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201599 (https://phabricator.wikimedia.org/T409173) (owner: 10Daniel Kinzler) [15:58:28] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:58:50] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11345346 (10herron) [15:58:59] (03PS1) 10Brouberol: airflow: define a network policy specific to task pods requiring egress to a proxy [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1202186 (https://phabricator.wikimedia.org/T408238) [15:59:00] (03PS1) 10Brouberol: airflow-platform-eng: enabled jobs properly labeled to egress to the urldownloader proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202187 (https://phabricator.wikimedia.org/T408238) [15:59:41] (03CR) 10Clément Goubert: [C:03+2] README: pre-commit hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [16:00:14] (03Merged) 10jenkins-bot: api-gateway: improve metrics mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201599 (https://phabricator.wikimedia.org/T409173) (owner: 10Daniel Kinzler) [16:00:23] (03PS1) 10Marostegui: migration1011.sh: Switch to use the depool/pool cookbook [software] - 10https://gerrit.wikimedia.org/r/1202190 [16:00:30] (03CR) 10CI reject: [V:04-1] airflow: define a network policy specific to task pods requiring egress to a proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202186 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [16:00:36] !log ongoing pfw1b-codfw Junos downgrade [16:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:44] (03CR) 10CI reject: [V:04-1] airflow-platform-eng: enabled jobs properly labeled to egress to the urldownloader proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202187 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [16:01:31] (03Merged) 10jenkins-bot: README: pre-commit hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [16:02:02] (03PS2) 10Marostegui: migration1011.sh: Switch to use the depool/pool cookbook [software] - 10https://gerrit.wikimedia.org/r/1202190 [16:02:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:02:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:04:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2087.codfw.wmnet with OS bullseye [16:04:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11345374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2087.codfw.wmnet with OS bullseye complete... [16:04:36] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [16:04:38] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [16:05:14] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [16:05:19] (03CR) 10Federico Ceratto: [C:03+1] "As discussed on IRC" [software] - 10https://gerrit.wikimedia.org/r/1202190 (owner: 10Marostegui) [16:05:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P84926 and previous config saved to /var/cache/conftool/dbconfig/20251105-160523-marostegui.json [16:05:38] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [16:06:01] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [16:06:15] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [16:06:41] (03CR) 10Marostegui: [C:03+2] migration1011.sh: Switch to use the depool/pool cookbook [software] - 10https://gerrit.wikimedia.org/r/1202190 (owner: 10Marostegui) [16:07:07] (03Merged) 10jenkins-bot: migration1011.sh: Switch 
to use the depool/pool cookbook [software] - 10https://gerrit.wikimedia.org/r/1202190 (owner: 10Marostegui) [16:07:39] FIRING: [5x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:07:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:0 (Core: pfw1-codfw:xe-7/2/0 {#11923_12249-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:07:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5b-codfw:et-0/0/47 (Core: pfw1-codfw:et-7/1/0 {#122515}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:07:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5b-codfw:et-0/0/47 (Core: pfw1-codfw:et-7/1/0 {#122515}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:08:05] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:08:07] huh [16:08:13] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:08:21] topranks: expected? 
[16:09:26] (03CR) 10JHathaway: [C:03+2] vrts: update admin email [puppet] - 10https://gerrit.wikimedia.org/r/1202171 (https://phabricator.wikimedia.org/T408967) (owner: 10JHathaway) [16:09:43] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#11345384 (10Andrew) >>! In T376400#10834875, @taavi wrote: >>>! In T376400#10834856, @Andrew wrote: >> Can you point me to some specific examples? M... [16:12:13] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:12:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:12:58] (03PS3) 10Pmiazga: api-gateway: Make x-ratelimit response header configurable. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) [16:13:04] (03PS1) 10MVernon: re-add hosts to ring, drain 3 more [puppet] - 10https://gerrit.wikimedia.org/r/1202192 (https://phabricator.wikimedia.org/T400876) [16:13:46] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11345408 (10MatthewVernon) [16:14:46] (03PS1) 10Dpogorzelski: knative-serving: add podspec features Why: To allow pods to be scheduled on specific nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202194 (https://phabricator.wikimedia.org/T403599) [16:14:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11345413 (10MatthewVernon) 05Open→03Resolved @Jhancock.wm Thanks! 
We're all done here now :) [16:14:50] !log javiermonton@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:15:08] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11345417 (10MatthewVernon) [16:15:23] !log javiermonton@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:15:29] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11345425 (10herron) [16:15:40] 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new JBOD disk controllers into SM swift backends - https://phabricator.wikimedia.org/T400878#11345429 (10MatthewVernon) All completed now. [16:20:02] (03PS2) 10Brouberol: airflow: define a network policy specific to task pods requiring egress to a proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202186 (https://phabricator.wikimedia.org/T408238) [16:20:02] (03PS2) 10Brouberol: airflow-platform-eng: enabled jobs properly labeled to egress to the urldownloader proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202187 (https://phabricator.wikimedia.org/T408238) [16:20:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T407997)', diff saved to https://phabricator.wikimedia.org/P84927 and previous config saved to /var/cache/conftool/dbconfig/20251105-162032-marostegui.json [16:20:36] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:20:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [16:20:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84928 and previous config saved to 
/var/cache/conftool/dbconfig/20251105-162055-marostegui.json [16:21:22] (03PS3) 10Brouberol: airflow: define a network policy specific to task pods requiring egress to a proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202186 (https://phabricator.wikimedia.org/T408238) [16:21:22] (03PS3) 10Brouberol: airflow-platform-eng: enabled jobs properly labeled to egress to the urldownloader proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202187 (https://phabricator.wikimedia.org/T408238) [16:21:26] (03CR) 10Ottomata: [C:03+1] "One comment suggestion, but LGTM. Very nice!" [alerts] - 10https://gerrit.wikimedia.org/r/1199859 (https://phabricator.wikimedia.org/T405952) (owner: 10TChin) [16:21:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:21:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:21:46] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11345480 (10Xaosflux) My screenshot in the description shows it. Also, it was intermittently jumping all around, including down to zero; perhap... 
[16:22:25] (03CR) 10CDobbins: [C:03+2] dnsrecursor config: fix a few broken settings in the yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1201821 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [16:22:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-codfw and pfw1a-codfw (208.80.153.201) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:22:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/0/1:3 (pfw1-codfw:xe-0/2/0 {#11922_12273-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:22:51] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:22:51] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:24:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11345485 (10cmooney) Awesome @Jclark-ctr thank you! We can probably close this task for now I think, I can set up a new one for the actual... 
[16:24:15] !log add peering to NL-ix route servers from drmrs T386986 [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Connect spare Nokia D2L switches to Spines in eqiad - https://phabricator.wikimedia.org/T409288#11345488 (10Jclark-ctr) 05Open→03Resolved [16:25:04] (03PS1) 10Ahmon Dancy: buildkitd: Bump buildkit image to wmf-v0.25.2 [puppet] - 10https://gerrit.wikimedia.org/r/1202195 (https://phabricator.wikimedia.org/T409313) [16:25:31] (03CR) 10CDobbins: [C:03+1] dnsrecursor config: fix a few broken settings in the yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1201821 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [16:26:14] (03CR) 10Jcrespo: [C:03+1] re-add hosts to ring, drain 3 more [puppet] - 10https://gerrit.wikimedia.org/r/1202192 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [16:26:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:26:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:27:11] !log javiermonton@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [16:27:21] !log javiermonton@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:27:34] !log javiermonton@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [16:27:41] !log javiermonton@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:28:45] topranks: should we be worried about the BGP and switch alerts above? 
[16:29:10] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:29:23] claime: I don't think so, I expect this is because of papaul's upgrade of the payment firewalls in codfw [16:29:31] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:29:36] topranks: ack, was just wondering :) [16:29:46] papaul_: are the above alerts yours? [16:30:00] claime: no probs, we should usually very much worry, these hosts should have been downtimed [16:30:07] but no stress that happens - thanks for being alert! [16:30:38] (03CR) 10Jforrester: [C:03+1] "Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202180 (owner: 10Lucas Werkmeister (WMDE)) [16:30:47] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [16:30:57] topranks: yes [16:31:09] papaul_: ack, thanks for confirming [16:31:10] (03CR) 10MVernon: [C:03+2] re-add hosts to ring, drain 3 more [puppet] - 10https://gerrit.wikimedia.org/r/1202192 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [16:31:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [16:31:20] claime: thank you [16:32:18] (03PS1) 10Federico Ceratto: db2161: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202197 (https://phabricator.wikimedia.org/T406008) [16:32:25] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11345541 (10MatthewVernon) [16:32:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-codfw and pfw1a-codfw (208.80.153.201) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:32:48] (03CR) 10CI reject: [V:04-1] db2161: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202197 
(https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [16:32:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/0/1:3 (pfw1-codfw:xe-0/2/0 {#11922_12273-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:32:51] RESOLVED: [4x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:32:51] RESOLVED: [4x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:34:13] (03PS2) 10Federico Ceratto: db2161: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202197 (https://phabricator.wikimedia.org/T406008) [16:35:01] (03CR) 10Marostegui: [C:03+1] db2161: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202197 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [16:35:37] jouncebot: nowandnext [16:35:37] No deployments scheduled for the next 1 hour(s) and 24 minute(s) [16:35:37] In 1 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1800) [16:35:42] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [16:35:49] might roll out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1202180 if nobody objects [16:35:53] (comment update, should be a no-op) [16:35:56] (03CR) 10BCornwall: [C:03+1] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1201901 (https://phabricator.wikimedia.org/T409255) (owner: 
10Gerrit maintenance bot) [16:36:04] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db2161 - Upgrading db2161.codfw.wmnet [16:36:34] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2161 - Upgrading db2161.codfw.wmnet [16:36:36] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:37:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-codfw and pfw1a-codfw (208.80.153.201) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/0/1:3 (pfw1-codfw:xe-0/2/0 {#11922_12273-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:37:51] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:38:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84930 and previous config saved to /var/cache/conftool/dbconfig/20251105-163801-marostegui.json [16:38:05] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:38:19] (03PS1) 10Clément Goubert: api-gateway: Fix regex for api-gateway metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202200 (https://phabricator.wikimedia.org/T409173) [16:39:34] 
fceratto@cumin1003 major-upgrade (PID 4008283) is awaiting input [16:40:20] (03PS3) 10Kamila Součková: mw-web: Remove the hard-coded k8s version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) [16:42:04] !log pfw1a/b-codfw Junos downgrade complete [16:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:43] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Fix regex for api-gateway metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202200 (https://phabricator.wikimedia.org/T409173) (owner: 10Clément Goubert) [16:43:36] RESOLVED: [4x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:44:13] (03PS1) 10Scott French: mw-parsoid: migrate to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202184 (https://phabricator.wikimedia.org/T405955) [16:44:14] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:44:16] (03PS1) 10Scott French: mw-misc: migrate to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202185 (https://phabricator.wikimedia.org/T405955) [16:44:18] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11345587 (10herron) [16:44:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:44:36] (03Merged) 10jenkins-bot: api-gateway: Fix regex for api-gateway metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202200 (https://phabricator.wikimedia.org/T409173) (owner: 10Clément Goubert) [16:45:36] fceratto@cumin1003 major-upgrade (PID 4008283) is awaiting input [16:46:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 
using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202180 (owner: 10Lucas Werkmeister (WMDE)) [16:46:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:46:31] * Lucas_WMDE deploys the comment update [16:46:36] RESOLVED: [4x] SwitchCoreInterfaceDown: Switch core interface down - fasw1-f5a-codfw:et-0/0/47 (Core: pfw1-codfw:et-0/1/0 {#122505}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:46:39] (03CR) 10Elukey: [C:03+1] knative-serving: add podspec features Why: To allow pods to be scheduled on specific nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202194 (https://phabricator.wikimedia.org/T403599) (owner: 10Dpogorzelski) [16:47:02] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:47:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:47:23] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:47:23] (03CR) 10Elukey: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1202176 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:47:31] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [16:47:35] (03Merged) 10jenkins-bot: Update BetaFeatures comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202180 (owner: 10Lucas Werkmeister (WMDE)) [16:47:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-codfw and pfw1a-codfw (208.80.153.201) - group fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:47:49] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [16:47:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core 
router interface down - cr1-codfw:xe-1/0/1:3 (pfw1-codfw:xe-0/2/0 {#11922_12273-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:48:02] 10SRE-SLO: Sloth: adapt default month view to quarter view - https://phabricator.wikimedia.org/T409312#11345598 (10elukey) The horror query for the 3 months quarterly window would be: ` [16:48:07] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1202180|Update BetaFeatures comments]] [16:50:22] (03CR) 10Federico Ceratto: [C:03+2] db2161: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1202197 (https://phabricator.wikimedia.org/T406008) (owner: 10Federico Ceratto) [16:50:27] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1202180|Update BetaFeatures comments]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:50:56] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11345613 (10hnowlan) 05In progress→03Stalled [16:51:15] (03PS21) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [16:51:28] (03PS1) 10Andrew Bogott: Update config to reflect running state as of now [wikitech-static] - 10https://gerrit.wikimedia.org/r/1202202 [16:51:30] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [16:53:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P84931 and previous config saved to /var/cache/conftool/dbconfig/20251105-165309-marostegui.json [16:54:15] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [16:55:44] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202180|Update BetaFeatures comments]] (duration: 07m 38s) [16:56:04] (03PS1) 10Urbanecm: mediawiki::maintenance::growthexperiments: Correct the logs comment [puppet] - 10https://gerrit.wikimedia.org/r/1202203 [16:56:37] !log dancy@deploy2002 Installing scap version "4.223.0" for 2 host(s) [16:58:23] !log dancy@deploy2002 Installation of scap version "4.223.0" completed for 2 hosts [16:58:31] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [16:58:50] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [16:59:03] (03CR) 10Elukey: [C:03+1] knative-serving: add podspec features Why: To allow pods to be scheduled on specific nodes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202194 (https://phabricator.wikimedia.org/T403599) (owner: 10Dpogorzelski) [17:00:02] (03PS20) 10Btullis: Pin 
the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [17:00:02] (03PS22) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [17:00:02] (03PS21) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [17:00:31] !log disable-puppet on A:cp hosts for haproxy config change [17:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:42] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [17:00:44] (03CR) 10CDanis: [C:03+2] turnilo: x-i-b: cast to NUMBER with Plywood [puppet] - 10https://gerrit.wikimedia.org/r/1202178 (owner: 10CDanis) [17:00:47] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:00:59] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:01:29] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1201844 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:01:33] (03CR) 10Scott French: [C:03+2] P:cache::haproxy: ensure x-requestctl is updated [puppet] - 10https://gerrit.wikimedia.org/r/1201844 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:02:27] (03PS1) 10Andrew Bogott: apache conf: add a ratelimit to most file downloads [wikitech-static] - 10https://gerrit.wikimedia.org/r/1202204 [17:03:06] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Update config to reflect running state as of now [wikitech-static] - 10https://gerrit.wikimedia.org/r/1202202 (owner: 10Andrew Bogott) [17:03:12] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] apache conf: add a ratelimit to most file downloads [wikitech-static] - 10https://gerrit.wikimedia.org/r/1202204 (owner: 10Andrew Bogott) [17:08:02] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:08:07] (03PS1) 10Cathal Mooney: gnmic config: merge juniper and nokia bgp collection again [puppet] - 10https://gerrit.wikimedia.org/r/1202205 (https://phabricator.wikimedia.org/T393996) [17:08:13] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:08:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P84932 and previous config saved to /var/cache/conftool/dbconfig/20251105-170816-marostegui.json [17:08:22] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:08:30] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to 
expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:10:27] (03CR) 10Cathal Mooney: [C:03+2] gnmic config: merge juniper and nokia bgp collection again [puppet] - 10https://gerrit.wikimedia.org/r/1202205 (https://phabricator.wikimedia.org/T393996) (owner: 10Cathal Mooney) [17:10:29] !log rolling run-puppet-agent on A:cp hosts for haproxy config change [17:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:31] (03CR) 10Andrew Bogott: [C:03+2] dnsrecursor config: fix a few broken settings in the yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1201821 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [17:12:52] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db2161 gradually with 4 steps - Migration of db2161.codfw.wmnet completed [17:14:37] (03PS1) 10Scott French: hieradata: use_etcd_known_client_ident pilot on cp2041 (#2) [puppet] - 10https://gerrit.wikimedia.org/r/1202208 (https://phabricator.wikimedia.org/T403220) [17:15:09] (03CR) 10Fabfur: [C:03+1] hieradata: use_etcd_known_client_ident pilot on cp2041 (#2) [puppet] - 10https://gerrit.wikimedia.org/r/1202208 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [17:16:29] (03CR) 10CDobbins: "Should be fixed now. test-cookbook still fails if the host is pooled, but passes if I run it without the pool/depool check." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [17:16:49] (03CR) 10Dzahn: [C:03+2] apt-staging: Relax the cleanup for the incoming queue [puppet] - 10https://gerrit.wikimedia.org/r/1202113 (https://phabricator.wikimedia.org/T409253) (owner: 10Muehlenhoff) [17:19:03] (03CR) 10Dzahn: [C:03+2] tcpproxy: haproxy: log level change to info [puppet] - 10https://gerrit.wikimedia.org/r/1202172 (https://phabricator.wikimedia.org/T408532) (owner: 10CDanis) [17:20:42] (03CR) 10Dzahn: [C:03+2] tcpproxy: haproxy: listen on v4+v6 for both ports [puppet] - 10https://gerrit.wikimedia.org/r/1202163 (https://phabricator.wikimedia.org/T408532) (owner: 10CDanis) [17:23:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84934 and previous config saved to /var/cache/conftool/dbconfig/20251105-172324-marostegui.json [17:23:28] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:23:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance [17:23:43] (03CR) 10Dzahn: [C:03+2] buildkitd: Bump buildkit image to wmf-v0.25.2 [puppet] - 10https://gerrit.wikimedia.org/r/1202195 (https://phabricator.wikimedia.org/T409313) (owner: 10Ahmon Dancy) [17:23:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T407997)', diff saved to https://phabricator.wikimedia.org/P84935 and previous config saved to /var/cache/conftool/dbconfig/20251105-172347-marostegui.json [17:24:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11345778 (10BTullis) Hi @Jclark-ctr - Apologies for the delay. 
I have checked which drive it is and unmounted the volume. ` btullis@an-... [17:24:54] (03PS2) 10CDanis: Add discovery-conftool-state to ignored stale texts [alerts] - 10https://gerrit.wikimedia.org/r/1201852 [17:25:24] (03PS3) 10CDanis: Add discovery-conftool-state to ignored stale texts [alerts] - 10https://gerrit.wikimedia.org/r/1201852 [17:25:27] (03CR) 10CDanis: Add discovery-conftool-state to ignored stale texts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1201852 (owner: 10CDanis) [17:32:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11345807 (10Jclark-ctr) @BTullis thanks! failed drive has been replaced! [17:42:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T407997)', diff saved to https://phabricator.wikimedia.org/P84937 and previous config saved to /var/cache/conftool/dbconfig/20251105-174218-marostegui.json [17:42:22] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:42:40] FIRING: [5x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:34] (03CR) 10Dzahn: "my inline comment is not intended to be a blocker. feel free to merge regardless of the answer." 
[alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [17:47:51] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:48:53] 06SRE, 10LDAP-Access-Requests: Grant Access to WMDE LDAP groups for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11345855 (10Dzahn) 05In progress→03Resolved a:03Dzahn [17:49:07] (03CR) 10AOkoth: "Hey Daniel. Np. I'll try and write something up." [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth) [17:49:17] 06SRE, 10LDAP-Access-Requests: Grant Access to WMDE LDAP groups for vicaplet-wmde - https://phabricator.wikimedia.org/T408920#11345860 (10Dzahn) a:05Dzahn→03None [17:51:51] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:51:51] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11345874 (10Dzahn) a:05Volker_E→03None [17:51:55] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11345875 (10Dzahn) 05Stalled→03Open [17:52:22] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11345899 (10Dzahn) [17:55:58] 06SRE, 06collaboration-services, 
06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11345904 (10Krd) Cannot reproduce that, looks good to me currently. [17:57:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P84939 and previous config saved to /var/cache/conftool/dbconfig/20251105-175726-marostegui.json [17:58:15] (03PS1) 10Ssingh: ncredir: Update donate.wikipedia25.{org,com} redir [puppet] - 10https://gerrit.wikimedia.org/r/1202216 (https://phabricator.wikimedia.org/T408168) [17:58:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2161 gradually with 4 steps - Migration of db2161.codfw.wmnet completed [17:58:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [18:00:05] swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1800). 
[18:00:10] o/ [18:00:22] I'll get started in a few minutes [18:02:30] (03CR) 10Dzahn: [C:03+1] ncredir: Update donate.wikipedia25.{org,com} redir [puppet] - 10https://gerrit.wikimedia.org/r/1202216 (https://phabricator.wikimedia.org/T408168) (owner: 10Ssingh) [18:02:44] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): serve 10% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201697 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:04:36] (03Merged) 10jenkins-bot: mw-(api-ext|web): serve 10% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201697 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:07:11] (03CR) 10CDobbins: [C:03+1] ncredir: Update donate.wikipedia25.{org,com} redir [puppet] - 10https://gerrit.wikimedia.org/r/1202216 (https://phabricator.wikimedia.org/T408168) (owner: 10Ssingh) [18:07:27] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:07:44] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:07:51] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:08:09] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:09:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330 (10ssingh) 03NEW [18:12:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330#11345967 (10ssingh) p:05Triage→03High [18:12:34] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P84941 and previous config saved to /var/cache/conftool/dbconfig/20251105-181233-marostegui.json [18:12:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:46] (03CR) 10Bking: [C:03+1] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [18:14:53] (03CR) 10Ssingh: [C:03+2] ncredir: Update donate.wikipedia25.{org,com} redir [puppet] - 10https://gerrit.wikimedia.org/r/1202216 (https://phabricator.wikimedia.org/T408168) (owner: 10Ssingh) [18:16:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:17:06] (03CR) 10RLazarus: [C:03+1] mw-parsoid: migrate to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202184 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:17:11] (03CR) 10RLazarus: [C:03+1] mw-misc: migrate to PHP 8.3 [puppet] - 
10https://gerrit.wikimedia.org/r/1202185 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:17:14] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11345977 (10thcipriani) >>! In T408924#11339110, @ItamarWMDE wrote: > The scripts being discussed are not the PHP maintenance scripts, but the bash scripts currently invoked by airflow: > >... [18:20:49] (03PS21) 10Btullis: Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) [18:20:49] (03PS23) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [18:20:49] (03PS22) 10Btullis: Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) [18:20:56] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:21:09] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:21:21] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:21:36] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:24:50] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in ms-be1074 - https://phabricator.wikimedia.org/T409040#11345989 (10VRiley-WMF) Had trouble connecting to the unit at first. However, I was able to gain access and upload the TSR report and resubmitted the ticket to Dell. 
[18:24:54] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7558/co" [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [18:25:07] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:25:08] (03CR) 10CDobbins: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7559/console" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [18:25:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:25:22] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:25:38] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:27:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T407997)', diff saved to https://phabricator.wikimedia.org/P84942 and previous config saved to /var/cache/conftool/dbconfig/20251105-182741-marostegui.json [18:27:45] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:27:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance [18:28:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T407997)', diff saved to https://phabricator.wikimedia.org/P84943 and previous config saved to /var/cache/conftool/dbconfig/20251105-182805-marostegui.json [18:28:47] (03CR) 10Btullis: [V:03+1 C:03+2] Pin the version of opensearch wherever it is installed [puppet] - 10https://gerrit.wikimedia.org/r/1196022 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [18:28:53] (03CR) 
10Muehlenhoff: [C:03+1] "Looks good" [alerts] - 10https://gerrit.wikimedia.org/r/1201852 (owner: 10CDanis) [18:29:02] (03CR) 10CDanis: [C:03+2] Add discovery-conftool-state to ignored stale texts [alerts] - 10https://gerrit.wikimedia.org/r/1201852 (owner: 10CDanis) [18:30:11] (03CR) 10Scott French: "Thanks for the review, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1202184 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:30:15] (03CR) 10Scott French: [C:03+2] mw-parsoid: migrate to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202184 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [18:30:41] (03Merged) 10jenkins-bot: Add discovery-conftool-state to ignored stale texts [alerts] - 10https://gerrit.wikimedia.org/r/1201852 (owner: 10CDanis) [18:33:12] (03PS1) 10CDobbins: hieradata: add doh1001 flag for new pdns-rec cfg [puppet] - 10https://gerrit.wikimedia.org/r/1202225 [18:34:12] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:34:15] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRT queue index shows incorrect value - https://phabricator.wikimedia.org/T409135#11346042 (10Geagea) I've checked some: permission-commons ok permission-en (2 vs 1) permission-nl (5 vs 4) permission-zh (13 vs 4) wm-br (5 vs 3)... 
[18:34:21] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:34:32] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:34:42] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:38:07] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7560/co" [puppet] - 10https://gerrit.wikimedia.org/r/1202225 (owner: 10CDobbins) [18:39:41] (03PS24) 10Btullis: Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) [18:39:45] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [18:39:49] !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run to switch mw-parsoid to PHP 8.3 - T405955 [18:39:52] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [18:40:22] !log swfrench@deploy2002 Stopping before sync operations [18:42:09] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [18:42:37] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [18:43:53] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11346104 (10Jhancock.wm) they shipped the drive today after escalating! i'll plug this in first thing when it gets here. should be here thursday. 
[18:45:29] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [18:45:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T407997)', diff saved to https://phabricator.wikimedia.org/P84944 and previous config saved to /var/cache/conftool/dbconfig/20251105-184538-marostegui.json [18:45:42] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:46:16] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [18:51:49] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [18:52:34] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [18:53:07] (03CR) 10Btullis: [C:03+2] Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [18:56:07] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330#11346176 (10cmooney) Thanks for the task @ssingh ! I agree this is definitely a major gap. In terms of the alertmanager rule you list it does make sense we should h... [19:00:05] jeena and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T1900). 
[19:00:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P84945 and previous config saved to /var/cache/conftool/dbconfig/20251105-190046-marostegui.json [19:03:11] I will deploy to group1 now [19:04:50] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202232 (https://phabricator.wikimedia.org/T408271) [19:04:53] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202232 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [19:06:06] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202232 (https://phabricator.wikimedia.org/T408271) (owner: 10TrainBranchBot) [19:08:11] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808#11346214 (10CDanis) **Tracing updates** section LGTM, no config changes needed. Thanks! [19:13:10] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.1 refs T408271 [19:13:14] T408271: 1.46.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T408271 [19:13:44] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330#11346272 (10ssingh) >>! In T409330#11346176, @cmooney wrote: > Thanks for the task @ssingh ! > > I agree this is definitely a major gap. In terms of the alertmanage... 
[19:15:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P84946 and previous config saved to /var/cache/conftool/dbconfig/20251105-191553-marostegui.json [19:31:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T407997)', diff saved to https://phabricator.wikimedia.org/P84947 and previous config saved to /var/cache/conftool/dbconfig/20251105-193102-marostegui.json [19:31:07] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:31:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:31:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T407997)', diff saved to https://phabricator.wikimedia.org/P84948 and previous config saved to /var/cache/conftool/dbconfig/20251105-193126-marostegui.json [19:32:40] (03CR) 10Scott French: [C:03+2] hieradata: use_etcd_known_client_ident pilot on cp2041 (#2) [puppet] - 10https://gerrit.wikimedia.org/r/1202208 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [19:34:05] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:36:46] FIRING: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [19:40:56] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11346394 (10VRiley-WMF) [19:48:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T407997)', diff saved 
to https://phabricator.wikimedia.org/P84949 and previous config saved to /var/cache/conftool/dbconfig/20251105-194850-marostegui.json [19:48:53] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:52:48] (03PS1) 10Dzahn: tcpproxy: allow PRODUCTION_NETWORKS to connect to 29418 [puppet] - 10https://gerrit.wikimedia.org/r/1202261 (https://phabricator.wikimedia.org/T408532) [19:53:29] (03CR) 10Ssingh: [C:03+1] tcpproxy: allow PRODUCTION_NETWORKS to connect to 29418 [puppet] - 10https://gerrit.wikimedia.org/r/1202261 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:55:03] (03CR) 10Dzahn: [C:03+2] tcpproxy: allow PRODUCTION_NETWORKS to connect to 29418 [puppet] - 10https://gerrit.wikimedia.org/r/1202261 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:56:45] RESOLVED: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:03:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P84950 and previous config saved to /var/cache/conftool/dbconfig/20251105-200357-marostegui.json [20:07:46] !log dancy@deploy2002 Installing scap version "4.224.0" for 2 host(s) [20:09:32] !log dancy@deploy2002 Installation of scap version "4.224.0" completed for 2 hosts [20:11:06] (03PS1) 10Scott French: hieradata: end use_etcd_known_client_ident pilot on cp2041 (#2) [puppet] - 10https://gerrit.wikimedia.org/r/1202272 (https://phabricator.wikimedia.org/T403220) [20:17:24] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 57670888 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:18:18] (03CR) 10Scott French: [C:03+2] hieradata: end use_etcd_known_client_ident pilot on cp2041 (#2) 
[puppet] - 10https://gerrit.wikimedia.org/r/1202272 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [20:18:24] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2922392 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:19:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P84951 and previous config saved to /var/cache/conftool/dbconfig/20251105-201905-marostegui.json [20:34:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T407997)', diff saved to https://phabricator.wikimedia.org/P84952 and previous config saved to /var/cache/conftool/dbconfig/20251105-203413-marostegui.json [20:34:17] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:34:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2227.codfw.wmnet with reason: Maintenance [20:34:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T407997)', diff saved to https://phabricator.wikimedia.org/P84953 and previous config saved to /var/cache/conftool/dbconfig/20251105-203438-marostegui.json [20:37:18] (03PS1) 10Catrope: Set up Special:AccountRecovery and enable on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202285 (https://phabricator.wikimedia.org/T399742) [20:38:51] (03CR) 10Reedy: Set up Special:AccountRecovery and enable on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202285 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [20:39:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1202285 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [20:44:20] (03PS2) 10Catrope: Set up Special:AccountRecovery and enable on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202285 (https://phabricator.wikimedia.org/T399742) [20:44:57] (03CR) 10Catrope: Set up Special:AccountRecovery and enable on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202285 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [20:52:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T407997)', diff saved to https://phabricator.wikimedia.org/P84955 and previous config saved to /var/cache/conftool/dbconfig/20251105-205211-marostegui.json [20:52:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T2100). [21:00:05] RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:02:40] I'll self-serve [21:05:25] (03PS1) 10Ebernhardson: dumps: Fix missing trailing slash in cirrus-search-index path [puppet] - 10https://gerrit.wikimedia.org/r/1202289 (https://phabricator.wikimedia.org/T366248) [21:06:59] (03PS2) 10Ebernhardson: dumps: Fix missing trailing slash in cirrus-search-index path [puppet] - 10https://gerrit.wikimedia.org/r/1202289 (https://phabricator.wikimedia.org/T366248) [21:07:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P84957 and previous config saved to /var/cache/conftool/dbconfig/20251105-210718-marostegui.json [21:07:30] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7561/console" [puppet] - 10https://gerrit.wikimedia.org/r/1202289 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [21:09:05] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:09:05] FIRING: [21x] CertAlmostExpired: Certificate for service cr2-eqsin.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:09:12] (03CR) 10Ebernhardson: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7562/co" [puppet] - 10https://gerrit.wikimedia.org/r/1202289 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [21:10:46] (03PS3) 10Ebernhardson: dumps: Fix missing trailing slash in cirrus-search-index path [puppet] - 10https://gerrit.wikimedia.org/r/1202289 (https://phabricator.wikimedia.org/T366248) [21:19:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 
using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202285 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [21:21:16] (03Merged) 10jenkins-bot: Set up Special:AccountRecovery and enable on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202285 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [21:21:48] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1202285|Set up Special:AccountRecovery and enable on testwiki (T399742)]] [21:21:51] T399742: Integrated on-page form for EmailAuth recovery requests - https://phabricator.wikimedia.org/T399742 [21:22:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P84958 and previous config saved to /var/cache/conftool/dbconfig/20251105-212226-marostegui.json [21:23:43] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy7001.magru.wmnet with OS trixie [21:23:56] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11346856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for h... [21:24:23] !log catrope@deploy2002 catrope: Backport for [[gerrit:1202285|Set up Special:AccountRecovery and enable on testwiki (T399742)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:25:30] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11346868 (10Dzahn) We are debugging why things work from SOME of the VMs but not from others..in this pattern: {P84954} [21:25:52] !log catrope@deploy2002 catrope: Continuing with sync [21:30:09] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202285|Set up Special:AccountRecovery and enable on testwiki (T399742)]] (duration: 08m 21s) [21:30:13] T399742: Integrated on-page form for EmailAuth recovery requests - https://phabricator.wikimedia.org/T399742 [21:32:45] FIRING: Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:33:57] (03PS1) 10Cwhite: scap: use new logging-logstash in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1202295 (https://phabricator.wikimedia.org/T409339) [21:35:24] (03CR) 10BryanDavis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1202295 (https://phabricator.wikimedia.org/T409339) (owner: 10Cwhite) [21:37:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T407997)', diff saved to https://phabricator.wikimedia.org/P84960 and previous config saved to /var/cache/conftool/dbconfig/20251105-213734-marostegui.json [21:37:38] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:37:45] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:37:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance [21:39:46] (03CR) 10BryanDavis: [C:03+1] "PCC looks right at 
https://puppet-compiler.wmflabs.org/output/1202295/7563/deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud/inde" [puppet] - 10https://gerrit.wikimedia.org/r/1202295 (https://phabricator.wikimedia.org/T409339) (owner: 10Cwhite) [21:41:47] (03PS1) 10Catrope: Configure HTTP proxy for EmailAuth AccountRecovery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202298 (https://phabricator.wikimedia.org/T399742) [21:42:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202298 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [21:42:40] FIRING: [5x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:42:44] (03CR) 10Cwhite: [C:03+2] scap: use new logging-logstash in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1202295 (https://phabricator.wikimedia.org/T409339) (owner: 10Cwhite) [21:42:55] (03Merged) 10jenkins-bot: Configure HTTP proxy for EmailAuth AccountRecovery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202298 (https://phabricator.wikimedia.org/T399742) (owner: 10Catrope) [21:43:28] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1202298|Configure HTTP proxy for EmailAuth AccountRecovery (T399742)]] [21:43:31] T399742: Integrated on-page form for EmailAuth recovery requests - https://phabricator.wikimedia.org/T399742 [21:45:59] !log catrope@deploy2002 catrope: Backport for [[gerrit:1202298|Configure HTTP proxy for EmailAuth AccountRecovery (T399742)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:47:13] !log catrope@deploy2002 catrope: Continuing with sync [21:48:49] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11347016 (10herron) [21:49:06] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11347017 (10herron) [21:51:28] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202298|Configure HTTP proxy for EmailAuth AccountRecovery (T399742)]] (duration: 08m 01s) [21:51:31] T399742: Integrated on-page form for EmailAuth recovery requests - https://phabricator.wikimedia.org/T399742 [21:51:41] (03PS1) 10DLynch: Enable editcheck addReference a/b test on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202301 (https://phabricator.wikimedia.org/T406134) [21:52:45] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:53:57] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy7001.magru.wmnet with reason: host reimage [21:57:45] RESOLVED: Traffic bill over quota: Alert for device cr2-esams.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [21:58:11] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy7001.magru.wmnet with reason: host reimage [21:59:53] (03CR) 10Ryan Kemper: [C:03+2] dumps: Fix missing trailing slash in cirrus-search-index path [puppet] - 10https://gerrit.wikimedia.org/r/1202289 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T2200) [22:05:28] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11347088 (10herron) sloth editcheck has been 
backfilled for range --start=2025-06-01T00:00:00Z --end=2025-11-01T00:00:00Z ` ts=2025-11-05T21:23:01.789103977Z caller=sidecar.go:254 level=info msg="successfully... [22:05:49] (03CR) 10Esanders: [C:03+1] Enable editcheck addReference a/b test on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202301 (https://phabricator.wikimedia.org/T406134) (owner: 10DLynch) [22:05:54] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11347091 (10herron) [22:08:21] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:09:05] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:10:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:12:16] !log T366248 `sudo rm -rfv /srv/dumps/xmldatadumps/public/other/cirrus_search_index/cirrus-search-index/` on `clouddumps100[1,2].wikimedia.org` [22:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:19] T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script - https://phabricator.wikimedia.org/T366248 [22:13:02] !log [WDQS] Restarting blazegraph across all codfw `wdqs-main` hosts, hoping it resolves the lag issues although it's likely that it won't [22:13:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:05] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:14:36] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy7001.magru.wmnet with OS trixie [22:14:48] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11347142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host... [22:17:50] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.ban Banning hosts: foobar1001.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860 [22:17:50] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: foobar1001.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860 [22:17:53] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:18:14] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860 [22:18:17] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860 [22:19:20] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860 [22:19:20] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) 
Banning hosts: cirrussearch1068.eqiad.wmnet for test new spicerack elasticsearch library - ryankemper@cumin1002 - T390860 [22:19:32] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [22:19:34] !log ryankemper@cumin1002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [22:21:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:22:36] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (test new spicerack version) - ryankemper@cumin1002 - T390860 [22:25:29] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Update server provision script to support Nokia switches - https://phabricator.wikimedia.org/T405637#11347169 (10cmooney) 05Open→03Resolved This one is complete, was not much to do. 
[22:25:47] (03PS1) 10Cathal Mooney: Refactor of move_server and script to move selective hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) [22:26:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:28:38] (03PS2) 10Scott French: mw-misc: migrate to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202185 (https://phabricator.wikimedia.org/T405955) [22:28:39] (03PS1) 10Scott French: mw-wikifunctions: migrate to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202315 (https://phabricator.wikimedia.org/T405955) [22:29:57] (03CR) 10Cathal Mooney: Refactor of move_server and script to move selective hosts (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) (owner: 10Cathal Mooney) [22:33:11] (03PS2) 10Cathal Mooney: Refactor of move_server and script to move selective hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1202313 (https://phabricator.wikimedia.org/T405640) [22:41:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:46:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:47:02] (03PS1) 10Tim Starling: Limit sitemap namespaces to content namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202320 (https://phabricator.wikimedia.org/T407127) [22:48:14] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart (test new spicerack version) - ryankemper@cumin1002 - T390860 [22:48:17] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:51:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:51:20] (03PS4) 10Bking: WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [22:51:56] (03CR) 10CI reject: [V:04-1] WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [22:52:10] (03PS5) 10Bking: WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [22:52:48] (03PS1) 10Ryan Kemper: search: bring cirrussearch2089 back into service [puppet] - 10https://gerrit.wikimedia.org/r/1202322 (https://phabricator.wikimedia.org/T399943) [22:52:53] (03CR) 10CI reject: [V:04-1] WDQS: Log `x-ja3n` `x-is-browser` 
`x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [22:53:20] (03PS1) 10Scott French: mw-(api-ext|web): serve 25% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202321 (https://phabricator.wikimedia.org/T405955) [22:53:44] (03CR) 10Bking: [C:03+2] search: bring cirrussearch2089 back into service [puppet] - 10https://gerrit.wikimedia.org/r/1202322 (https://phabricator.wikimedia.org/T399943) (owner: 10Ryan Kemper) [22:55:36] (03PS6) 10Bking: WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [22:58:36] !log ryankemper@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2089.codfw.wmnet [22:59:09] (03CR) 10RLazarus: [C:03+1] mw-(api-ext|web): serve 25% of residual traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202321 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [22:59:10] (03PS3) 10Scott French: deployment_server: migrate mw-misc to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202185 (https://phabricator.wikimedia.org/T405955) [22:59:10] (03PS2) 10Scott French: deployment_server: migrate mw-wikifunctions to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202315 (https://phabricator.wikimedia.org/T405955) [22:59:34] (03CR) 10RLazarus: [C:03+1] deployment_server: migrate mw-wikifunctions to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202315 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [23:00:03] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [23:00:05] Deploy window Web Team deployment window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T2300) [23:00:06] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [23:00:16] !log ryankemper@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [23:00:54] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [23:02:08] (03PS7) 10Ryan Kemper: WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [23:04:46] (03PS8) 10Bking: WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [23:06:17] (03PS1) 10Aaron Schulz: restgateway: make spec-json-wikimedia catch non-www domain too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202323 (https://phabricator.wikimedia.org/T396805) [23:06:30] (03CR) 10Ryan Kemper: [C:03+1] WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [23:06:49] jouncebot: nowandnext [23:06:49] For the next 0 hour(s) and 53 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251105T2300) [23:06:49] In 7 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T0700) [23:06:49] In 7 hour(s) and 53 minute(s): Primary database switchover 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251106T0700) [23:07:12] * swfrench-wmf suspects the Web Team will not be using this window [23:08:28] in which case, I might sneak in a narrowly scoped change to switch mw-misc over to 8.3 [23:08:34] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1202185 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [23:08:41] (03CR) 10Scott French: [C:03+2] deployment_server: migrate mw-misc to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1202185 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [23:09:24] (03CR) 10Bking: [C:03+2] WDQS: Log `x-ja3n` `x-is-browser` `x-is-client-ip`in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1201734 (https://phabricator.wikimedia.org/T408123) (owner: 10Stevemunene) [23:12:54] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11347371 (10Dzahn) a:05thcipriani→03None [23:12:56] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11347372 (10Dzahn) 05Stalled→03Open [23:13:17] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11347373 (10Dzahn) [23:13:58] FYI, I'm waiting on a puppet-agent run on the deployment host, after which I'll be running scap and some manual helmfile applies. I'll follow up here when all is clear. [23:16:21] for some reason that one feels like the slowest puppet run of all [23:16:43] yeah it super is, I keep meaning to take an afternoon and find out what it's doing in there [23:16:44] almost done ... 
[23:17:00] !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run to switch mw-misc to PHP 8.3 - T405955 [23:17:04] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [23:17:26] someday if it takes more than 30 minutes we'll be in real trouble :) [23:17:32] !log swfrench@deploy2002 Stopping before sync operations [23:17:51] lol [23:18:11] alright, scap done. now some `helmfile`'ing. [23:19:25] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [23:19:26] when creating a VM in magru.. for some reason, probably related to routed ganeti and DHCP.. the machine ends up with only IPv4 and an "fe80" link-local address bound on inet6 but not the real IPv6 it should be getting .. [23:19:49] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [23:25:08] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11347390 (10Dzahn) 05Resolved→03Open [23:25:55] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [23:26:19] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [23:30:45] okay, I think I've kicked all the tires that can be kicked at this point, and generic service health looks good too. 
[23:30:53] I believe I'm done for now :) [23:33:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [23:36:48] (03PS1) 10DLynch: Edit check: allow any check to be an a/b test including default ones [extensions/VisualEditor] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1202331 (https://phabricator.wikimedia.org/T406134) [23:45:13] (03PS2) 10Aaron Schulz: restgateway: make spec-json-wikimedia catch non-www domain too [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202323 (https://phabricator.wikimedia.org/T396805) [23:46:15] (03PS1) 10Scott French: hiera: enable haproxy known-client identification [puppet] - 10https://gerrit.wikimedia.org/r/1202306 (https://phabricator.wikimedia.org/T403220) [23:46:44] (03CR) 10Scott French: "Manual PCC run looks good: https://puppet-compiler.wmflabs.org/output/1202306/7564/" [puppet] - 10https://gerrit.wikimedia.org/r/1202306 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [23:58:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201826 (owner: 10Aaron Schulz)