[00:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.26% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:05:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T415786)', diff saved to https://phabricator.wikimedia.org/P88571 and previous config saved to /var/cache/conftool/dbconfig/20260204-000501-marostegui.json
[00:05:04] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[00:05:08] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1001.eqiad.wmnet with OS trixie
[00:05:16] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[00:05:18] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581394 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1001.eqiad.wmnet with OS trixie
[00:05:24] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581395 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[00:09:46] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[00:09:55] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[00:11:29] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1004.eqiad.wmnet with OS trixie
[00:11:38] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581404 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1004.eqiad.wmnet with OS trixie
[00:11:42] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[00:11:49] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581405 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[00:13:34] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[00:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:15:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P88572 and previous config saved to /var/cache/conftool/dbconfig/20260204-001509-marostegui.json
[00:17:33] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[00:17:50] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1001.eqiad.wmnet with reason: host reimage
[00:17:54] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1002.eqiad.wmnet with reason: host reimage
[00:21:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1001.eqiad.wmnet with reason: host reimage
[00:23:37] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1004.eqiad.wmnet with reason: host reimage
[00:24:14] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1003.eqiad.wmnet with reason: host reimage
[00:25:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P88573 and previous config saved to /var/cache/conftool/dbconfig/20260204-002518-marostegui.json
[00:25:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1002.eqiad.wmnet with reason: host reimage
[00:29:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1003.eqiad.wmnet with reason: host reimage
[00:30:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie
[00:30:36] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie completed: - tools-k8s-ctrl1002 (**PASS**) - Down...
[00:33:11] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:33:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie
[00:33:45] (PS1) Dzahn: wmnet: upgrade vrts from the "without multiple backends" section [dns] - https://gerrit.wikimedia.org/r/1236384
[00:33:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1004.eqiad.wmnet with reason: host reimage
[00:33:48] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581536 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie completed: - tools-k8s-ctrl1001 (**PASS**) - Down...
[00:34:19] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:35:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T415786)', diff saved to https://phabricator.wikimedia.org/P88574 and previous config saved to /var/cache/conftool/dbconfig/20260204-003526-marostegui.json
[00:35:30] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[00:35:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[00:35:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T415786)', diff saved to https://phabricator.wikimedia.org/P88575 and previous config saved to /var/cache/conftool/dbconfig/20260204-003551-marostegui.json
[00:37:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1001.eqiad.wmnet with OS trixie
[00:37:54] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581544 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1001.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1001 (**PASS**) -...
[00:38:09] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:40:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236385
[00:40:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236385 (owner: TrainBranchBot)
[00:40:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:41:22] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:42:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1002.eqiad.wmnet with OS trixie
[00:42:30] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:42:30] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581564 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1002.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1002 (**PASS**) -...
[00:44:27] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1005.eqiad.wmnet with OS trixie
[00:44:40] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581583 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1005.eqiad.wmnet with OS trixie
[00:45:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1006.eqiad.wmnet with OS trixie
[00:45:10] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581584 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1006.eqiad.wmnet with OS trixie
[00:45:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1003.eqiad.wmnet with OS trixie
[00:45:19] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581585 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1003.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1003 (**PASS**) -...
[00:45:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:46:07] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1007.eqiad.wmnet with OS trixie
[00:46:14] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581586 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1007.eqiad.wmnet with OS trixie
[00:49:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[00:49:36] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1004.eqiad.wmnet with OS trixie
[00:49:43] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581590 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1004.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1004 (**PASS**) -...
[00:50:09] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-worker1008.eqiad.wmnet with OS trixie
[00:50:17] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581592 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-worker1008.eqiad.wmnet with OS trixie
[00:51:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1236385 (owner: TrainBranchBot)
[00:54:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T415786)', diff saved to https://phabricator.wikimedia.org/P88576 and previous config saved to /var/cache/conftool/dbconfig/20260204-005419-marostegui.json
[00:54:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[00:55:36] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1005.eqiad.wmnet with reason: host reimage
[00:56:28] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1006.eqiad.wmnet with reason: host reimage
[00:57:31] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1007.eqiad.wmnet with reason: host reimage
[00:59:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1005.eqiad.wmnet with reason: host reimage
[01:00:57] (PS1) Dzahn: zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119)
[01:01:09] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on tools-k8s-worker1008.eqiad.wmnet with reason: host reimage
[01:01:26] (CR) CI reject: [V:-1] zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:03:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1007.eqiad.wmnet with reason: host reimage
[01:05:53] (PS2) Dzahn: zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119)
[01:07:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1008.eqiad.wmnet with reason: host reimage
[01:09:16] (PS1) Ladsgroup: UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236387 (https://phabricator.wikimedia.org/T414080)
[01:09:27] (PS1) Ladsgroup: UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236388 (https://phabricator.wikimedia.org/T414080)
[01:09:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P88577 and previous config saved to /var/cache/conftool/dbconfig/20260204-010928-marostegui.json
[01:10:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236389
[01:10:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236389 (owner: TrainBranchBot)
[01:10:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tools-k8s-worker1006.eqiad.wmnet with reason: host reimage
[01:11:20] jouncebot: nowandnext
[01:11:20] No deployments scheduled for the next 5 hour(s) and 48 minute(s)
[01:11:20] In 5 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0700)
[01:11:29] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1236386/7970/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:11:30] (CR) Ladsgroup: [C:+2] UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236387 (https://phabricator.wikimedia.org/T414080) (owner: Ladsgroup)
[01:11:33] (CR) Ladsgroup: [C:+2] UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236388 (https://phabricator.wikimedia.org/T414080) (owner: Ladsgroup)
[01:12:22] (PS3) Dzahn: zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119)
[01:15:33] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:15:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:15:54] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1005.eqiad.wmnet with OS trixie
[01:16:01] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581623 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1005.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1005 (**PASS**) -...
[01:16:56] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581624 (Jclark-ctr)
[01:19:26] (PS1) Dzahn: zuul: move cert paths to role level, drop host-name based config [puppet] - https://gerrit.wikimedia.org/r/1236390 (https://phabricator.wikimedia.org/T405119)
[01:19:58] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:20:37] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1236386/7970/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:20:50] (CR) Dzahn: [V:+1] "so far so good, but what about the truststore path" [puppet] - https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:20:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:20:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1007.eqiad.wmnet with OS trixie
[01:21:04] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581632 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1007.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1007 (**PASS**) -...
[01:21:11] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/1236386/7970/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1236390 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn)
[01:21:56] (Merged) jenkins-bot: UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - https://gerrit.wikimedia.org/r/1236387 (https://phabricator.wikimedia.org/T414080) (owner: Ladsgroup)
[01:22:20] (Merged) jenkins-bot: UserImpact: Remove zeros in per-article view stats [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1236388 (https://phabricator.wikimedia.org/T414080) (owner: Ladsgroup)
[01:23:38] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:24:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:24:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1008.eqiad.wmnet with OS trixie
[01:24:12] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581638 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1008.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1008 (**PASS**) -...
[01:24:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P88578 and previous config saved to /var/cache/conftool/dbconfig/20260204-012436-marostegui.json
[01:25:32] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1236387|UserImpact: Remove zeros in per-article view stats (T414080)]], [[gerrit:1236388|UserImpact: Remove zeros in per-article view stats (T414080)]]
[01:25:35] T414080: x1 increase in writes results in a large increase of binlog files (over 2000) - https://phabricator.wikimedia.org/T414080
[01:26:49] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:27:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[01:27:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tools-k8s-worker1006.eqiad.wmnet with OS trixie
[01:27:33] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581642 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host tools-k8s-worker1006.eqiad.wmnet with OS trixie completed: - tools-k8s-worker1006 (**PASS**) -...
[01:29:37] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1236387|UserImpact: Remove zeros in per-article view stats (T414080)]], [[gerrit:1236388|UserImpact: Remove zeros in per-article view stats (T414080)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:29:59] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[01:34:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1236389 (owner: TrainBranchBot)
[01:36:10] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236387|UserImpact: Remove zeros in per-article view stats (T414080)]], [[gerrit:1236388|UserImpact: Remove zeros in per-article view stats (T414080)]] (duration: 10m 38s)
[01:36:13] T414080: x1 increase in writes results in a large increase of binlog files (over 2000) - https://phabricator.wikimedia.org/T414080
[01:39:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T415786)', diff saved to https://phabricator.wikimedia.org/P88579 and previous config saved to /var/cache/conftool/dbconfig/20260204-013944-marostegui.json
[01:39:48] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[01:39:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[01:39:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T415786)', diff saved to https://phabricator.wikimedia.org/P88580 and previous config saved to /var/cache/conftool/dbconfig/20260204-013958-marostegui.json
[01:41:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T415786)', diff saved to https://phabricator.wikimedia.org/P88581 and previous config saved to /var/cache/conftool/dbconfig/20260204-014127-marostegui.json
[01:48:40] FIRING: SystemdUnitFailed: wmf_auto_restart_kerberos_rsync.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:55:20] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581667 (Jclark-ctr) Open→Resolved
[01:56:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P88582 and previous config saved to /var/cache/conftool/dbconfig/20260204-015635-marostegui.json
[02:00:51] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:06:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T415786)', diff saved to https://phabricator.wikimedia.org/P88583 and previous config saved to /var/cache/conftool/dbconfig/20260204-020609-marostegui.json
[02:06:13] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[02:11:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P88584 and previous config saved to /var/cache/conftool/dbconfig/20260204-021144-marostegui.json
[02:13:42] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 50s)
[02:16:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P88585 and previous config saved to /var/cache/conftool/dbconfig/20260204-021617-marostegui.json
[02:26:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P88586 and previous config saved to /var/cache/conftool/dbconfig/20260204-022626-marostegui.json
[02:26:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T415786)', diff saved to https://phabricator.wikimedia.org/P88587 and previous config saved to /var/cache/conftool/dbconfig/20260204-022652-marostegui.json
[02:26:56] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[02:27:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2188.codfw.wmnet with reason: Maintenance
[02:27:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T415786)', diff saved to https://phabricator.wikimedia.org/P88588 and previous config saved to /var/cache/conftool/dbconfig/20260204-022717-marostegui.json
[02:36:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T415786)', diff saved to https://phabricator.wikimedia.org/P88589 and previous config saved to /var/cache/conftool/dbconfig/20260204-023634-marostegui.json
[02:36:38] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[02:36:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[02:36:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T415786)', diff saved to https://phabricator.wikimedia.org/P88590 and previous config saved to /var/cache/conftool/dbconfig/20260204-023659-marostegui.json
[02:45:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T415786)', diff saved to https://phabricator.wikimedia.org/P88591 and previous config saved to /var/cache/conftool/dbconfig/20260204-024521-marostegui.json
[02:45:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[03:00:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P88592 and previous config saved to /var/cache/conftool/dbconfig/20260204-030029-marostegui.json
[03:10:35] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875#11581780 (AntiCompositeNumber)
[03:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[03:15:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P88593 and previous config saved to /var/cache/conftool/dbconfig/20260204-031537-marostegui.json
[03:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:30:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T415786)', diff saved to https://phabricator.wikimedia.org/P88594 and previous config saved to /var/cache/conftool/dbconfig/20260204-033046-marostegui.json
[03:30:49] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[03:31:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[03:31:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88595 and previous config saved to /var/cache/conftool/dbconfig/20260204-033110-marostegui.json
[03:56:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T415786)', diff saved to https://phabricator.wikimedia.org/P88596 and previous config saved to /var/cache/conftool/dbconfig/20260204-035612-marostegui.json
[03:56:16] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[04:09:29] SRE-SLO, Observability-Alerting, Patch-For-Review, SRE Observability (FY2025/2026-Q3): sloth deployment - https://phabricator.wikimedia.org/T414579#11581798 (herron) >! In T414579#11581131, @tappof wrote: >> * Templates SLO manifests and allows default values (e.g. default alert state) >> * Allow...
[04:09:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T415786)', diff saved to https://phabricator.wikimedia.org/P88597 and previous config saved to /var/cache/conftool/dbconfig/20260204-040933-marostegui.json
[04:09:38] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[04:11:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P88598 and previous config saved to /var/cache/conftool/dbconfig/20260204-041121-marostegui.json
[04:19:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P88599 and previous config saved to /var/cache/conftool/dbconfig/20260204-041941-marostegui.json
[04:26:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P88600 and previous config saved to /var/cache/conftool/dbconfig/20260204-042629-marostegui.json
[04:29:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P88601 and previous config saved to /var/cache/conftool/dbconfig/20260204-042950-marostegui.json
[04:39:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T415786)', diff saved to https://phabricator.wikimedia.org/P88602 and previous config saved to /var/cache/conftool/dbconfig/20260204-043958-marostegui.json
[04:40:02] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[04:40:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[04:40:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T415786)', diff saved to https://phabricator.wikimedia.org/P88603 and previous config saved to /var/cache/conftool/dbconfig/20260204-044022-marostegui.json
[04:41:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T415786)', diff saved to https://phabricator.wikimedia.org/P88604 and previous config saved to /var/cache/conftool/dbconfig/20260204-044137-marostegui.json
[04:41:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2202.codfw.wmnet with reason: Maintenance
[04:59:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88605 and previous config saved to /var/cache/conftool/dbconfig/20260204-045953-marostegui.json
[04:59:57] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[05:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P88606 and previous config saved to /var/cache/conftool/dbconfig/20260204-051501-marostegui.json
[05:26:27] (PS1) QChris: Add .gitreview [slothslos] - https://gerrit.wikimedia.org/r/1236432
[05:26:27] (CR) QChris: [V:+2 C:+2] Add .gitreview [slothslos] - https://gerrit.wikimedia.org/r/1236432 (owner: QChris)
[05:30:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P88607 and previous config saved to /var/cache/conftool/dbconfig/20260204-053009-marostegui.json
[05:34:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:45:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88608 and previous config saved to /var/cache/conftool/dbconfig/20260204-054518-marostegui.json
[05:45:21] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[05:45:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[05:45:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T415786)', diff saved to https://phabricator.wikimedia.org/P88609 and previous config saved to /var/cache/conftool/dbconfig/20260204-054542-marostegui.json
[05:48:40] FIRING: SystemdUnitFailed: wmf_auto_restart_kerberos_rsync.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:05:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2212.codfw.wmnet with reason: Maintenance
[06:05:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2212 (T415786)', diff
saved to https://phabricator.wikimedia.org/P88610 and previous config saved to /var/cache/conftool/dbconfig/20260204-060516-marostegui.json [06:05:20] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [06:10:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T415786)', diff saved to https://phabricator.wikimedia.org/P88611 and previous config saved to /var/cache/conftool/dbconfig/20260204-061047-marostegui.json [06:10:51] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [06:11:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2207 with weight 0 T416300', diff saved to https://phabricator.wikimedia.org/P88612 and previous config saved to /var/cache/conftool/dbconfig/20260204-061122-marostegui.json [06:11:26] T416300: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T416300 [06:11:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T416300 [06:11:53] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1236109 (https://phabricator.wikimedia.org/T416300) (owner: 10Gerrit maintenance bot) [06:12:59] !log Starting s2 codfw failover from db2204 to db2207 - T416300 [06:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:16:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s2 codfw as read-only for maintenance - T416300', diff saved to https://phabricator.wikimedia.org/P88613 and previous config saved to 
/var/cache/conftool/dbconfig/20260204-061613-marostegui.json [06:16:17] (03CR) 10Ayounsi: [C:03+1] DNS: Enable Bird 2.18 for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1228560 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [06:16:31] (03CR) 10Ayounsi: [C:03+1] "lgtm but leaving the last call to Sukhe" [puppet] - 10https://gerrit.wikimedia.org/r/1228560 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [06:16:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2207 to s2 primary and set section read-write T416300', diff saved to https://phabricator.wikimedia.org/P88614 and previous config saved to /var/cache/conftool/dbconfig/20260204-061637-marostegui.json [06:16:41] T416300: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T416300 [06:16:58] (03CR) 10Marostegui: [C:03+2] wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1236110 (https://phabricator.wikimedia.org/T416300) (owner: 10Gerrit maintenance bot) [06:17:04] !log marostegui@dns1006 START - running authdns-update [06:17:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2204 T416300', diff saved to https://phabricator.wikimedia.org/P88615 and previous config saved to /var/cache/conftool/dbconfig/20260204-061739-marostegui.json [06:18:07] !log marostegui@dns1006 END - running authdns-update [06:20:09] (03PS1) 10Marostegui: db2204: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1236602 (https://phabricator.wikimedia.org/T415786) [06:20:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P88616 and previous config saved to /var/cache/conftool/dbconfig/20260204-062055-marostegui.json [06:30:53] 06SRE, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: On using Wikimedia Maps to build Kiwix Openstreetmap ZIMs - https://phabricator.wikimedia.org/T416374#11581910 (10Bugreporter) This does not 
need to add a whitelist. Instead you need to set a proper referer when fetching tiles. [06:31:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P88617 and previous config saved to /var/cache/conftool/dbconfig/20260204-063103-marostegui.json [06:34:46] (03CR) 10Marostegui: [C:03+2] db2204: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1236602 (https://phabricator.wikimedia.org/T415786) (owner: 10Marostegui) [06:41:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T415786)', diff saved to https://phabricator.wikimedia.org/P88618 and previous config saved to /var/cache/conftool/dbconfig/20260204-064107-marostegui.json [06:41:12] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [06:41:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T415786)', diff saved to https://phabricator.wikimedia.org/P88619 and previous config saved to /var/cache/conftool/dbconfig/20260204-064118-marostegui.json [06:41:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [06:44:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [06:56:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P88620 and previous config saved to /var/cache/conftool/dbconfig/20260204-065616-marostegui.json [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0700) [07:11:25] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P88621 and previous config saved to /var/cache/conftool/dbconfig/20260204-071124-marostegui.json [07:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [07:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:26:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T415786)', diff saved to https://phabricator.wikimedia.org/P88622 and previous config saved to /var/cache/conftool/dbconfig/20260204-072632-marostegui.json [07:26:36] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [07:26:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1253.eqiad.wmnet with reason: Maintenance [07:26:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T415786)', diff saved to https://phabricator.wikimedia.org/P88623 and previous config saved to /var/cache/conftool/dbconfig/20260204-072658-marostegui.json [07:34:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2204.codfw.wmnet with reason: Schema change [07:35:41] !log Deploy schema change on db2204 (old s2 codfw master) T415786 [07:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:45] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [07:37:35] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2212 (T415786)', diff saved to https://phabricator.wikimedia.org/P88624 and previous config saved to /var/cache/conftool/dbconfig/20260204-073735-marostegui.json [07:39:07] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [07:49:46] (03PS1) 10Muehlenhoff: kerberos::kadminserver: Fix service name [puppet] - 10https://gerrit.wikimedia.org/r/1236622 [07:50:43] (03CR) 10Tiziano Fogli: [C:03+2] centralauth: add recording rules for grafana widgets (write) [puppet] - 10https://gerrit.wikimedia.org/r/1236233 (https://phabricator.wikimedia.org/T415035) (owner: 10Tiziano Fogli) [07:52:40] (03CR) 10Muehlenhoff: [C:03+2] kerberos::kadminserver: Fix service name [puppet] - 10https://gerrit.wikimedia.org/r/1236622 (owner: 10Muehlenhoff) [07:52:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P88625 and previous config saved to /var/cache/conftool/dbconfig/20260204-075243-marostegui.json [07:57:36] (03PS1) 10Muehlenhoff: Make bitu-account-managers manageable in idm.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1236652 [07:59:15] (03PS1) 10Slyngshede: P:idm bitu-account-managers permission [puppet] - 10https://gerrit.wikimedia.org/r/1236657 [08:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:01:23] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1236657 (owner: 10Slyngshede) [08:01:44] (03Abandoned) 10Muehlenhoff: Make bitu-account-managers manageable in idm.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1236652 (owner: 10Muehlenhoff) [08:03:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_kerberos_rsync.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:06] (03CR) 10Slyngshede: [C:03+2] P:idm bitu-account-managers permission [puppet] - 10https://gerrit.wikimedia.org/r/1236657 (owner: 10Slyngshede) [08:07:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P88626 and previous config saved to /var/cache/conftool/dbconfig/20260204-080751-marostegui.json [08:08:03] (03CR) 10Elukey: [C:03+1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236333 (owner: 10Muehlenhoff) [08:08:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [08:09:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11582089 (10elukey) 05Open→03Resolved a:03elukey [08:12:22] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1235781 (owner: 10L10n-bot) [08:13:43] (03PS1) 10Slyngshede: P:idm remove approver [puppet] - 10https://gerrit.wikimedia.org/r/1236668 [08:19:36] (03CR) 10Tiziano Fogli: "A couple of considerations:" [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [08:22:08] jouncebot: now [08:22:08] For the next 0 hour(s) and 37 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0800) [08:23:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T415786)', diff saved to https://phabricator.wikimedia.org/P88627 and previous config saved to /var/cache/conftool/dbconfig/20260204-082259-marostegui.json [08:23:03] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [08:23:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2216.codfw.wmnet with reason: Maintenance [08:23:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T415786)', diff saved to https://phabricator.wikimedia.org/P88628 and previous config saved to /var/cache/conftool/dbconfig/20260204-082324-marostegui.json [08:23:36] Amir1, urbanecm: I'm going to backport a patch that missed the train. I'll self service [08:24:31] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11582113 (10Jacob_WMDE) Hi @Dzahn, I've just emailed Katie, I'll let you know once this step is complete. 
[08:24:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T415786)', diff saved to https://phabricator.wikimedia.org/P88629 and previous config saved to /var/cache/conftool/dbconfig/20260204-082450-marostegui.json [08:25:35] (03PS1) 10Phuedx: ext.wikimediaEvents: Add Test Kitchen new external path test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236669 (https://phabricator.wikimedia.org/T415708) [08:29:19] (03CR) 10CI reject: [V:04-1] ext.wikimediaEvents: Add Test Kitchen new external path test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236669 (https://phabricator.wikimedia.org/T415708) (owner: 10Phuedx) [08:31:46] (03CR) 10Phuedx: "Recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236669 (https://phabricator.wikimedia.org/T415708) (owner: 10Phuedx) [08:38:16] OK. I'm not going to backport the change yet. There's a test failure that I'll need to dig into [08:38:30] (03PS1) 10Elukey: installserver: add EFI preseed config for ms-fe102[14] [puppet] - 10https://gerrit.wikimedia.org/r/1236671 (https://phabricator.wikimedia.org/T416245) [08:39:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P88630 and previous config saved to /var/cache/conftool/dbconfig/20260204-083958-marostegui.json [08:40:25] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:40:29] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:41:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:41:36] 
(03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1236671 (https://phabricator.wikimedia.org/T416245) (owner: 10Elukey) [08:42:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:44:56] (03CR) 10Elukey: [C:03+2] installserver: add EFI preseed config for ms-fe102[14] [puppet] - 10https://gerrit.wikimedia.org/r/1236671 (https://phabricator.wikimedia.org/T416245) (owner: 10Elukey) [08:47:47] phuedx: hi, take your time! :) I am the one running the MW train this week. I haven't even started my daily routine yet [08:47:47] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11582175 (10elukey) @Jclark-ctr Matthew is out this week, I just merged a change that should unblock you. Lemme know how it goes! 
[08:49:41] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:50:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:51:10] FIRING: [8x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:52:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:53:49] (03PS1) 10Elukey: services: upgrade thumbor's haproxy container to Bookworm and 2.8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236673 [08:55:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P88631 and previous config saved to /var/cache/conftool/dbconfig/20260204-085506-marostegui.json [08:59:25] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:59:29] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:59:41] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T0900) [09:00:29] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:01:10] FIRING: [9x] BFDdown: 
BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:02:39] RESOLVED: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:05:43] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:06:10] RESOLVED: [9x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:06:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:06:40] FIRING: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:07:54] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:08:54] (03CR) 10Gehel: [C:04-1] "See comments inline." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [09:10:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T415786)', diff saved to https://phabricator.wikimedia.org/P88632 and previous config saved to /var/cache/conftool/dbconfig/20260204-091015-marostegui.json [09:10:18] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:10:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:11:16] (03CR) 10Jelto: [C:03+1] "lgtm now, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [09:11:25] FIRING: [9x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:12:29] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:12:43] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:12:52] phuedx: about your WikimediaEvents patch, I am not sure it is the cause of the CI failure since the error seems to be in CheckUser. It might be missing an extension [09:13:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11582231 (10Gehel) a:05Jclark-ctr→03BTullis [09:13:45] hashar: I think so too but I don't want this to block you from deploying the train. 
I'll abandon the patch for now [09:13:55] The patch that I'm trying to backport is in -wmf.14 anyway so [09:14:07] (03Abandoned) 10Phuedx: ext.wikimediaEvents: Add Test Kitchen new external path test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236669 (https://phabricator.wikimedia.org/T415708) (owner: 10Phuedx) [09:14:38] I might have broken it while removing recursive injection of dependencies, or something changed in CheckUser that suddenly hard-requires another ext [09:14:39] :/ [09:16:25] RESOLVED: [9x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:16:58] (03CR) 10Dzahn: [C:03+2] "also needs a check if this is the active host around it - quickdatacopy only installs rsync service where needed - unless we change that" [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:17:54] RESOLVED: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:18:07] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06ServiceOps new, and 6 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451 (10Blake) 03NEW [09:18:23] phuedx: confirmed, I ran the CI job for WikimediaEvents @ wmf/1.46.0-wmf.13 and it fails the same way https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83/36726//console [09:18:31] and I am pretty sure that is due to CheckUser [09:27:13] (03PS1) 10Dzahn: vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) [09:27:32] (03CR) 10CI reject: [V:04-1] vrts: ensure rsync auto restart only 
on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) (owner: 10Dzahn) [09:28:23] (03PS2) 10Dzahn: vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) [09:29:40] 06SRE, 06Infrastructure-Foundations, 07Epic: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452 (10elukey) 03NEW [09:29:42] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1236674" [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:29:46] I am doing the train [09:30:25] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236675 (https://phabricator.wikimedia.org/T413805) [09:30:28] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236675 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [09:31:44] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236675 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [09:32:39] (03PS3) 10Dzahn: vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) [09:34:03] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cumin2003 - https://phabricator.wikimedia.org/T416385#11582293 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:34:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1251.eqiad.wmnet with reason: Maintenance [09:34:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T415786)', diff saved to https://phabricator.wikimedia.org/P88634 and previous config saved to 
/var/cache/conftool/dbconfig/20260204-093421-marostegui.json [09:34:25] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:34:30] (03PS1) 10Muehlenhoff: Add site.pp/preseed for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1236676 (https://phabricator.wikimedia.org/T461385) [09:34:34] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1236674/7972/" [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) (owner: 10Dzahn) [09:34:46] (03PS2) 10Elukey: services: upgrade thumbor's haproxy container to Bookworm and 2.8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236673 (https://phabricator.wikimedia.org/T416452) [09:34:52] (03CR) 10Jelto: [C:03+1] vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) (owner: 10Dzahn) [09:35:13] (03CR) 10Dzahn: [C:03+2] vrts: ensure rsync auto restart only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1236674 (https://phabricator.wikimedia.org/T416449) (owner: 10Dzahn) [09:37:53] (03Abandoned) 10Slyngshede: Meta IP location changes [dns] - 10https://gerrit.wikimedia.org/r/1216806 (owner: 10Slyngshede) [09:37:54] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.14 refs T413805 [09:37:57] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [09:38:20] !log installing openssl security updates [09:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:53] (03CR) 10Muehlenhoff: [C:03+2] Add Cumin alias for crm [puppet] - 10https://gerrit.wikimedia.org/r/1236238 (owner: 10Muehlenhoff) [09:41:35] (03CR) 10Dzahn: [C:03+2] ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 
10Dzahn) [09:43:15] (03CR) 10Elukey: [C:03+1] Add site.pp/preseed for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1236676 (https://phabricator.wikimedia.org/T461385) (owner: 10Muehlenhoff) [09:47:26] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting for auth and bot (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) [09:49:31] (03CR) 10Muehlenhoff: [C:03+2] Add site.pp/preseed for cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1236676 (https://phabricator.wikimedia.org/T461385) (owner: 10Muehlenhoff) [09:54:52] I am going to restart Jenkins instances [09:55:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T415786)', diff saved to https://phabricator.wikimedia.org/P88635 and previous config saved to /var/cache/conftool/dbconfig/20260204-095510-marostegui.json [09:55:14] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:56:53] * hashar waits for job to complete [09:59:26] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [09:59:44] (03CR) 10Jelto: [V:03+1] "I591dcb36570281234854fb3cdb90fc3386ce87a9 adds general support to set the QoS to low optionally but the default is unchanged. This change " [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [09:59:44] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cumin2003 - https://phabricator.wikimedia.org/T416385#11582363 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Jhancock.wm >>! In T416385#11580539, @Jhancock.wm wrote: > @MoritzMuehlenhoff > when you or someone you can delegate this to can, could you fil... 
[10:01:54] (03CR) 10Vgutierrez: [C:03+1] varnish: set Retry-After for cli_tool, wdqs and library policies [puppet] - 10https://gerrit.wikimedia.org/r/1230937 (https://phabricator.wikimedia.org/T415375) (owner: 10Fabfur) [10:02:31] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06ServiceOps new, and 6 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451#11582371 (10MoritzMuehlenhoff) The conf* servers are tricky to reboot, they've been often skipped in the past (as visible by th... [10:06:02] !log Restarting Gerrit instances [10:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [10:06:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T415786)', diff saved to https://phabricator.wikimedia.org/P88636 and previous config saved to /var/cache/conftool/dbconfig/20260204-100638-marostegui.json [10:06:42] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [10:06:57] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062#11582392 (10MoritzMuehlenhoff) @Reedy Due to an oversight "Bitu-account-managers" was only requesteable on the test instance for Bitu. This has now been fixed, please request it on http... [10:09:08] stopping it again [10:09:32] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06ServiceOps new, and 6 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451#11582397 (10Blake) Hey Moritz, thanks, that makes sense. Does that mean we'd only reboot the codfw hosts, as eqiad will be the... 
[10:10:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P88637 and previous config saved to /var/cache/conftool/dbconfig/20260204-101018-marostegui.json [10:10:42] !log Gerrit is back [10:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:11:19] !log Restarted CI Jenkins [10:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:31] (03PS1) 10Elukey: admin: add user ggalofre to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1236686 (https://phabricator.wikimedia.org/T415172) [10:14:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1236686 (https://phabricator.wikimedia.org/T415172) (owner: 10Elukey) [10:15:40] I am getting a coffee break and I'll check the logs [10:16:03] (03CR) 10Elukey: [C:03+2] admin: add user ggalofre to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1236686 (https://phabricator.wikimedia.org/T415172) (owner: 10Elukey) [10:16:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:16:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - 
https://phabricator.wikimedia.org/T414725#11582410 (10jcrespo) >>! In T414725#11580692, @Jclark-ctr wrote: > @jcrespo with eLukey and Topranks help we where able to get it to start imaging... [10:16:47] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 07Epic, 13Patch-For-Review: Migrate Docker images running in Production away from Bullseye - https://phabricator.wikimedia.org/T416452#11582411 (10Gehel) [10:22:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11582435 (10elukey) 05In progress→03Resolved Data access is propagating now, it will be available in ~30 mins. Going to close, please reopen and/... [10:25:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P88638 and previous config saved to /var/cache/conftool/dbconfig/20260204-102527-marostegui.json [10:27:59] ah metawiki OAuth fails with `Key cannot be empty` [10:28:00] pff [10:29:25] !log Rolling back to group0 due to an issue with OAuth on metawiki # T413805 [10:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:28] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [10:29:42] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236687 (https://phabricator.wikimedia.org/T413805) [10:29:45] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236687 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [10:30:36] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236687 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [10:31:46] T416456 [10:31:47] T416456: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty 
(/w/rest.php/oauth2/access_token) - https://phabricator.wikimedia.org/T416456 [10:33:07] (03CR) 10Ayounsi: [C:03+1] Add Nokia BGP routing policy for wikikube-worker / k8s hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1229562 (https://phabricator.wikimedia.org/T408757) (owner: 10Cathal Mooney) [10:33:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:36:42] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.14 refs T413805 [10:36:46] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [10:36:50] (03CR) 10Dzahn: [C:03+1] "oh yea, my comment was about a previous PS then" [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [10:37:16] (03PS1) 10Elukey: install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) [10:38:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:38:18] (03PS2) 10Elukey: install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) [10:38:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:39:12] !log upgrade cloudcumin2001 to bookworm T403153 [10:39:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:15] T403153: Upgrade cloudcumin hosts to bookworm/trixie - https://phabricator.wikimedia.org/T403153 [10:39:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11582518 (10elukey) @jcrespo Hi! I guess you refer to https://wikitech.wikimedia.org/wiki/UEFI_Boot, we can definitely add more docs together if you have time. I file... [10:40:14] (03CR) 10CI reject: [V:04-1] install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [10:40:34] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting for auth and bot (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) [10:40:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T415786)', diff saved to https://phabricator.wikimedia.org/P88639 and previous config saved to /var/cache/conftool/dbconfig/20260204-104035-marostegui.json [10:40:39] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [10:41:09] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:41:48] (03PS3) 10Fabfur: cache::upload: enable global ratelimiting for bot (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) [10:42:44] (03CR) 10Alexandros Kosiaris: [C:04-1] "Left a couple of comments for the commit message. Simply put it reads like slop and doesn't represent what the patch does. 
The change itse" [puppet] - 10https://gerrit.wikimedia.org/r/1222271 (https://phabricator.wikimedia.org/T201491) (owner: 10Divyaratann Srivastava) [10:43:15] (03PS3) 10Elukey: install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) [10:44:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet [10:45:48] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:47:03] !log installing openjdk-17 security updates [10:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:32] (03CR) 10Fabfur: [C:03+2] varnish: set Retry-After for cli_tool, wdqs and library policies [puppet] - 10https://gerrit.wikimedia.org/r/1230937 (https://phabricator.wikimedia.org/T415375) (owner: 10Fabfur) [10:48:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet [10:52:29] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11582550 (10jijiki) [10:52:50] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11582551 (10jijiki) [10:53:19] (03CR) 10Alexandros Kosiaris: [C:03+1] docker_registry: move /v2/restricted to the s3 restricted backend [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [10:54:14] (03CR) 10Alexandros Kosiaris: [C:03+1] ferm: Only collect resources when ensure is present [puppet] - 10https://gerrit.wikimedia.org/r/1214549 (owner: 10Majavah) [10:55:13] (03CR) 10Vgutierrez: [C:03+1] "VTCs are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [10:55:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE 
(2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11582558 (10Gehel) a:05Jclark-ctr→03BTullis [10:56:43] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11582569 (10Gehel) With the various investigations that have happened around Airflow, do we now have a... [10:59:05] (03PS1) 10Kosta Harlan: IPReputationIPoidDataLookup: Allow returning stale values for 72 hours [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236689 (https://phabricator.wikimedia.org/T416316) [10:59:19] (03PS1) 10Kosta Harlan: IPReputationIPoidDataLookup: Allow returning stale values for 72 hours [extensions/IPReputation] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236690 (https://phabricator.wikimedia.org/T416316) [10:59:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/IPReputation] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236690 (https://phabricator.wikimedia.org/T416316) (owner: 10Kosta Harlan) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1100) [11:00:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236689 (https://phabricator.wikimedia.org/T416316) (owner: 10Kosta Harlan) [11:02:58] (03PS1) 10Zabe: Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236692 (https://phabricator.wikimedia.org/T416456) [11:03:40] (03PS4) 10Elukey: 
install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) [11:04:18] (03PS1) 10Zabe: Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [vendor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236693 (https://phabricator.wikimedia.org/T416456) [11:04:52] (03PS2) 10Zabe: Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236692 (https://phabricator.wikimedia.org/T416456) [11:05:52] (03CR) 10Hnowlan: [C:03+1] services: upgrade thumbor's haproxy container to Bookworm and 2.8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236673 (https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [11:06:03] (03CR) 10Hnowlan: [C:03+1] "Thanks for doing this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236673 (https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [11:06:08] (03CR) 10Alexandros Kosiaris: "Adding effie for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1229229 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [11:07:11] (03PS5) 10Elukey: install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) [11:08:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T415786)', diff saved to https://phabricator.wikimedia.org/P88640 and previous config saved to /var/cache/conftool/dbconfig/20260204-110829-marostegui.json [11:08:33] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:11:35] (03PS1) 10Majavah: hieradata: codfw1dev: Add bastion-codfw1dev-06 to cumin_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1236694 [11:12:06] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1236694 (owner: 10Majavah) [11:12:16] (03CR) 10FNegri: [C:03+1] hieradata: codfw1dev: Add bastion-codfw1dev-06 to cumin_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1236694 (owner: 10Majavah) [11:12:16] (03CR) 10Majavah: [C:03+2] hieradata: codfw1dev: Add bastion-codfw1dev-06 to cumin_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1236694 (owner: 10Majavah) [11:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [11:14:54] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting for bot (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1236679 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:15:03] (03CR) 10Jcrespo: [C:03+1] install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [11:15:54] (03CR) 10Elukey: [C:03+2] 
install_server: add UEFI partman recipe for backup1015 [puppet] - 10https://gerrit.wikimedia.org/r/1236688 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [11:18:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P88641 and previous config saved to /var/cache/conftool/dbconfig/20260204-111837-marostegui.json [11:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:21:15] (03PS1) 10Majavah: hieradata: codfw1dev: Remove old bastions [puppet] - 10https://gerrit.wikimedia.org/r/1236696 [11:22:28] (03CR) 10FNegri: [C:03+1] hieradata: codfw1dev: Remove old bastions [puppet] - 10https://gerrit.wikimedia.org/r/1236696 (owner: 10Majavah) [11:22:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:23:33] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:24:10] (03CR) 10Majavah: [C:03+2] hieradata: codfw1dev: Remove old bastions [puppet] - 10https://gerrit.wikimedia.org/r/1236696 (owner: 10Majavah) [11:25:12] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:26:56] !log jynus@cumin1003 START - Cookbook sre.hosts.reimage for host backup1015.eqiad.wmnet with OS trixie [11:27:12] 10ops-eqiad, 
06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11582703 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1003 for host backup1015.eqiad.wmnet with OS trixie [11:28:03] (03PS3) 10Majavah: ferm: Only collect resources when ensure is present [puppet] - 10https://gerrit.wikimedia.org/r/1214549 [11:28:03] (03PS6) 10Majavah: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) [11:28:03] (03PS7) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650 [11:28:03] (03PS7) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [11:28:04] (03PS7) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [11:28:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P88642 and previous config saved to /var/cache/conftool/dbconfig/20260204-112846-marostegui.json [11:29:34] jouncebot: nowandnext [11:29:34] For the next 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1100) [11:29:34] In 0 hour(s) and 30 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1200) [11:29:47] ah lovely, I can deploy thumbor [11:30:30] (03CR) 10Elukey: [C:03+2] services: upgrade thumbor's haproxy container to Bookworm and 2.8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236673 (https://phabricator.wikimedia.org/T416452) (owner: 10Elukey) [11:32:10] !log elukey@deploy2002 helmfile [staging] START 
helmfile.d/services/thumbor: sync [11:32:14] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7974/console" [puppet] - 10https://gerrit.wikimedia.org/r/1214549 (owner: 10Majavah) [11:32:20] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [11:34:07] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7975/console" [puppet] - 10https://gerrit.wikimedia.org/r/1214549 (owner: 10Majavah) [11:34:33] (03CR) 10Majavah: [V:03+1 C:03+2] ferm: Only collect resources when ensure is present [puppet] - 10https://gerrit.wikimedia.org/r/1214549 (owner: 10Majavah) [11:35:07] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [11:35:33] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [11:38:10] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync [11:38:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:38:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T415786)', diff saved to https://phabricator.wikimedia.org/P88643 and previous config saved to /var/cache/conftool/dbconfig/20260204-113854-marostegui.json [11:38:57] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:39:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [11:39:25] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [11:40:04] looks good on eqiad [11:41:50] !log jynus@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on backup1015.eqiad.wmnet with reason: host reimage [11:42:27] !log installing openjdk-11 security updates [11:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:51] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, 07Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11582768 (10Blake) [11:42:54] 06SRE, 06Data-Persistence, 06ServiceOps new, 07Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907#11582770 (10Blake) [11:43:04] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting for bot (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1236701 (https://phabricator.wikimedia.org/T406545) [11:43:06] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting for bot (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1236702 (https://phabricator.wikimedia.org/T406545) [11:43:08] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting for bot (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1236703 (https://phabricator.wikimedia.org/T406545) [11:43:10] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting for bot (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1236704 (https://phabricator.wikimedia.org/T406545) [11:43:12] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting for bot (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1236705 (https://phabricator.wikimedia.org/T406545) [11:43:14] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting for bot (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1236706 (https://phabricator.wikimedia.org/T406545) [11:43:51] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236701 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [11:45:35] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 2:00:00 on backup1015.eqiad.wmnet with reason: host reimage [11:45:41] (03CR) 10Hnowlan: [C:03+1] rediscope: lower cpu and memoy limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233161 (owner: 10Daniel Kinzler) [11:47:07] (03PS7) 10Majavah: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) [11:47:07] (03PS8) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650 [11:47:07] (03PS8) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [11:47:07] (03PS8) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [11:47:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T415786)', diff saved to https://phabricator.wikimedia.org/P88644 and previous config saved to /var/cache/conftool/dbconfig/20260204-114718-marostegui.json [11:47:22] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:49:52] (03PS8) 10Majavah: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) [11:49:52] (03PS9) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650 [11:49:52] (03PS9) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [11:49:52] (03PS9) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [11:49:53] (03PS1) 10Majavah: nftables::service: Fix variable reference [puppet] - 
10https://gerrit.wikimedia.org/r/1236707 [11:55:12] (03PS1) 10Jcrespo: install_server: Prevent reimage of backup1015 and setup all other new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1236708 (https://phabricator.wikimedia.org/T414725) [11:59:32] (03CR) 10Jcrespo: "I want to do some double checks before merging, but please Luca, have a look in case you see something wrong." [puppet] - 10https://gerrit.wikimedia.org/r/1236708 (https://phabricator.wikimedia.org/T414725) (owner: 10Jcrespo) [12:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1200) [12:00:53] (03CR) 10Cathal Mooney: [C:03+2] Add Nokia BGP routing policy for wikikube-worker / k8s hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1229562 (https://phabricator.wikimedia.org/T408757) (owner: 10Cathal Mooney) [12:02:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P88645 and previous config saved to /var/cache/conftool/dbconfig/20260204-120227-marostegui.json [12:02:37] (03Merged) 10jenkins-bot: Add Nokia BGP routing policy for wikikube-worker / k8s hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1229562 (https://phabricator.wikimedia.org/T408757) (owner: 10Cathal Mooney) [12:03:41] (03PS5) 10Cathal Mooney: prepend_as_out: switch outbound policy rather than modify existing [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) [12:05:12] (03CR) 10Elukey: [C:03+1] "One less custom recipe \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1236708 (https://phabricator.wikimedia.org/T414725) (owner: 10Jcrespo) [12:05:38] (03PS1) 10Mvolz: Retry update of zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236710 [12:05:44] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: set owner and notify zookeeper service with pki::get_cert [puppet] - 
10https://gerrit.wikimedia.org/r/1236386 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [12:06:20] (03Abandoned) 10Alexandros Kosiaris: monitoring: Harmonize check naming to a common set of rules [puppet] - 10https://gerrit.wikimedia.org/r/448503 (owner: 10Jcrespo) [12:06:42] !log jynus@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jynus@cumin1003" [12:07:19] !log jynus@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jynus@cumin1003" [12:07:20] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1015.eqiad.wmnet with OS trixie [12:07:27] (03Abandoned) 10Alexandros Kosiaris: mesh.configuration: fix compliace with spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048440 (owner: 10Giuseppe Lavagetto) [12:07:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11582838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1003 for host backup1015.eqiad.wmnet with OS trixie completed: - backup1015... [12:07:50] (03Abandoned) 10Alexandros Kosiaris: mediawiki::web::yaml_defs: remove hack for static files [puppet] - 10https://gerrit.wikimedia.org/r/1052129 (owner: 10Giuseppe Lavagetto) [12:09:40] (03CR) 10Jcrespo: "I will setup the extra HDs before merging, I think it worked as expected, but in case there is some tunning necessary (e.g. 
logically look" [puppet] - 10https://gerrit.wikimedia.org/r/1236708 (https://phabricator.wikimedia.org/T414725) (owner: 10Jcrespo) [12:11:20] (03CR) 10Mvolz: [C:03+2] Retry update of zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236710 (owner: 10Mvolz) [12:12:57] (03Merged) 10jenkins-bot: Retry update of zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236710 (owner: 10Mvolz) [12:13:35] (03Abandoned) 10Alexandros Kosiaris: Remove obsolete dummy certs for api and parsoid [labs/private] - 10https://gerrit.wikimedia.org/r/1059029 (https://phabricator.wikimedia.org/T360636) (owner: 10Clément Goubert) [12:14:23] (03CR) 10Cathal Mooney: [C:03+2] prepend_as_out: switch outbound policy rather than modify existing [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [12:14:29] (03CR) 10Alexandros Kosiaris: [C:03+1] "I don't think we ever moved again with this at the end. Do we want to?" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T339863) (owner: 10Hnowlan) [12:15:42] (03CR) 10Alexandros Kosiaris: [C:04-1] "What Pppery says." [software] - 10https://gerrit.wikimedia.org/r/1224788 (https://phabricator.wikimedia.org/T201491) (owner: 10Ebenezer Rao) [12:15:42] (03Merged) 10jenkins-bot: prepend_as_out: switch outbound policy rather than modify existing [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [12:16:19] (03CR) 10Alexandros Kosiaris: [C:03+2] cxserver: update chart metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218270 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [12:17:08] (03CR) 10Hnowlan: "I think this needs further verification rather than this hacky fix. 
I'll abandon and leave the issue open for investigation" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T339863) (owner: 10Hnowlan) [12:17:17] (03Abandoned) 10Hnowlan: poolcounter: introduce allowlist to skip rate limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T339863) (owner: 10Hnowlan) [12:17:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P88646 and previous config saved to /var/cache/conftool/dbconfig/20260204-121735-marostegui.json [12:18:13] (03Merged) 10jenkins-bot: cxserver: update chart metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218270 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [12:18:32] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [12:18:54] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [12:21:21] General heads up to SREs: I'm about to reapply a change which last time caused performance issues. It contains a patch that hopefully fixes it, but I'm not 100% confident the patch fixes it entirely, so there's potential it'll make zotero alert again. [12:21:45] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [12:22:16] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [12:22:27] Mvolz: would you please relay this message to #wikimedia-sre?
[12:23:03] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: OpenJDK 11 security updates - jmm@cumin2002 [12:23:53] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1021.eqiad.wmnet with OS bullseye [12:24:06] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11582869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1021.eqiad.wmnet with OS bullseye [12:25:03] done! [12:25:11] (03CR) 10Dzahn: [C:03+2] zuul: move cert paths to role level, drop host-name based config [puppet] - 10https://gerrit.wikimedia.org/r/1236390 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [12:25:22] Err, message sent, not done with deployment. [12:28:49] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! Thanks for working on this." [puppet] - 10https://gerrit.wikimedia.org/r/1236243 (owner: 10Muehlenhoff) [12:29:14] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [12:30:16] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 95): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7977/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1236707 (owner: 10Majavah) [12:31:51] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [12:32:21] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [12:32:30] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1236707 (owner: 10Majavah) [12:32:41] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11582903 (10cmooney) RIPEstat looks good in terms of visibility of the new ns2 Anycast prefix: {F71670832 width=400} [12:32:41] (03CR) 10Majavah: [V:03+1 C:03+2] nftables::service: Fix variable reference [puppet] - 10https://gerrit.wikimedia.org/r/1236707 (owner: 10Majavah) [12:32:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T415786)', diff saved to https://phabricator.wikimedia.org/P88647 and previous config saved to /var/cache/conftool/dbconfig/20260204-123243-marostegui.json [12:32:47] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [12:33:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [12:33:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T415786)', diff saved to https://phabricator.wikimedia.org/P88648 and previous config saved to /var/cache/conftool/dbconfig/20260204-123308-marostegui.json [12:33:31] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [12:40:09] 06SRE, 10Prod-Kubernetes, 06ServiceOps new, 06SRE 
Observability (FY2025/2026-Q3): write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11582946 (10MLechvien-WMF) #sre_observability were you able to look into this? [12:40:13] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1021.eqiad.wmnet with reason: host reimage [12:40:41] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:40:45] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:40:49] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:41:28] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:41:36] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:41:42] 06SRE, 06serviceops, 07Datacenter-Switchover: Investigate failed maintenance jobs discovered during DC switchback - https://phabricator.wikimedia.org/T335409#11582959 (10Blake) 05Open→03Invalid Given that the maintenance servers are going to be decommissioned (https://wikitech.wikimedia.org/wiki/Main... 
[12:42:30] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1023.eqiad.wmnet with OS bullseye [12:42:31] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1024.eqiad.wmnet with OS bullseye [12:42:38] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11582976 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye [12:42:39] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11582977 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye [12:42:43] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:43:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: OpenJDK 11 security updates - jmm@cumin2002 [12:43:13] (03CR) 10Hnowlan: [C:03+1] "lgtm, one note, one nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233154 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [12:44:16] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: OpenJDK 11 security updates - jmm@cumin2002 [12:44:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1021.eqiad.wmnet with reason: host reimage [12:44:34] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11582983 (10Jclark-ctr) [12:44:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11582985 (10Jclark-ctr) a:05MatthewVernon→03Jclark-ctr [12:44:59] 06SRE, 10SRE-tools, 
06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Expose hosts from MysqlLegacyRemoteHosts in spicerack - https://phabricator.wikimedia.org/T328911#11582986 (10Blake) [12:46:13] (03CR) 10Hnowlan: [C:03+1] rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [12:47:15] 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence (work done), 06ServiceOps new, 07Datacenter-Switchover, 07Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304#11582991 (10Blake) [12:53:00] !log remove legacy eventstreams-internal discovery certificate T365798 [12:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:03] T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 [12:53:09] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11583052 (10Jclark-ctr) a:03Jclark-ctr [12:54:03] (03PS3) 10Ryan Kemper: opensearch-semantic-search-test: depl eqiad, codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234594 (https://phabricator.wikimedia.org/T414691) [12:54:09] 06SRE, 06Data-Persistence, 06ServiceOps new, 07Datacenter-Switchover, 07Epic: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907#11583063 (10Blake) [12:54:37] 06SRE, 06Data-Persistence, 06ServiceOps new, 07Datacenter-Switchover, 07Epic: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907#11583067 (10Blake) a:05Blake→03None [12:55:15] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1236103 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [12:56:37] (03CR) 10Kevin Bazira: [C:03+2] ml-services: Cap maxReplicas at 7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236103 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [12:56:44] (03PS4) 10Brouberol: opensearch-semantic-search-test: depl eqiad, codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234594 (https://phabricator.wikimedia.org/T414691) (owner: 10Ryan Kemper) [12:58:33] (03Merged) 10jenkins-bot: ml-services: Cap maxReplicas at 7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236103 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [12:58:37] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1024.eqiad.wmnet with reason: host reimage [12:59:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:00:00] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1023.eqiad.wmnet with OS bullseye [13:00:12] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583077 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye executed with errors: - ms-fe1023 (**FAIL**) - Remove... 
[13:00:21] !log remove legacy wdqs-internal discovery certificate T365798 [13:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:24] T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 [13:00:45] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1023.eqiad.wmnet with OS bullseye [13:00:49] !log jclark@cumin1003 START - Cookbook sre.hosts.move-vlan for host ms-fe1023 [13:00:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1023 [13:00:52] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583099 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye [13:02:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: OpenJDK 11 security updates - jmm@cumin2002 [13:03:58] (03PS1) 10Kevin Bazira: ml-services: update rr-wikidata prod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236723 (https://phabricator.wikimedia.org/T414060) [13:04:16] RESOLVED: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:05:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1024.eqiad.wmnet with reason: host reimage [13:06:22] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: OpenJDK 11 security updates - jmm@cumin2002 [13:07:47] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:10:51] 
jclark@cumin1003 reimage (PID 2527779) is awaiting input [13:10:54] (03PS1) 10Federico Ceratto: mysql: rename newpool cookbook to pool [cookbooks] - 10https://gerrit.wikimedia.org/r/1236726 (https://phabricator.wikimedia.org/T383674) [13:12:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:12:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1021.eqiad.wmnet with OS bullseye [13:12:12] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583168 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1021.eqiad.wmnet with OS bullseye completed: - ms-fe1021 (**WARN**) - Removed from Pupp... [13:12:44] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583169 (10Jclark-ctr) [13:13:00] (03PS6) 10Cathal Mooney: gitlab: set qos to low in rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [13:13:55] (03PS7) 10Cathal Mooney: gitlab: set qos to low in rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [13:14:57] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [13:16:35] (03PS1) 10Brouberol: airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) [13:17:01] (03PS2) 10Brouberol: airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) [13:17:12] (03CR) 10CI reject: [V:04-1] airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [13:17:42] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [13:19:54] (03PS1) 10Dzahn: zookeeper/zuul: use standard port 2281 for TLS secureClientPort [puppet] - 10https://gerrit.wikimedia.org/r/1236730 (https://phabricator.wikimedia.org/T405119) [13:20:58] (03CR) 10Gkyziridis: [C:03+1] "Thnx for deploying." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236723 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [13:23:31] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11583201 (10Jclark-ctr) [13:24:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: OpenJDK 11 security updates - jmm@cumin2002 [13:25:25] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update rr-wikidata prod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236723 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [13:25:44] (03CR) 10Jelto: [C:03+1] "looks good to me, the additional `firewall::client` looks reasonable for the GitLab-specific use-case of uploading to rsyncd replica hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [13:26:00] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:26:16] (03PS1) 10Dzahn: zookeeper: set keystore format to PKCS12 when enabling TLS (for zuul) [puppet] - 10https://gerrit.wikimedia.org/r/1236735 (https://phabricator.wikimedia.org/T405119) [13:27:07] (03CR) 10Hashar: "That solved it, the `/` partition has more space now:" [puppet] - 10https://gerrit.wikimedia.org/r/1234269 (owner: 10Arnaudb) 
[13:27:18] (03Merged) 10jenkins-bot: ml-services: update rr-wikidata prod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236723 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [13:28:35] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:29:24] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:29:25] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:30:52] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:31:31] (03PS1) 10TheDJ: Fix audio transcodes [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236739 [13:31:40] jclark@cumin1003 reimage (PID 2528735) is awaiting input [13:33:35] (03PS1) 10Joal: Add turnilo-next and turnilo to wmnet/wm.org [dns] - 10https://gerrit.wikimedia.org/r/1236740 (https://phabricator.wikimedia.org/T416115) [13:37:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [13:37:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1024.eqiad.wmnet with OS bullseye [13:37:18] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583254 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye completed: - ms-fe1024 (**WARN**) - Removed from Pupp... 
[13:37:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583255 (10Jclark-ctr) [13:40:20] (03PS1) 10Majavah: hieradata: Fix cumin SSH allowlist for codfw1dev bastions [puppet] - 10https://gerrit.wikimedia.org/r/1236742 [13:40:37] jclark@cumin1003 reimage (PID 2533601) is awaiting input [13:40:43] (03CR) 10Daniel Kinzler: redioscope: enable time bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [13:41:11] (03CR) 10Majavah: [C:03+2] hieradata: Fix cumin SSH allowlist for codfw1dev bastions [puppet] - 10https://gerrit.wikimedia.org/r/1236742 (owner: 10Majavah) [13:44:33] (03PS1) 10Majavah: base: kernel: Fix legacy fact usage [puppet] - 10https://gerrit.wikimedia.org/r/1236743 [13:44:43] (03CR) 10Hashar: [C:03+2] Fix audio transcodes [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236739 (owner: 10TheDJ) [13:46:26] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:46:36] (03Merged) 10jenkins-bot: Fix audio transcodes [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236739 (owner: 10TheDJ) [13:46:54] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1022.eqiad.wmnet with OS bullseye [13:47:03] (03PS2) 10Majavah: base: kernel: Drop R320 overrides [puppet] - 10https://gerrit.wikimedia.org/r/1236743 [13:47:08] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583283 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1022.eqiad.wmnet with OS bullseye [13:49:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1236743 (owner: 10Majavah) [13:49:54] (03CR) 
10Majavah: [C:03+2] base: kernel: Drop R320 overrides [puppet] - 10https://gerrit.wikimedia.org/r/1236743 (owner: 10Majavah) [13:50:45] (03CR) 10Brouberol: Add turnilo-next and turnilo to wmnet/wm.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1236740 (https://phabricator.wikimedia.org/T416115) (owner: 10Joal) [13:51:10] (03PS1) 10Jelto: gerrit: add GerritHaProxy* alerts [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) [13:52:03] (03CR) 10Jelto: [C:03+2] Add an option to the flag generated firewall rules with low QoS [puppet] - 10https://gerrit.wikimedia.org/r/1236243 (owner: 10Muehlenhoff) [13:52:07] (03CR) 10Jelto: [C:03+2] gitlab: set qos to low in rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [13:52:40] !log disable nrpe2nodexp check for ferm on cloudcumin* [13:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:50] (03CR) 10CI reject: [V:04-1] gerrit: add GerritHaProxy* alerts [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) (owner: 10Jelto) [13:53:51] (03PS1) 10Kosta Harlan: Add client.tag_metadata_categories field support [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236747 (https://phabricator.wikimedia.org/T414571) [13:55:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11583321 (10Jclark-ctr) 05Open→03Resolved Thanks @jcrespo updating preseed file and kicking off the imaging! 
[13:55:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11583324 (10Jclark-ctr) [13:56:34] (03PS2) 10Jelto: gerrit: add GerritHaProxy* alerts [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) [13:57:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11583332 (10cmooney) In terms of the moves here, what makes sense from the netops point of view is to tackle in this order: # Move both pfw1 units ## C... [13:58:13] (03PS1) 10Urbanecm: DatabaseUserImpactStore: log attempts to save zero pageviews values [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236748 (https://phabricator.wikimedia.org/T414080) [13:58:39] (03PS1) 10Urbanecm: DatabaseUserImpactStore: log attempts to save zero pageviews values [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236749 (https://phabricator.wikimedia.org/T414080) [13:58:50] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11583336 (10jcrespo) [13:58:54] kostajh: i see we both have backports, if you're OK, I can +2 both to start CI? [13:59:44] also i see hashar is deploying something? [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1400). nyaa~ [14:00:05] kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:06] oh my bad yes [14:00:07] sorry [14:00:16] that is a patch for TimedMediaHandler which I have +2ed [14:00:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236747 (https://phabricator.wikimedia.org/T414571) (owner: 10Kosta Harlan) [14:00:24] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/1236739 [14:00:30] hello [14:00:34] it is a straightforward patch that can be rolled with other patches [14:00:36] urbanecm: yes, go ahead [14:00:39] though it is already merged [14:00:43] (sorry) [14:00:44] (03CR) 10Urbanecm: [C:03+2] DatabaseUserImpactStore: log attempts to save zero pageviews values [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236749 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [14:00:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:00:47] (03CR) 10Urbanecm: [C:03+2] DatabaseUserImpactStore: log attempts to save zero pageviews values [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236748 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [14:00:57] (03CR) 10Urbanecm: [C:03+2] IPReputationIPoidDataLookup: Allow returning stale values for 72 hours [extensions/IPReputation] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236690 (https://phabricator.wikimedia.org/T416316) (owner: 10Kosta Harlan) [14:00:58] (03CR) 10Urbanecm: [C:03+2] IPReputationIPoidDataLookup: Allow returning stale values for 72 hours [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236689
(https://phabricator.wikimedia.org/T416316) (owner: 10Kosta Harlan) [14:01:12] urbanecm: there's a third one here https://gerrit.wikimedia.org/r/c/1236747/ [14:01:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11583343 (10jcrespo) [14:01:20] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:01:55] kostajh: hmm, let's not test the limits and do it in ~two batches? [14:02:09] (03CR) 10Dzahn: gerrit: add GerritHaProxy* alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) (owner: 10Jelto) [14:02:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724#11583345 (10jcrespo) [14:02:27] (03PS1) 10Elukey: installserver: fix preseed config for ms-fe102[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1236750 (https://phabricator.wikimedia.org/T416245) [14:03:33] (03PS3) 10Jelto: gerrit: add GerritHaProxy* alerts [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) [14:03:49] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:03:51] (03CR) 10Jelto: gerrit: add GerritHaProxy* alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) (owner: 10Jelto) [14:04:12] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1236750 (https://phabricator.wikimedia.org/T416245) (owner: 10Elukey) [14:04:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[14:04:44] Deployment zotero-production in zotero at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=zotero&var-deployment=zotero-production - ... [14:04:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:05:40] urbanecm: sure [14:06:02] (03CR) 10Dzahn: [C:03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) (owner: 10Jelto) [14:06:25] (03CR) 10Elukey: [C:03+2] installserver: fix preseed config for ms-fe102[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1236750 (https://phabricator.wikimedia.org/T416245) (owner: 10Elukey) [14:06:34] !log installing php7.4 security updates [14:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:14] (03CR) 10Ssingh: [C:03+1] "Thanks for the patch. I will take care of the deployment if that's fine because I want to disable Puppet on A:dnsbox and test on a single " [puppet] - 10https://gerrit.wikimedia.org/r/1228560 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [14:08:48] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1022.eqiad.wmnet with OS bullseye [14:09:03] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1022.eqiad.wmnet with OS bullseye executed with errors: - ms-fe1022... 
[14:10:00] (03CR) 10CI reject: [V:04-1] DatabaseUserImpactStore: log attempts to save zero pageviews values [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236748 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [14:10:21] perfect :( [14:11:30] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, 07Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11583373 (10Blake) The way I'm considering going about this would be to create a switchover lock o... [14:12:52] (03Merged) 10jenkins-bot: DatabaseUserImpactStore: log attempts to save zero pageviews values [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236749 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [14:13:32] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1023.eqiad.wmnet with OS bullseye [14:13:45] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye executed with errors: - ms-fe1023... 
[14:14:19] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1022.eqiad.wmnet with OS bullseye [14:14:19] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1023.eqiad.wmnet with OS bullseye [14:14:23] !log jclark@cumin1003 START - Cookbook sre.hosts.move-vlan for host ms-fe1023 [14:14:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1023 [14:14:32] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583389 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1022.eqiad.wmnet with OS bullseye [14:14:39] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583390 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye [14:14:46] (03Merged) 10jenkins-bot: DatabaseUserImpactStore: log attempts to save zero pageviews values [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236748 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [14:14:47] (03Merged) 10jenkins-bot: IPReputationIPoidDataLookup: Allow returning stale values for 72 hours [extensions/IPReputation] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236690 (https://phabricator.wikimedia.org/T416316) (owner: 10Kosta Harlan) [14:14:49] (03Merged) 10jenkins-bot: IPReputationIPoidDataLookup: Allow returning stale values for 72 hours [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236689 (https://phabricator.wikimedia.org/T416316) (owner: 10Kosta Harlan) [14:14:57] (03CR) 10Ssingh: [C:03+2] DNS: Enable Bird 2.18 for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1228560 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) 
[14:15:17] !logsudo cumin "A:dnsbox" "disable-puppet 'merging CR 1228560'" [14:15:21] !log sudo cumin "A:dnsbox" "disable-puppet 'merging CR 1228560'" [14:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] starting scap [14:16:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T415786)', diff saved to https://phabricator.wikimedia.org/P88650 and previous config saved to /var/cache/conftool/dbconfig/20260204-141559-marostegui.json [14:16:03] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [14:16:13] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1236739|Fix audio transcodes]], [[gerrit:1236749|DatabaseUserImpactStore: log attempts to save zero pageviews values (T414080)]], [[gerrit:1236748|DatabaseUserImpactStore: log attempts to save zero pageviews values (T414080)]], [[gerrit:1236690|IPReputationIPoidDataLookup: Allow returning stale values for 72 hours (T416316)]], [[gerrit:1236689|IPReput [14:16:13] ationIPoidDataLookup: Allow returning stale values for 72 hours (T416316)]] [14:16:14] (03CR) 10Vgutierrez: [C:03+1] "VTCs are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1236701 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:16:17] T414080: x1 increase in writes results in a large increase of binlog files (over 2000) - https://phabricator.wikimedia.org/T414080 [14:16:18] T416316: IPReputation: Improve caching logic to handle backend downtime - https://phabricator.wikimedia.org/T416316 [14:16:55] (03PS3) 10Daniel Kinzler: redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 [14:17:01] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting for bot (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1236701 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [14:17:10] (03PS2) 10Daniel Kinzler: rediscope: lower cpu and memoy 
limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233161 [14:17:20] (03PS4) 10Daniel Kinzler: redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 [14:18:20] (03CR) 10Jcrespo: [C:03+2] install_server: Prevent reimage of backup1015 and setup all other new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1236708 (https://phabricator.wikimedia.org/T414725) (owner: 10Jcrespo) [14:18:26] !log urbanecm@deploy2002 hartman, kharlan, urbanecm: Backport for [[gerrit:1236739|Fix audio transcodes]], [[gerrit:1236749|DatabaseUserImpactStore: log attempts to save zero pageviews values (T414080)]], [[gerrit:1236748|DatabaseUserImpactStore: log attempts to save zero pageviews values (T414080)]], [[gerrit:1236690|IPReputationIPoidDataLookup: Allow returning stale values for 72 hours (T416316)]], [[gerrit:1236689|IPRe [14:18:26] putationIPoidDataLookup: Allow returning stale values for 72 hours (T416316)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:18:37] kostajh: hashar: ready for testing [14:18:52] urbanecm: thanks, will look [14:19:05] * urbanecm looking at his patches [14:19:38] thx [14:19:43] I am not sure how to test the video encoding ;) [14:19:51] (03PS3) 10Daniel Kinzler: api-gateway: Re-apply "Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233154 (https://phabricator.wikimedia.org/T405578) [14:20:13] hashar: in this particular case, i don't think there is a way, as it is a job [14:20:17] urbanecm: good from my side [14:20:27] !log urbanecm@deploy2002 hartman, kharlan, urbanecm: Continuing with sync [14:20:28] yeah I imagine the x-debug is not carried to the backend job [14:20:35] !log sudo cumin -b1 -s5 "A:dnsbox" "run-puppet-agent --enable 'merging CR 1228560'" [14:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:38] indeed [14:20:48] but TheDJ monitors transcoding issues ( https://quarry.wmcloud.org/query/83666 ) so I guess he will be able to confirm the fix [14:21:39] (03CR) 10Daniel Kinzler: api-gateway: Re-apply "Rest-gateway Read `ratelimit_class` and `user_id` from JWT" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233154 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [14:21:53] (03CR) 10Marostegui: [C:03+1] mysql: rename newpool cookbook to pool [cookbooks] - 10https://gerrit.wikimedia.org/r/1236726 (https://phabricator.wikimedia.org/T383674) (owner: 10Federico Ceratto) [14:21:54] (03PS5) 10Daniel Kinzler: rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 (https://phabricator.wikimedia.org/T405578) [14:22:16] (03PS3) 10Daniel Kinzler: rest gateway: include service values.yaml when testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229119 [14:23:02] sounds good! 
[14:23:11] (03PS4) 10Daniel Kinzler: api-gateway: Re-apply "Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233154 (https://phabricator.wikimedia.org/T405578) [14:23:24] (03CR) 10Urbanecm: [C:03+2] Add client.tag_metadata_categories field support [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236747 (https://phabricator.wikimedia.org/T414571) (owner: 10Kosta Harlan) [14:23:37] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1236753 (https://phabricator.wikimedia.org/T416480) [14:23:43] kostajh: i take that https://gerrit.wikimedia.org/r/c/mediawiki/extensions/IPReputation/+/1236747 is missing from .13 intentionally? [14:23:51] as far as i can see, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/IPReputation/+/1236248 is not in .13 [14:24:31] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236739|Fix audio transcodes]], [[gerrit:1236749|DatabaseUserImpactStore: log attempts to save zero pageviews values (T414080)]], [[gerrit:1236748|DatabaseUserImpactStore: log attempts to save zero pageviews values (T414080)]], [[gerrit:1236690|IPReputationIPoidDataLookup: Allow returning stale values for 72 hours (T416316)]], [[gerrit:1236689|IPRepu [14:24:31] tationIPoidDataLookup: Allow returning stale values for 72 hours (T416316)]] (duration: 08m 17s) [14:24:35] T414080: x1 increase in writes results in a large increase of binlog files (over 2000) - https://phabricator.wikimedia.org/T414080 [14:24:35] T416316: IPReputation: Improve caching logic to handle backend downtime - https://phabricator.wikimedia.org/T416316 [14:25:29] kostajh: anyway, i'm done. 
feel free to take over (or i can finish if you prefer) [14:25:54] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06ServiceOps new, and 6 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451#11583455 (10Blake) p:05Triage→03Medium [14:26:01] (03Merged) 10jenkins-bot: Add client.tag_metadata_categories field support [extensions/IPReputation] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236747 (https://phabricator.wikimedia.org/T414571) (owner: 10Kosta Harlan) [14:26:43] urbanecm: if you're able to press the button, that would be great [14:26:54] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06ServiceOps new, and 6 others: October 2025 Bullseye reboots (ServiceOps hosts) - https://phabricator.wikimedia.org/T416451#11583459 (10Blake) Moving this to scheduled, as I'll do the conf servers we can during the switchover. [14:26:56] urbanecm: and yes, missing from wmf.13 because of merge conflict, and I don't want to bother with fixing it now :) [14:27:05] (03PS1) 10Muehlenhoff: Remove access for chandra-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1236754 [14:27:15] okay, i just wanted to make sure it's not an omission [14:27:16] doing! 
[14:27:52] !log remove legacy kibana discovery certificate T365798 [14:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:55] T365798: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 [14:27:57] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1236747|Add client.tag_metadata_categories field support (T414571)]] [14:28:00] T414571: Add spur data next to IP in the checkuser result page - https://phabricator.wikimedia.org/T414571 [14:28:39] once it has completed I will ask reencoding of some audio file on cmmons and verify it got fixed [14:28:47] example: https://commons.wikimedia.org/wiki/File:Marcos_Brunet_-_Dialogo_Intimo_-_(Disco_Completo).ogg [14:28:53] * hashar coffee [14:29:27] (03PS1) 10Vgutierrez: traffic: avoid firing FermMSS alert if observed MSS is 0 [alerts] - 10https://gerrit.wikimedia.org/r/1236755 (https://phabricator.wikimedia.org/T400155) [14:30:01] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1023.eqiad.wmnet with reason: host reimage [14:30:03] hashar: patch should be live already [14:30:03] !log urbanecm@deploy2002 kharlan, urbanecm: Backport for [[gerrit:1236747|Add client.tag_metadata_categories field support (T414571)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:30:07] happy to push some on-wiki btns if needed [14:30:10] kostajh: can you test, please? 
[14:30:10] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1022.eqiad.wmnet with reason: host reimage [14:30:14] urbanecm: yes, will do [14:30:17] ty [14:30:31] (03PS6) 10Daniel Kinzler: rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 (https://phabricator.wikimedia.org/T405578) [14:30:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:31:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P88651 and previous config saved to /var/cache/conftool/dbconfig/20260204-143108-marostegui.json [14:32:00] urbanecm: should be good to go [14:33:02] (03CR) 10Daniel Kinzler: redioscope: enable time bucket (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [14:33:13] !log urbanecm@deploy2002 kharlan, urbanecm: Continuing with sync [14:33:15] proceeding [14:34:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1023.eqiad.wmnet with reason: host reimage [14:35:55] urbanecm: thanks! [14:36:57] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1022.eqiad.wmnet with reason: host reimage [14:37:23] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236747|Add client.tag_metadata_categories field support (T414571)]] (duration: 09m 26s) [14:37:26] T414571: Add spur data next to IP in the checkuser result page - https://phabricator.wikimedia.org/T414571 [14:37:46] (03Abandoned) 10Slyngshede: P:ganeti: Absent checks for generic Ganeti services. 
[puppet] - 10https://gerrit.wikimedia.org/r/1003374 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:38:11] (03CR) 10Slyngshede: [V:03+2 C:03+2] Docker build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [14:39:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236361 (https://phabricator.wikimedia.org/T416174) (owner: 10Seawolf35gerrit) [14:39:34] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11583538 (10elukey) >>! In T412951#11576712, @Scott_French wrote: > Thank you, @elukey! > > No objections... [14:39:41] (03Abandoned) 10Alexandros Kosiaris: changeprop: Remove all MCS endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017054 (https://phabricator.wikimedia.org/T361483) (owner: 10Alexandros Kosiaris) [14:40:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:41:08] (03PS1) 10Slyngshede: Docker documentation [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1236759 [14:41:33] (03PS3) 10Daniel Kinzler: rediscope: lower cpu and memoy limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233161 [14:41:40] (03PS5) 10Daniel Kinzler: redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 [14:42:38] (03PS3) 10Alexandros Kosiaris: toolhub: make extraFQDNs specific to codfw, eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/954290 
[14:43:53] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:44:20] PROBLEM - Swift https backend on ms-fe1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:44:45] (03PS3) 10Brouberol: airflow: allow the definition of rsync/ssh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236756 (https://phabricator.wikimedia.org/T402512) [14:44:54] (03PS5) 10Brouberol: airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) [14:45:03] (03CR) 10Elukey: [C:03+1] transports: add new API for the execution results [software/cumin] - 10https://gerrit.wikimedia.org/r/1224033 (owner: 10Volans) [14:45:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1181 with weight 0 T416356', diff saved to https://phabricator.wikimedia.org/P88652 and previous config saved to /var/cache/conftool/dbconfig/20260204-144508-marostegui.json [14:45:11] T416356: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T416356 [14:45:25] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1236315 (https://phabricator.wikimedia.org/T416356) (owner: 10Gerrit maintenance bot) [14:45:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T416356 [14:45:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:46:14] !log Starting s7 eqiad failover from db1236 to db1181 - T416356 [14:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:46:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P88653 and previous config saved to /var/cache/conftool/dbconfig/20260204-144616-marostegui.json [14:47:09] (03CR) 10Daniel Kinzler: [C:03+2] api-gateway: Re-apply "Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233154 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [14:47:14] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [14:47:34] (03PS1) 10Marostegui: db1236: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1236760 (https://phabricator.wikimedia.org/T415786) [14:47:46] (03PS6) 10Daniel Kinzler: redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 [14:48:06] (03CR) 10Marostegui: [C:03+2] db1236: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1236760 (https://phabricator.wikimedia.org/T415786) (owner: 10Marostegui) [14:49:12] (03Merged) 10jenkins-bot: api-gateway: Re-apply "Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233154 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [14:49:13] (03Merged) 10jenkins-bot: rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [14:49:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1181 to s7 primary T416356', diff saved to https://phabricator.wikimedia.org/P88654 and previous config saved to /var/cache/conftool/dbconfig/20260204-144914-marostegui.json [14:49:22] (03CR) 10Daniel Kinzler: "Something is 
wrong with the minikube overrides after rebasing on top of the fix for including the service's value.yaml when testing in min" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [14:49:29] (03CR) 10Daniel Kinzler: [C:04-1] redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [14:49:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1236 T416356', diff saved to https://phabricator.wikimedia.org/P88655 and previous config saved to /var/cache/conftool/dbconfig/20260204-144951-marostegui.json [14:50:22] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1234366 (owner: 10Majavah) [14:51:17] (03CR) 10Majavah: [C:03+2] wmflib: deep_merge: Do not duplicate array values [puppet] - 10https://gerrit.wikimedia.org/r/1234366 (owner: 10Majavah) [14:51:22] urbanecm: I have verified the audio transcoding has been fixed. Thanks! [14:51:27] awesome [14:52:06] (03CR) 10Btullis: [C:03+1] "Looks good to me. One small whitespace nit." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1236756 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [14:52:06] (03Abandoned) 10Alexandros Kosiaris: toolhub: make extraFQDNs specific to codfw, eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/954290 (owner: 10Alexandros Kosiaris) [14:52:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1236.eqiad.wmnet with reason: Schema change [14:52:39] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:53:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:53:05] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1023.eqiad.wmnet with OS bullseye [14:53:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye completed: - ms-fe1023 (**PASS**) - Host successfully... 
[14:53:11] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:53:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1236.eqiad.wmnet with reason: Maintenance [14:53:56] jouncebot: nowandnext [14:53:56] For the next 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1400) [14:53:56] In 0 hour(s) and 6 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1500) [14:54:21] (03Abandoned) 10Alexandros Kosiaris: changeprop: Change normal_rule_processing to histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/937090 (owner: 10Alexandros Kosiaris) [14:54:22] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:54:23] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:54:30] PROBLEM - Swift https backend on ms-fe1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:54:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:54:51] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1022.eqiad.wmnet with OS bullseye [14:54:57] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1022.eqiad.wmnet with OS bullseye completed: - ms-fe1022 (**PASS**) - Removed from Pupp... 
[14:56:15] PROBLEM - Swift https backend on ms-fe1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [14:56:27] (03PS1) 10Urbanecm: Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236761 (https://phabricator.wikimedia.org/T414080) [14:56:31] (03PS1) 10Urbanecm: Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236762 (https://phabricator.wikimedia.org/T414080) [14:56:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236761 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [14:56:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236762 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [14:58:00] (03CR) 10Btullis: [C:03+1] "I'm not super-happy about hard-coding the SSH fingerprints, but we can always come back to it." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1500) [15:00:09] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583704 (10Jclark-ctr) [15:00:15] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11583710 (10Jclark-ctr) 05Open→03Resolved [15:01:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T415786)', diff saved to https://phabricator.wikimedia.org/P88656 and previous config saved to /var/cache/conftool/dbconfig/20260204-150124-marostegui.json [15:01:28] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [15:01:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [15:01:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T415786)', diff saved to https://phabricator.wikimedia.org/P88657 and previous config saved to /var/cache/conftool/dbconfig/20260204-150138-marostegui.json [15:05:07] PROBLEM - Swift https backend on ms-fe1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:06:50] (03CR) 10FNegri: [C:03+1] "PCC looks good, ship it" [puppet] - 10https://gerrit.wikimedia.org/r/1234357 (https://phabricator.wikimedia.org/T398214) (owner: 10Majavah) [15:07:07] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::instance: Convert root keys list to YAML [puppet] - 10https://gerrit.wikimedia.org/r/1234357 (https://phabricator.wikimedia.org/T398214) (owner: 10Majavah) [15:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:44] (03Merged) 10jenkins-bot: Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236761 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [15:09:46] (03Merged) 10jenkins-bot: Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236762 (https://phabricator.wikimedia.org/T414080) (owner: 10Urbanecm) [15:10:05] (03PS2) 10Zabe: mediawiki: Do not run updateSpecialPages for DeadendPages on commons [puppet] - 10https://gerrit.wikimedia.org/r/1225119 (https://phabricator.wikimedia.org/T371662) [15:10:09] (03CR) 10Ladsgroup: [C:03+2] mediawiki: Do not run updateSpecialPages for DeadendPages on commons [puppet] - 10https://gerrit.wikimedia.org/r/1225119 (https://phabricator.wikimedia.org/T371662) (owner: 10Zabe) [15:10:11] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Do not run updateSpecialPages for DeadendPages on commons [puppet] - 10https://gerrit.wikimedia.org/r/1225119 (https://phabricator.wikimedia.org/T371662) (owner: 10Zabe) [15:10:20] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1236761|Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" (T414080)]], [[gerrit:1236762|Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" (T414080)]] [15:10:23] T414080: x1 increase in writes results in a large increase of binlog files (over 2000) - https://phabricator.wikimedia.org/T414080 [15:10:35] (03CR) 10Brouberol: airflow: allow the definition of rsync/ssh configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236756 
(https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:10:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:11:36] 06SRE, 06Infrastructure-Foundations, 10netops: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769#11583801 (10Aklapper) @BBlack: Another ping [15:12:04] !log restarting Cassandra on [aqs2002.codfw.wmnet,aqs1010.eqiad.wmnet] to canary Java 11.0.30 — [15:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:25] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1236761|Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" (T414080)]], [[gerrit:1236762|Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" (T414080)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:12:58] !log restarting Cassandra on [aqs2002.codfw.wmnet,aqs1010.eqiad.wmnet] to canary Java 11.0.30 — T416492 [15:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:01] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [15:13:21] (03PS1) 10Andrew Bogott: Add site.pp entry for bast1004 [puppet] - 10https://gerrit.wikimedia.org/r/1236765 (https://phabricator.wikimedia.org/T416254) [15:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [15:21:48] (03PS1) 10Bking: data-platform: don't enforce linting on seldom-used series [alerts] - 10https://gerrit.wikimedia.org/r/1236766 (https://phabricator.wikimedia.org/T412447) [15:21:54] !log urbanecm@deploy2002 urbanecm: Continuing with sync [15:24:34] (03CR) 10Muehlenhoff: Add site.pp entry for bast1004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1236765 (https://phabricator.wikimedia.org/T416254) (owner: 10Andrew Bogott) [15:26:40] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236761|Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" (T414080)]], [[gerrit:1236762|Revert "DatabaseUserImpactStore: log attempts to save zero pageviews values" (T414080)]] (duration: 16m 20s) [15:26:44] T414080: x1 increase in writes results in a large increase of binlog files (over 2000) - https://phabricator.wikimedia.org/T414080 [15:27:18] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11583889 (10MoritzMuehlenhoff) [15:28:06] !log ladsgroup@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [15:28:13] (03CR) 10Slyngshede: "Seems fine logically. 
Note that the test will parse regardless of the lvs_realserver_mss_value > 0" [alerts] - 10https://gerrit.wikimedia.org/r/1236755 (https://phabricator.wikimedia.org/T400155) (owner: 10Vgutierrez) [15:28:20] (03CR) 10Slyngshede: [C:03+1] traffic: avoid firing FermMSS alert if observed MSS is 0 [alerts] - 10https://gerrit.wikimedia.org/r/1236755 (https://phabricator.wikimedia.org/T400155) (owner: 10Vgutierrez) [15:28:36] (03PS1) 10Muehlenhoff: Configure UEFI partman config for bast1004 [puppet] - 10https://gerrit.wikimedia.org/r/1236767 (https://phabricator.wikimedia.org/T416254) [15:28:50] (03CR) 10Hashar: [C:03+1] Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [vendor] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236693 (https://phabricator.wikimedia.org/T416456) (owner: 10Zabe) [15:28:59] (03CR) 10Hashar: [C:03+1] Revert "Updated lcobucci/jwt from 4.1.5 to 4.3.0" [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236692 (https://phabricator.wikimedia.org/T416456) (owner: 10Zabe) [15:29:16] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:29:42] !log ladsgroup@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [15:29:43] (03CR) 10Andrew Bogott: [C:04-1] "actually this needs uefi preseed entry too" [puppet] - 10https://gerrit.wikimedia.org/r/1236765 (https://phabricator.wikimedia.org/T416254) (owner: 10Andrew Bogott) [15:29:45] (03PS2) 10Bking: data-platform: don't enforce linting on seldom-used series [alerts] - 10https://gerrit.wikimedia.org/r/1236766 (https://phabricator.wikimedia.org/T412447) [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1500) 
[15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1530) [15:31:44] (03CR) 10Vgutierrez: [C:03+2] traffic: avoid firing FermMSS alert if observed MSS is 0 [alerts] - 10https://gerrit.wikimedia.org/r/1236755 (https://phabricator.wikimedia.org/T400155) (owner: 10Vgutierrez) [15:32:04] (03PS2) 10Andrew Bogott: Add site.pp and preseed entry for bast1004 [puppet] - 10https://gerrit.wikimedia.org/r/1236765 (https://phabricator.wikimedia.org/T416254) [15:32:11] (03CR) 10Elukey: [C:03+1] airflow: allow the definition of rsync/ssh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236756 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:32:29] (03CR) 10Elukey: [C:03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:32:31] (03CR) 10Andrew Bogott: Add site.pp and preseed entry for bast1004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1236765 (https://phabricator.wikimedia.org/T416254) (owner: 10Andrew Bogott) [15:32:57] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org [reason: bird2 upgrade] [15:33:49] (03CR) 10Elukey: [C:03+1] Configure UEFI partman config for bast1004 [puppet] - 10https://gerrit.wikimedia.org/r/1236767 (https://phabricator.wikimedia.org/T416254) (owner: 10Muehlenhoff) [15:33:55] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host bast1004.eqiad.wmnet with OS trixie [15:34:06] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11583917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host bast1004.eqiad.wmnet with OS trixie [15:34:12] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast1004.eqiad.wmnet with OS 
trixie [15:34:16] (03CR) 10Elukey: [C:03+1] Remove access for chandra-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1236754 (owner: 10Muehlenhoff) [15:34:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:23] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11583918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host bast1004.eqiad.wmnet with OS trixie executed with errors: - bast1004 (**FAIL... [15:34:26] !log upgrade to bird 2.18 on dns4003: T413740 [15:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:28] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740 [15:35:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:02] (03PS3) 10Bking: data-platform: Add site tags, don't enforce linting on probe series [alerts] - 10https://gerrit.wikimedia.org/r/1236766 (https://phabricator.wikimedia.org/T412447) [15:37:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1236765 (https://phabricator.wikimedia.org/T416254) (owner: 10Andrew Bogott) [15:38:28] !log restarting Cassandra on aqs[2001,2003-2012] & aqs[1011,1014-1027] to apply Java 11.0.30 — T416492 [15:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:31] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [15:39:24] !log root@cumin1003 
DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [15:39:29] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11583927 (10MoritzMuehlenhoff) [15:39:32] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org [reason: [end] bird2 upgrade] [15:39:44] (03PS2) 10Joal: Add turnilo-next and turnilo to wmnet/wm.org [dns] - 10https://gerrit.wikimedia.org/r/1236740 (https://phabricator.wikimedia.org/T416115) [15:39:53] (03CR) 10Joal: Add turnilo-next and turnilo to wmnet/wm.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1236740 (https://phabricator.wikimedia.org/T416115) (owner: 10Joal) [15:42:54] (03CR) 10Brouberol: [C:03+2] airflow: allow the definition of rsync/ssh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236756 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:42:57] (03CR) 10Brouberol: [C:03+2] airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:43:09] (03PS4) 10Brouberol: airflow: allow the definition of rsync/ssh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236756 (https://phabricator.wikimedia.org/T402512) [15:43:09] (03PS6) 10Brouberol: airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) [15:43:15] (03CR) 10CI reject: [V:04-1] airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:43:34] (03CR) 10Muehlenhoff: [C:03+2] Remove access for chandra-wmde [puppet] - 
10https://gerrit.wikimedia.org/r/1236754 (owner: 10Muehlenhoff) [15:46:55] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host bast1004 [15:46:55] !log jclark@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host bast1004 [15:47:05] (03CR) 10Gehel: data-platform: Add site tags, don't enforce linting on probe series (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1236766 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [15:47:10] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [15:47:41] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs[2001,2003-2012,1011-1021]*: Applying upgrade to Java 11.0.30 - eevans@cumin1003 [15:48:20] (03PS1) 10Marostegui: Revert "db1236: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1236768 [15:48:37] (03CR) 10Bking: data-platform: Add site tags, don't enforce linting on probe series (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1236766 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [15:48:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1236 gradually with 4 steps - After schema change [15:48:59] (03CR) 10Marostegui: [C:03+2] Revert "db1236: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1236768 (owner: 10Marostegui) [15:49:34] !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:49:40] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T416494 (10Nicholusmuwonge_wmde) 03NEW [15:50:10] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [15:51:01] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging ChandraWMDE out of all services on: 2497 hosts [15:53:36] (03CR) 10Andrew Bogott: [C:03+2] Add site.pp and preseed entry for bast1004 [puppet] - 10https://gerrit.wikimedia.org/r/1236765 (https://phabricator.wikimedia.org/T416254) 
(owner: 10Andrew Bogott) [15:53:42] 06SRE, 10LDAP-Access-Requests: Grant Access to !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt bast1004 - jclark@cumin1003" [15:54:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt bast1004 - jclark@cumin1003" [15:54:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:54:48] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host bast1004 [15:55:20] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236702 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [15:55:58] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:56:05] (03CR) 10Brouberol: airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [15:56:08] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host bast1004 [15:56:46] (03Abandoned) 10Muehlenhoff: Configure UEFI partman config for bast1004 [puppet] - 10https://gerrit.wikimedia.org/r/1236767 (https://phabricator.wikimedia.org/T416254) (owner: 10Muehlenhoff) [16:00:24] 10ops-eqsin, 06SRE: Unresponsive management for cp5022.mgmt:22 - https://phabricator.wikimedia.org/T416193#11584070 (10RobH) 05Open→03Resolved a:03RobH T414411 host is down [16:01:16] (03PS2) 10Cathal Mooney: wikimedia.org: add IPv6 AAAA record for ns1 [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:01:56] (03CR) 10Cathal Mooney: "ok you have a 
completely upside down view about what I like to do for fun :P" [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:02:05] (03CR) 10CI reject: [V:04-1] wikimedia.org: add IPv6 AAAA record for ns1 [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:03:24] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584077 (10RobH) After having Jin check, this system has a failure of "The system board 5V SW PG voltage is outside of range." on the front LCD he plugged into it. The warranty expired in October 20... [16:04:45] (03CR) 10Btullis: [C:03+1] airflow: allow the definition of rsync/ssh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236756 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [16:05:23] (03CR) 10Brouberol: [C:03+2] airflow: allow the definition of rsync/ssh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236756 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [16:05:26] (03CR) 10Brouberol: [C:03+2] airflow-sre: enable ssh access from task pods to the puppetservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236729 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [16:05:38] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251#11584084 (10Jhancock.wm) @Dwisehaupt or @Jgreen can i rack this in the new rack? or is this going in the og one at codfw? 
[16:06:28] (03PS1) 10Daniel Kinzler: rest-gateway: fix staging tests for jwt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236770 [16:06:44] (03CR) 10Vgutierrez: [C:03+1] "VTCs are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1236702 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:13:44] !log bumping rate limit of non-standard thumb sizes to medium browser score (T402792 T414805) [16:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:49] T402792: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792 [16:13:50] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [16:15:54] (03CR) 10Fabfur: [C:03+2] cache::upload: enable global ratelimiting for bot (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1236702 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:16:32] (03PS3) 10Cathal Mooney: wikimedia.org: add IPv6 AAAA record for ns1 [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:16:38] (03CR) 10Dduvall: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1236730 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [16:18:30] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host bast1004.wikimedia.org with OS trixie [16:18:43] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11584106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host bast1004.wikimedia.org with OS trixie [16:18:45] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast1004.wikimedia.org with OS trixie [16:18:52] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11584110 (10Jclark-ctr) [16:18:55] 
10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11584111 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host bast1004.wikimedia.org with OS trixie executed with errors: - bast1004 (**FA... [16:22:40] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host bast1004.wikimedia.org with OS trixie [16:23:35] (03CR) 10Ssingh: [C:03+1] "We will still wait to merge to get clarity on if we are publishing the glue records on the registrar today." [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:24:46] (03CR) 10Ssingh: [C:03+1] wikimedia.org: add IPv6 AAAA record for ns1 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:33:18] (03CR) 10Pmiazga: [C:03+1] rest-gateway: fix staging tests for jwt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236770 (owner: 10Daniel Kinzler) [16:34:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1236 gradually with 4 steps - After schema change [16:39:14] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on bast1004.wikimedia.org with reason: host reimage [16:39:29] 10ops-eqsin: Unresponsive management for cp5022.mgmt:22 - https://phabricator.wikimedia.org/T416499 (10phaultfinder) 03NEW [16:40:39] (03PS1) 10Dwisehaupt: frack: Update dns handles for frack colo work [dns] - 10https://gerrit.wikimedia.org/r/1236779 (https://phabricator.wikimedia.org/T403035) [16:41:20] (03PS1) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236780 (https://phabricator.wikimedia.org/T415312) [16:41:43] (03CR) 10Urbanecm: [C:03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236780 (https://phabricator.wikimedia.org/T415312) (owner: 
10Urbanecm) [16:43:08] (03PS5) 10Alexandros Kosiaris: base::sysctl: Switch priority of the ubuntu-defaults stanza [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) [16:43:08] (03PS6) 10Alexandros Kosiaris: services_proxy: Switch listen_ipv6 to true by default [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) [16:43:16] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [16:44:16] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236780 (https://phabricator.wikimedia.org/T415312) (owner: 10Urbanecm) [16:44:54] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast1004.wikimedia.org with reason: host reimage [16:45:45] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: fix staging tests for jwt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236770 (owner: 10Daniel Kinzler) [16:47:24] !log urbanecm@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [16:47:47] (03Merged) 10jenkins-bot: rest-gateway: fix staging tests for jwt [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236770 (owner: 10Daniel Kinzler) [16:49:37] !log urbanecm@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [16:50:05] !log urbanecm@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [16:50:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T415786)', diff saved to https://phabricator.wikimedia.org/P88664 and previous config saved to /var/cache/conftool/dbconfig/20260204-165022-marostegui.json [16:50:26] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [16:50:37] (03CR) 10Fabfur: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236703 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [16:52:33] !log urbanecm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [16:53:15] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [16:54:49] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584257 (10RobH) [16:55:08] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [16:55:46] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584262 (10RobH) [16:58:56] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584264 (10RobH) [16:59:25] (03CR) 10FNegri: [C:03+1] aptrepo: Drop packages for Kubeadm/1.30 [puppet] - 10https://gerrit.wikimedia.org/r/1226878 (https://phabricator.wikimedia.org/T372697) (owner: 10Majavah) [17:00:04] (03CR) 10FNegri: "Sorry I missed this one, now it needs rebasing." 
[puppet] - 10https://gerrit.wikimedia.org/r/1226879 (https://phabricator.wikimedia.org/T379047) (owner: 10Majavah) [17:02:27] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [17:02:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [17:02:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast1004.wikimedia.org with OS trixie [17:03:46] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host bast1004.wikimedia.org with OS trixie [17:03:55] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11584282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host bast1004.wikimedia.org with OS trixie [17:05:12] FIRING: ProbeDown: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#zotero:4969 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:05:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P88665 and previous config saved to /var/cache/conftool/dbconfig/20260204-170530-marostegui.json [17:09:16] RESOLVED: ProbeDown: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#zotero:4969 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:27] (03CR) 10AOkoth: [C:03+1] wmnet: upgrade vrts from the "without multiple backends" section [dns] - 
10https://gerrit.wikimedia.org/r/1236384 (owner: 10Dzahn) [17:14:23] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#11584292 (10elukey) Coming back again for the deletion. Using the `swift` command is a lot of pain, so I tried `s3cmd`... [17:17:50] (03CR) 10Jgreen: [C:03+1] frack: Update dns handles for frack colo work [dns] - 10https://gerrit.wikimedia.org/r/1236779 (https://phabricator.wikimedia.org/T403035) (owner: 10Dwisehaupt) [17:20:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P88666 and previous config saved to /var/cache/conftool/dbconfig/20260204-172039-marostegui.json [17:21:13] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on bast1004.wikimedia.org with reason: host reimage [17:21:23] 06SRE, 06collaboration-services, 06Security-Team: Grant sbassett and aranyap expanded logstash access - https://phabricator.wikimedia.org/T416501 (10sbassett) 03NEW [17:24:11] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast1004.wikimedia.org with reason: host reimage [17:32:39] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Nicholusmuwonge - https://phabricator.wikimedia.org/T416494#11584367 (10Aklapper) [17:35:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T415786)', diff saved to https://phabricator.wikimedia.org/P88667 and previous config saved to /var/cache/conftool/dbconfig/20260204-173547-marostegui.json [17:35:51] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [17:36:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [17:36:13] !log marostegui@cumin1003 
dbctl commit (dc=all): 'Depooling db2182 (T415786)', diff saved to https://phabricator.wikimedia.org/P88668 and previous config saved to /var/cache/conftool/dbconfig/20260204-173612-marostegui.json [17:41:48] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast1004.wikimedia.org with OS trixie [17:41:58] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11584389 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host bast1004.wikimedia.org with OS trixie completed: - bast1004 (**PASS**) - Downtimed on Icinga/Ale... [17:42:29] (03CR) 10Dwisehaupt: [C:03+2] frack: Update dns handles for frack colo work [dns] - 10https://gerrit.wikimedia.org/r/1236779 (https://phabricator.wikimedia.org/T403035) (owner: 10Dwisehaupt) [17:42:52] !log dwisehaupt@dns1004 START - running authdns-update [17:44:07] !log dwisehaupt@dns1004 END - running authdns-update [17:46:24] (03PS3) 10Volans: wmcs: fix infra-tracing-nfs [puppet] - 10https://gerrit.wikimedia.org/r/1231034 (https://phabricator.wikimedia.org/T415199) [17:51:27] (03CR) 10Volans: "Re-tested on toolsbeta-test-k8s-worker-nfs-10, ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1231034 (https://phabricator.wikimedia.org/T415199) (owner: 10Volans) [17:56:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:57:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:57:38] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:58:38] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:58:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1800) [18:03:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:04:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[18:04:59] Deployment zotero-production in zotero at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=zotero&var-deployment=zotero-production - ... [18:05:02] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:06:03] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11584487 (10KFrancis) Hello all, the NDA has been sent for signatures. I'll confirm when it's complete. Thanks! [18:09:16] FIRING: ProbeDown: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#zotero:4969 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:16] RESOLVED: ProbeDown: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#zotero:4969 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:20] (03CR) 10Cathal Mooney: wikimedia.org: add IPv6 AAAA record for ns1 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [18:14:36] (03PS4) 10Cathal Mooney: wikimedia.org: add IPv6 AAAA record for ns1 [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [18:15:05] (03CR) 10Cathal Mooney: wikimedia.org: add IPv6 AAAA record for ns1 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [18:19:11] (03CR) 10Dzahn: [C:03+2] wmnet: upgrade 
vrts from the "without multiple backends" section [dns] - 10https://gerrit.wikimedia.org/r/1236384 (owner: 10Dzahn) [18:19:16] FIRING: ProbeDown: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#zotero:4969 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:39] !log dzahn@dns1004 START - running authdns-update [18:20:38] !log dzahn@dns1004 END - running authdns-update [18:21:14] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [18:21:19] !log dzahn@dns1004 START - running authdns-update [18:21:37] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs[2001,2003-2012,1011-1021]*: Applying upgrade to Java 11.0.30 - eevans@cumin1003 [18:21:52] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [18:22:30] !log dzahn@dns1004 END - running authdns-update [18:24:16] RESOLVED: ProbeDown: Service zotero:4969 has failed probes (http_zotero_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#zotero:4969 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:34] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [18:29:57] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [18:30:17] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584635 (10wiki_willy) Hey @RobH - did Jin say what kind of initial troubleshooting he did? Like did he do a power drain, reseat certain parts, etc? I think we can go ahead and purchase parts to se... 
[18:34:24] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584652 (10RobH) He did full troubleshooting with photos with me in a google chat, it included the following: * confirming the power ports on the PDU towers were outputting power * confirming the pow... [18:36:18] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584657 (10RobH) We cannot purchase replacement parts via the Dell website I linked in, which is the Dell site linked when you try to open a case without warranty support. The alternative is to try... [18:37:15] (03PS1) 10Cathal Mooney: wikimedia.org: add IPv6 AAAA record for ns0 [dns] - 10https://gerrit.wikimedia.org/r/1236798 (https://phabricator.wikimedia.org/T81605) [18:39:36] (03CR) 10Dzahn: [C:03+2] zookeeper/zuul: use standard port 2281 for TLS secureClientPort [puppet] - 10https://gerrit.wikimedia.org/r/1236730 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [18:41:00] (03PS2) 10Dzahn: zookeeper: set keystore format to PKCS12 when enabling TLS (for zuul) [puppet] - 10https://gerrit.wikimedia.org/r/1236735 (https://phabricator.wikimedia.org/T405119) [18:41:05] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584686 (10wiki_willy) Sounds good @RobH, that plan works for me as well. Do you know if Jin has access to any of these parts by any chance? If he is able to get a hold of them, he could just add t... [18:42:16] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11584697 (10RobH) I'll ask, also going to ask in dc ops meeting if anyone has a spare r450 they can crack open to check for the part # of the power distribution board and the mainboard. 
[18:44:49] 06SRE, 06Infrastructure-Foundations: Build OpenGear serial port config from Netbox - https://phabricator.wikimedia.org/T415345#11584711 (10cmooney) >>! In T415345#11578174, @ayounsi wrote: > Nice, I copy pasted what you did in Netbox's "render config" feature, that's the result : https://netbox-next.wikimedia.... [18:46:09] (03PS7) 10Daniel Kinzler: redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 [18:46:22] (03CR) 10CI reject: [V:04-1] redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [18:47:36] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Applying upgrade to Java 11.0.30 - eevans@cumin1003 [18:48:21] !log restart Cassandra to apply Java 11.0.30 upgrade, restbase/eqiad — T416492 [18:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:24] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [18:51:17] (03PS4) 10Daniel Kinzler: rest gateway: include service values.yaml when testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229119 [18:52:03] (03PS18) 10Daniel Kinzler: rest gateway: add tests for chart rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 [18:52:19] (03PS4) 10Daniel Kinzler: rediscope: lower cpu and memoy limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233161 [18:52:23] (03PS8) 10Daniel Kinzler: redioscope: enable time bucket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 [18:52:35] (03CR) 10Ssingh: [C:03+1] wikimedia.org: add IPv6 AAAA record for ns0 [dns] - 10https://gerrit.wikimedia.org/r/1236798 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [18:54:26] (03CR) 10Scott French: "Following up here from #wikimedia-sre:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236710 (owner: 10Mvolz) 
[18:55:16] (03CR) 10Dzahn: [C:03+2] zookeeper: set keystore format to PKCS12 when enabling TLS (for zuul) [puppet] - 10https://gerrit.wikimedia.org/r/1236735 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [18:58:45] (03CR) 10Cathal Mooney: [C:04-1] "Not to be merged until ns1 in place and we are happy." [dns] - 10https://gerrit.wikimedia.org/r/1236798 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1900) [19:00:48] (03PS1) 10Cathal Mooney: wikimedia.org: add IPv6 AAAA record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/1236803 (https://phabricator.wikimedia.org/T81605) [19:01:13] (still blocked, await review on vendor patches.) [19:02:47] (03CR) 10Cathal Mooney: [C:04-1] "Not to be merged until we are happy with ns1 and ns0 progress" [dns] - 10https://gerrit.wikimedia.org/r/1236803 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:08:50] (03CR) 10Ssingh: [C:03+1] wikimedia.org: add IPv6 AAAA record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/1236803 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:09:03] (03CR) 10Ssingh: [C:04-1] "[+1 on CR, -1 so we don't merge]" [dns] - 10https://gerrit.wikimedia.org/r/1236803 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:13:48] (03PS1) 10Mvolz: Revert "Retry update of zotero" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236806 [19:14:11] (03CR) 10Mvolz: [C:03+2] Revert "Retry update of zotero" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236806 (owner: 10Mvolz) [19:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [19:16:01] (03Merged) 10jenkins-bot: Revert "Retry update of zotero" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236806 (owner: 10Mvolz) [19:16:47] jouncebot: nowandnext [19:16:47] For the next 1 hour(s) and 43 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T1900) [19:16:47] In 1 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T2100) [19:19:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T415786)', diff saved to https://phabricator.wikimedia.org/P88670 and previous config saved to /var/cache/conftool/dbconfig/20260204-191947-marostegui.json [19:19:51] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [19:19:51] (03PS1) 10Dzahn: zuul: move .p12 keystore file under the zookeeper config path [puppet] - 10https://gerrit.wikimedia.org/r/1236809 (https://phabricator.wikimedia.org/T405119) [19:22:11] (03CR) 10CDanis: [C:03+1] gerrit: add GerritHaProxy* alerts [alerts] - 10https://gerrit.wikimedia.org/r/1236746 (https://phabricator.wikimedia.org/T416189) (owner: 10Jelto) [19:22:28] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11584873 (10Jclark-ctr) [19:22:31] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11584875 (10Jclark-ctr) 05Open→03Resolved [19:23:57] (03PS2) 10Dzahn: zuul: move .p12 keystore file under the zookeeper config path [puppet] - 10https://gerrit.wikimedia.org/r/1236809 (https://phabricator.wikimedia.org/T405119) [19:27:55] (03CR) 10Dzahn: [C:03+2] zuul: move .p12 keystore file under the zookeeper config path [puppet] - 10https://gerrit.wikimedia.org/r/1236809 (https://phabricator.wikimedia.org/T405119) (owner: 
10Dzahn) [19:29:16] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:34:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P88671 and previous config saved to /var/cache/conftool/dbconfig/20260204-193455-marostegui.json [19:38:32] (03PS1) 10Bartosz Dziewoński: Add messages for 'local-bot' global group [extensions/WikimediaMessages] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236811 (https://phabricator.wikimedia.org/T415588) [19:38:43] (03PS1) 10Bartosz Dziewoński: Add messages for 'local-bot' global group [extensions/WikimediaMessages] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236812 (https://phabricator.wikimedia.org/T415588) [19:39:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236811 (https://phabricator.wikimedia.org/T415588) (owner: 10Bartosz Dziewoński) [19:39:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236812 (https://phabricator.wikimedia.org/T415588) (owner: 10Bartosz Dziewoński) [19:43:06] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:46:23] I would like to do a revert of a deploy I did earlier in the day because it's flagging... since the train isn't happening is there any objections if I go now? 
[19:47:03] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for fasw2-e16-eqiad pair - cmooney@cumin1003" [19:47:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for fasw2-e16-eqiad pair - cmooney@cumin1003" [19:47:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:47:12] It shouldn't interfere with any backports to mediawiki as it's a service. [19:48:26] Mvolz: you can go ahead :) [19:48:35] thanks! [19:50:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P88672 and previous config saved to /var/cache/conftool/dbconfig/20260204-195004-marostegui.json [19:50:37] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [19:50:55] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [19:51:18] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [19:51:43] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [19:51:47] (03PS1) 10Cathal Mooney: Update public ssh key for user astein [homer/public] - 10https://gerrit.wikimedia.org/r/1236814 (https://phabricator.wikimedia.org/T415345) [19:52:02] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [19:52:21] (03PS1) 10Dzahn: zuul: use chained certificate incl CA for zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/1236815 (https://phabricator.wikimedia.org/T405119) [19:52:33] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [19:53:06] (03CR) 10Dzahn: "i think you linked to the wrong ticket by accident" [homer/public] - 10https://gerrit.wikimedia.org/r/1236814 (https://phabricator.wikimedia.org/T415345) (owner: 
10Cathal Mooney) [19:54:21] (03PS2) 10Cathal Mooney: Update public ssh key for user astein [homer/public] - 10https://gerrit.wikimedia.org/r/1236814 (https://phabricator.wikimedia.org/T413826) [19:54:38] mutante: well spotted on the bad phab task id thanks! [19:54:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [19:54:44] Deployment zotero-production in zotero at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=zotero&var-deployment=zotero-production - ... [19:54:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [19:55:01] * swfrench-wmf thumbs up [19:55:09] (03CR) 10Dzahn: [C:03+1] "yea, this one matches what is in the ticket" [homer/public] - 10https://gerrit.wikimedia.org/r/1236814 (https://phabricator.wikimedia.org/T413826) (owner: 10Cathal Mooney) [19:55:17] topranks: np [19:55:17] thanks, M.volz! [19:57:55] (03CR) 10Cathal Mooney: [C:03+2] Update public ssh key for user astein [homer/public] - 10https://gerrit.wikimedia.org/r/1236814 (https://phabricator.wikimedia.org/T413826) (owner: 10Cathal Mooney) [19:58:14] all done! 
[19:59:10] (03Merged) 10jenkins-bot: Update public ssh key for user astein [homer/public] - 10https://gerrit.wikimedia.org/r/1236814 (https://phabricator.wikimedia.org/T413826) (owner: 10Cathal Mooney) [20:03:50] (03PS1) 10Bking: opensearch on k8s: enable blackbox probes for opensearch-test ns [puppet] - 10https://gerrit.wikimedia.org/r/1236818 (https://phabricator.wikimedia.org/T416345) [20:05:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T415786)', diff saved to https://phabricator.wikimedia.org/P88673 and previous config saved to /var/cache/conftool/dbconfig/20260204-200512-marostegui.json [20:05:18] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [20:05:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance [20:12:56] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1236815/7981/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1236815 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [20:52:23] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Applying upgrade to Java 11.0.30 - eevans@cumin1003 [20:59:16] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T2100). 
[21:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:22] hi [21:01:16] my patch just defines some new localisation messages, which i want to use on-wiki; nothing to test on mwdebug [21:04:35] MatmaRex: That will be a slow deployment due to the l10n changes but it looks like you have the window to yourself! [21:05:06] yeah. could you be so nice and ship it? :) i don't have access to do it myself [21:05:16] Sure [21:05:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236811 (https://phabricator.wikimedia.org/T415588) (owner: 10Bartosz Dziewoński) [21:06:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236812 (https://phabricator.wikimedia.org/T415588) (owner: 10Bartosz Dziewoński) [21:06:00] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Applying upgrade to Java 11.0.30 — T416492 - eevans@cumin1003 [21:06:03] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [21:06:24] !log restart Cassandra to apply Java 11.0.30 upgrade, restbase/codfw — T416492 [21:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:05] (03Merged) 10jenkins-bot: Add messages for 'local-bot' global group [extensions/WikimediaMessages] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236811 (https://phabricator.wikimedia.org/T415588) (owner: 10Bartosz Dziewoński) [21:07:14] (03Merged) 10jenkins-bot: Add messages for 'local-bot' global group [extensions/WikimediaMessages] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236812 
(https://phabricator.wikimedia.org/T415588) (owner: 10Bartosz Dziewoński) [21:07:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11585115 (10RobH) [21:07:49] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1236811|Add messages for 'local-bot' global group (T415588)]], [[gerrit:1236812|Add messages for 'local-bot' global group (T415588)]] [21:07:53] T415588: Add rate limit class for accounts that are in a local bot group on any wiki - https://phabricator.wikimedia.org/T415588 [21:09:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11585124 (10RobH) [21:09:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11585125 (10RobH) [21:27:49] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11585160 (10RobH) Ok, Jenn checked inside the R450 and it is indeed a stand alone power distro board. {F71676229} {F71676231} John might have two hosts abandoned from T342455 (he is checking) and... [21:34:13] !log dancy@deploy2002 matmarex, dancy: Backport for [[gerrit:1236811|Add messages for 'local-bot' global group (T415588)]], [[gerrit:1236812|Add messages for 'local-bot' global group (T415588)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:34:16] T415588: Add rate limit class for accounts that are in a local bot group on any wiki - https://phabricator.wikimedia.org/T415588 [21:34:32] (03CR) 10Ryan Kemper: [C:03+1] data-platform: Add site tags, don't enforce linting on probe series [alerts] - 10https://gerrit.wikimedia.org/r/1236766 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [21:34:41] !log dancy@deploy2002 matmarex, dancy: Continuing with sync [21:40:25] (03CR) 10Ryan Kemper: [C:03+1] "I'm fine merging this version with the pint disables to stop the noise; I vote we test a followup patch that removes just the pint part af" [alerts] - 10https://gerrit.wikimedia.org/r/1236766 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [21:41:21] (03CR) 10Bking: [C:03+2] data-platform: Add site tags, don't enforce linting on probe series [alerts] - 10https://gerrit.wikimedia.org/r/1236766 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [21:47:49] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236811|Add messages for 'local-bot' global group (T415588)]], [[gerrit:1236812|Add messages for 'local-bot' global group (T415588)]] (duration: 40m 00s) [21:47:52] T415588: Add rate limit class for accounts that are in a local bot group on any wiki - https://phabricator.wikimedia.org/T415588 [21:48:27] I'll deploy one more patch [21:49:31] thanks dancy! 
[21:49:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235551 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [21:49:38] yw [21:50:26] (03Merged) 10jenkins-bot: Migrate EmailAuth config, step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235551 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [21:51:02] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1235551|Migrate EmailAuth config, step 1 (T404334)]] [21:51:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance [21:55:17] !log tgr@deploy2002 tgr: Backport for [[gerrit:1235551|Migrate EmailAuth config, step 1 (T404334)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:56:21] !log tgr@deploy2002 tgr: Continuing with sync [21:57:57] (03CR) 10Ryan Kemper: [C:03+1] "Looks good; we've confirmed in previous patches that the production line is necessary for blackbox checks, and for non-lvs-backed services" [puppet] - 10https://gerrit.wikimedia.org/r/1236818 (https://phabricator.wikimedia.org/T416345) (owner: 10Bking) [22:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T2200) [22:02:29] (03CR) 10Bking: [C:03+2] opensearch on k8s: enable blackbox probes for opensearch-test ns [puppet] - 10https://gerrit.wikimedia.org/r/1236818 (https://phabricator.wikimedia.org/T416345) (owner: 10Bking) [22:02:30] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1235551|Migrate EmailAuth config, step 1 (T404334)]] (duration: 11m 28s) [22:03:36] (03PS1) 10Bartosz Dziewoński: Remove unused 'editor' right from plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236843 [22:03:38] !log late UTC deploys done [22:03:39] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:30] (03PS2) 10Majavah: aptrepo: Drop packages for Kubeadm/1.30 [puppet] - 10https://gerrit.wikimedia.org/r/1226878 (https://phabricator.wikimedia.org/T372697) [22:06:30] (03PS2) 10Majavah: aptrepo: Import packages for Kubeadm/1.32 [puppet] - 10https://gerrit.wikimedia.org/r/1226879 (https://phabricator.wikimedia.org/T379047) [22:07:08] (03CR) 10CI reject: [V:04-1] aptrepo: Import packages for Kubeadm/1.32 [puppet] - 10https://gerrit.wikimedia.org/r/1226879 (https://phabricator.wikimedia.org/T379047) (owner: 10Majavah) [22:08:33] (03PS3) 10Majavah: aptrepo: Drop packages for Kubeadm/1.30 [puppet] - 10https://gerrit.wikimedia.org/r/1226878 (https://phabricator.wikimedia.org/T372697) [22:08:33] (03PS3) 10Majavah: aptrepo: Import packages for Kubeadm/1.32 [puppet] - 10https://gerrit.wikimedia.org/r/1226879 (https://phabricator.wikimedia.org/T379047) [22:09:33] (03CR) 10Majavah: [C:03+2] aptrepo: Drop packages for Kubeadm/1.30 [puppet] - 10https://gerrit.wikimedia.org/r/1226878 (https://phabricator.wikimedia.org/T372697) (owner: 10Majavah) [22:09:43] (03CR) 10Majavah: [C:03+2] aptrepo: Import packages for Kubeadm/1.32 [puppet] - 10https://gerrit.wikimedia.org/r/1226879 (https://phabricator.wikimedia.org/T379047) (owner: 10Majavah) [22:15:50] (03PS1) 10Dzahn: zuul/zookeeper: debug (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1236845 [22:15:54] (03PS1) 10Ryan Kemper: opensearch-on-k8s: alphabetize ordering [puppet] - 10https://gerrit.wikimedia.org/r/1236846 [22:15:54] (03PS1) 10Ryan Kemper: opensearch-semantic-search[-test]: add config [puppet] - 10https://gerrit.wikimedia.org/r/1236847 (https://phabricator.wikimedia.org/T414691) [22:17:18] (03CR) 10Bking: [C:03+2] opensearch-semantic-search[-test]: add config [puppet] - 10https://gerrit.wikimedia.org/r/1236847 (https://phabricator.wikimedia.org/T414691) (owner: 10Ryan Kemper) [22:17:45] (03CR) 10Bking: [C:03+2] opensearch-on-k8s: alphabetize 
ordering [puppet] - 10https://gerrit.wikimedia.org/r/1236846 (owner: 10Ryan Kemper) [22:20:12] (03PS1) 10Ryan Kemper: data-platform: Remove unnecessary pint disable directives [alerts] - 10https://gerrit.wikimedia.org/r/1236848 (https://phabricator.wikimedia.org/T412447) [22:26:29] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [22:27:08] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [22:27:36] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [22:28:17] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [22:34:32] (03CR) 10Mszwarc: [C:03+1] Remove unused 'editor' right from plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236843 (owner: 10Bartosz Dziewoński) [22:37:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11585490 (10RKemper) Unsurprisingly there's some post-swap steps for us (DPE SRE) to resolve: `There are offline or missing virtual drives with preserved... [22:40:04] RECOVERY - Host an-worker1187 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [22:40:20] PROBLEM - SSH on an-worker1187 is CRITICAL: connect to address 10.64.138.10 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:51:56] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062#11585499 (10Reedy) >>! In T416062#11582392, @MoritzMuehlenhoff wrote: > @Reedy Due to an oversight "Bitu-account-managers" was only requesteable on the test instance for Bitu. This has... 
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260204T2300) [23:01:36] (03CR) 10Majavah: [C:03+1] data-platform: Remove unnecessary pint disable directives [alerts] - 10https://gerrit.wikimedia.org/r/1236848 (https://phabricator.wikimedia.org/T412447) (owner: 10Ryan Kemper) [23:04:08] (03PS1) 10Ryan Kemper: wdqs: detune BlazegraphFailedServerRatioIncrease [alerts] - 10https://gerrit.wikimedia.org/r/1236852 (https://phabricator.wikimedia.org/T414306) [23:05:03] (03PS2) 10Ryan Kemper: data-platform: Remove vestigial pint disable directives [alerts] - 10https://gerrit.wikimedia.org/r/1236848 (https://phabricator.wikimedia.org/T412447) [23:05:44] (03PS3) 10Ryan Kemper: data-platform: Remove vestigial pint disable directives [alerts] - 10https://gerrit.wikimedia.org/r/1236848 (https://phabricator.wikimedia.org/T412447) [23:06:48] (03PS2) 10Bartosz Dziewoński: Remove unused 'editor' right from plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236843 [23:10:54] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Applying upgrade to Java 11.0.30 — T416492 - eevans@cumin1003 [23:10:58] T416492: Cassandra restarts for Java 11.0.30 security update - https://phabricator.wikimedia.org/T416492 [23:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [23:15:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance [23:16:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T415786)', diff saved to https://phabricator.wikimedia.org/P88674 and previous config saved to 
/var/cache/conftool/dbconfig/20260204-231600-marostegui.json [23:16:04] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [23:16:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538 (10RobH) 03NEW [23:16:40] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#11585559 (10RobH) a:03Eevans @eevans, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operati... [23:29:16] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:30:25] (03CR) 10Ryan Kemper: [C:03+2] data-platform: Remove vestigial pint disable directives [alerts] - 10https://gerrit.wikimedia.org/r/1236848 (https://phabricator.wikimedia.org/T412447) (owner: 10Ryan Kemper) [23:31:01] (03PS1) 10Santiago Faci: Renaming `MetricsPlatform` => `TestKitchen` [extensions/ReadingLists] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236854 (https://phabricator.wikimedia.org/T414435) [23:31:39] (03Merged) 10jenkins-bot: data-platform: Remove vestigial pint disable directives [alerts] - 10https://gerrit.wikimedia.org/r/1236848 (https://phabricator.wikimedia.org/T412447) (owner: 10Ryan Kemper) [23:41:34] (03PS3) 10Jasmine: wikikube: decommission worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) [23:42:43] (03CR) 10Jasmine: wikikube: decommission worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) (owner: 10Jasmine) [23:44:30] (03CR) 10Jasmine: wikikube: decommission wikikube-worker[2116-2123,2216-2241].codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104) (owner: 10Jasmine) [23:45:00] (03CR) 10Jasmine: wikikube: decommission wikikube-worker[2116-2123,2216-2241].codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104) (owner: 10Jasmine) [23:45:41] (03PS2) 10Jasmine: wikikube: decommission wikikube-worker[2116-2123,2216-2241].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104) [23:45:52] (03PS1) 10Santiago Faci: readingListAB.js: Updated to use mw.testKitchen [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1236856 (https://phabricator.wikimedia.org/T414435)