[00:04:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920950 (10Jclark-ctr) [00:05:48] (03PS1) 10Dzahn: prometheus: structure for misc checks per SRE subteam (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1159612 [00:06:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920952 (10Jclark-ctr) @VRiley-WMF i was able to get an-worker1185 to pass by starting over with the server I used the retire server option in... [00:06:14] (03CR) 10CI reject: [V:04-1] prometheus: structure for misc checks per SRE subteam (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1159612 (owner: 10Dzahn) [00:10:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1159613 [00:10:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1159613 (owner: 10TrainBranchBot) [00:13:28] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:13:46] jhancock@cumin2002 provision (PID 763330) is awaiting input [00:17:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:24:39] (03CR) 10Ssingh: [C:03+1] Revert^2 "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159592 (owner: 10BCornwall) [00:29:52] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1159613 (owner: 10TrainBranchBot) [00:30:51] !log 
jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2005'] [00:31:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest2005'] [00:32:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [00:32:50] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10920977 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2005.codfw.wmnet with OS bookworm [00:34:04] (03CR) 10Ssingh: [C:03+1] hiera: Issue a separate LE cert for upload cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159510 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [00:34:33] (03CR) 10Ssingh: [C:03+1] hiera: Issue a separate GTS cert for upload cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159511 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [00:46:38] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/98f4d8fda8cd53caa560fb8f7e1f626fe0dc53dfe314fefdb975fc54791409f1/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:56:03] jhancock@cumin2002 reimage (PID 797016) is awaiting input [00:56:21] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-eqsin and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [00:56:25] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [01:06:38] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:08:07] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.6 [core] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1159621 (https://phabricator.wikimedia.org/T392176) [01:08:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.6 [core] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1159621 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [01:16:28] RECOVERY - dump of x3 in codfw on backupmon1001 is OK: Last dump for x3 at codfw (db2200) taken on 2025-06-17 00:44:31 (35 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:20:35] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.6 [core] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1159621 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [01:30:40] (03PS1) 10Mimurawil: Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) [01:50:44] RECOVERY - dump of x3 in eqiad on backupmon1001 is OK: Last dump for x3 at eqiad (db1216) taken on 2025-06-17 00:56:25 (35 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:56:22] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp70[02-16].magru.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [01:56:26] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T0200) [02:14:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [02:14:44] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10921107 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2005.codfw.wmnet with OS bookworm executed with errors: - sretest2005 (... [02:21:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:25:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:27:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [02:27:12] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10921117 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2005.codfw.wmnet with OS bookworm [02:44:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [02:45:05] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10921130 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2005.codfw.wmnet with OS bookworm executed with errors: - sretest2005 (... 
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T0300) [03:01:54] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159637 (https://phabricator.wikimedia.org/T392176) [03:01:56] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159637 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [03:02:49] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159637 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [03:03:12] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.6 refs T392176 [03:03:16] T392176: 1.45.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T392176 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T0400) [04:13:28] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:47:56] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.6 refs T392176 (duration: 104m 43s) [04:48:00] T392176: 1.45.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T392176 [04:49:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1193.eqiad.wmnet with reason: Maintenance [04:52:27] (03PS1) 10Marostegui: db1197: Migrate to MariaDB 10.11 [puppet] - 
10https://gerrit.wikimedia.org/r/1159672 (https://phabricator.wikimedia.org/T396549) [04:53:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1197', diff saved to https://phabricator.wikimedia.org/P78095 and previous config saved to /var/cache/conftool/dbconfig/20250617-045351-marostegui.json [04:54:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1197.eqiad.wmnet with reason: Maintenance [04:55:32] (03CR) 10Marostegui: [C:03+2] db1197: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159672 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [05:01:52] PROBLEM - librenms.wikimedia.org tls expiry on netmon1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:02:28] PROBLEM - librenms.wikimedia.org requires authentication on netmon1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:02:40] PROBLEM - SSH on netmon1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:03:48] FIRING: PuppetFailure: Puppet has failed on testreduce1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:04:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78096 and previous config saved to /var/cache/conftool/dbconfig/20250617-050405-root.json [05:05:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [05:05:34] RECOVERY - SSH on netmon1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:05:42] FIRING: JobUnavailable: 
Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:06:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1027', diff saved to https://phabricator.wikimedia.org/P78097 and previous config saved to /var/cache/conftool/dbconfig/20250617-050618-marostegui.json [05:06:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1027.eqiad.wmnet with reason: Maintenance [05:08:18] RECOVERY - librenms.wikimedia.org requires authentication on netmon1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:08:40] PROBLEM - SSH on netmon1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:10:42] FIRING: [9x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:28] PROBLEM - librenms.wikimedia.org requires authentication on netmon1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:12:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2027 to es3 codfw master', diff saved to https://phabricator.wikimedia.org/P78098 and previous config saved to /var/cache/conftool/dbconfig/20250617-051212-root.json [05:12:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2029', diff saved to https://phabricator.wikimedia.org/P78099 and previous config saved to /var/cache/conftool/dbconfig/20250617-051231-marostegui.json [05:13:48] FIRING: [2x] 
PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:14:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2029.codfw.wmnet with reason: Maintenance [05:15:42] FIRING: [9x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:16:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78100 and previous config saved to /var/cache/conftool/dbconfig/20250617-051612-root.json [05:18:08] (03PS1) 10KartikMistry: WIP: machinetranslation: Use s3 storage for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159696 (https://phabricator.wikimedia.org/T335491) [05:19:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78101 and previous config saved to /var/cache/conftool/dbconfig/20250617-051911-root.json [05:19:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78102 and previous config saved to /var/cache/conftool/dbconfig/20250617-051927-root.json [05:21:46] 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX), 13Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10921280 (10KartikMistry) @klausman is there any reason why we can't see following in the diff in staging?... 
[05:23:18] RECOVERY - librenms.wikimedia.org requires authentication on netmon1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:23:30] RECOVERY - SSH on netmon1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:23:42] RECOVERY - librenms.wikimedia.org tls expiry on netmon1003 is OK: OK - Certificate librenms.wikimedia.org will expire on Thu 17 Jul 2025 05:41:05 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:25:42] RESOLVED: [9x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:26:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance [05:26:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T396130)', diff saved to https://phabricator.wikimedia.org/P78103 and previous config saved to /var/cache/conftool/dbconfig/20250617-052639-marostegui.json [05:26:44] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:28:28] FIRING: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:31:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P78104 and previous config saved to /var/cache/conftool/dbconfig/20250617-053117-root.json [05:34:17] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78105 and previous config saved to /var/cache/conftool/dbconfig/20250617-053416-root.json [05:34:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78106 and previous config saved to /var/cache/conftool/dbconfig/20250617-053433-root.json [05:35:02] (03PS1) 10Marostegui: db1207: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1159722 (https://phabricator.wikimedia.org/T396706) [05:35:18] (03CR) 10Marostegui: "Host is all green on icinga" [puppet] - 10https://gerrit.wikimedia.org/r/1159722 (https://phabricator.wikimedia.org/T396706) (owner: 10Marostegui) [05:35:30] (03CR) 10Marostegui: [C:03+2] db1207: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1159722 (https://phabricator.wikimedia.org/T396706) (owner: 10Marostegui) [05:39:56] (03PS1) 10Marostegui: mariadb: Promote db1207 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/1159727 (https://phabricator.wikimedia.org/T396706) [05:42:09] (03PS1) 10Marostegui: mariadb backups: Change m1 host [puppet] - 10https://gerrit.wikimedia.org/r/1159731 (https://phabricator.wikimedia.org/T396706) [05:46:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:46:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P78107 and previous config saved to /var/cache/conftool/dbconfig/20250617-054623-root.json [05:46:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group 
Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:46:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:49:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1197 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78108 and previous config saved to /var/cache/conftool/dbconfig/20250617-054922-root.json [05:49:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78109 and previous config saved to /var/cache/conftool/dbconfig/20250617-054938-root.json [05:50:12] (03PS2) 10KartikMistry: WIP: machinetranslation: Use s3 storage for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159696 (https://phabricator.wikimedia.org/T335491) [05:51:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:51:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:52:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T396130)', diff saved to https://phabricator.wikimedia.org/P78110 and previous config saved to /var/cache/conftool/dbconfig/20250617-055218-marostegui.json [05:52:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T0600) [06:00:05] marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T0600) [06:01:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78111 and previous config saved to /var/cache/conftool/dbconfig/20250617-060129-root.json [06:03:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1028', diff saved to https://phabricator.wikimedia.org/P78112 and previous config saved to /var/cache/conftool/dbconfig/20250617-060347-marostegui.json [06:04:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1028.eqiad.wmnet with reason: Maintenance [06:04:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78113 and previous config saved to /var/cache/conftool/dbconfig/20250617-060444-root.json [06:06:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1029', diff saved to https://phabricator.wikimedia.org/P78114 and previous config saved to /var/cache/conftool/dbconfig/20250617-060640-marostegui.json [06:06:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1029.eqiad.wmnet with 
reason: Maintenance [06:07:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P78115 and previous config saved to /var/cache/conftool/dbconfig/20250617-060725-marostegui.json [06:09:52] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: move soon-to-be-decommed hosts to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159517 (https://phabricator.wikimedia.org/T395855) (owner: 10Bking) [06:09:53] (03CR) 10Ryan Kemper: [C:03+2] cirrussearch: move soon-to-be-decommed hosts to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159517 (https://phabricator.wikimedia.org/T395855) (owner: 10Bking) [06:10:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78116 and previous config saved to /var/cache/conftool/dbconfig/20250617-061052-root.json [06:11:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:13:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78117 and previous config saved to /var/cache/conftool/dbconfig/20250617-061356-root.json [06:14:31] (03PS2) 10Stevemunene: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135049 (https://phabricator.wikimedia.org/T374922) [06:14:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1030', diff saved to https://phabricator.wikimedia.org/P78118 and previous config saved to /var/cache/conftool/dbconfig/20250617-061441-marostegui.json [06:14:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1:00:00 on es1030.eqiad.wmnet with reason: Maintenance [06:16:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78119 and previous config saved to /var/cache/conftool/dbconfig/20250617-061635-root.json [06:17:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and NTT (192.80.17.185) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [06:17:47] (03CR) 10Brouberol: [C:03+1] Add our legacy archiva instance to kubernetes external_services [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis) [06:18:19] (03CR) 10Brouberol: [C:03+1] Allow blunderbuss to contact archiva [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159579 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis) [06:21:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P78120 and previous config saved to /var/cache/conftool/dbconfig/20250617-062104-root.json [06:21:21] (03PS4) 10Stevemunene: zookeeper: remove an-conf100[1-3] from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) [06:22:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P78121 and previous config saved to /var/cache/conftool/dbconfig/20250617-062233-marostegui.json [06:25:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78122 and previous config saved to /var/cache/conftool/dbconfig/20250617-062558-root.json [06:28:53] (03CR) 10Muehlenhoff: [C:03+2] Add ncredir7004 to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1156815 
(https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [06:29:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78123 and previous config saved to /var/cache/conftool/dbconfig/20250617-062902-root.json [06:30:08] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [06:30:30] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:30:42] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:31:20] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:31:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54082 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:32:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and NTT (192.80.17.185) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [06:34:18] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [06:36:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78124 and previous config saved to /var/cache/conftool/dbconfig/20250617-063610-root.json [06:36:24] (03Abandoned) 10Stevemunene: hdfs: replace an-conf100[1-3] with an-conf100[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/1135031 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [06:37:41] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2145 (T396130)', diff saved to https://phabricator.wikimedia.org/P78125 and previous config saved to /var/cache/conftool/dbconfig/20250617-063740-marostegui.json [06:37:45] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:37:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance [06:38:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T396130)', diff saved to https://phabricator.wikimedia.org/P78126 and previous config saved to /var/cache/conftool/dbconfig/20250617-063803-marostegui.json [06:39:38] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd [06:40:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10921416 (10ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet switching disk type to drbd [06:41:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78127 and previous config saved to /var/cache/conftool/dbconfig/20250617-064104-root.json [06:43:51] !log jmm@puppetserver1001 conftool action : set/weight=1; selector: name=ncredir7004.magru.wmnet [06:44:04] !log jmm@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir7004.magru.wmnet [06:44:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78128 and previous config saved to /var/cache/conftool/dbconfig/20250617-064408-root.json [06:44:55] (03CR) 10Kosta Harlan: [C:04-1] Configure instrument for CheckUser - UserInfoCard (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 
(https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [06:46:30] (03PS1) 10Muehlenhoff: Add doh7004/durum7004 [puppet] - 10https://gerrit.wikimedia.org/r/1159916 (https://phabricator.wikimedia.org/T394263) [06:51:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78129 and previous config saved to /var/cache/conftool/dbconfig/20250617-065115-root.json [06:56:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78130 and previous config saved to /var/cache/conftool/dbconfig/20250617-065610-root.json [06:56:48] (03PS1) 10Stevemunene: superset:pull kerberos server values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159926 (https://phabricator.wikimedia.org/T395412) [06:59:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78131 and previous config saved to /var/cache/conftool/dbconfig/20250617-065914-root.json [06:59:22] (03PS3) 10KartikMistry: Enable the Contribute menu (6th group) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) [06:59:58] PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [07:00:04] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T0700). Please do the needful. [07:00:04] kart_ and Tchanders: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:10] hi hi [07:00:12] i can deploy today [07:00:12] o/ [07:00:16] hey Tchanders! 
[07:00:35] !log jmm@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir7002.magru.wmnet [07:00:35] kart_: around? [07:00:45] yes [07:01:03] (03CR) 10Urbanecm: [C:03+2] Enable the Contribute menu (6th group) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: 10KartikMistry) [07:01:11] urbanecm: Can I try spiderpig interface? :) [07:01:18] oh. +2ed :) [07:01:18] kart_: okay :) [07:01:25] (03CR) 10Urbanecm: Enable the Contribute menu (6th group) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: 10KartikMistry) [07:01:29] removed, go ahead :)) [07:01:34] ah :) [07:02:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: 10KartikMistry) [07:02:35] (03PS1) 10Urbanecm: feat(LevelingUp): Measure the delay between actual and intended notification timestamp [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159927 (https://phabricator.wikimedia.org/T395260) [07:02:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to drbd [07:02:49] (03Merged) 10jenkins-bot: Enable the Contribute menu (6th group) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: 10KartikMistry) [07:03:29] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1152558|Enable the Contribute menu (6th group) (T380930)]] [07:03:33] T380930: Enable the Contribute menu in 6th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T380930 [07:03:34] RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms [07:03:35] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2146 (T396130)', diff saved to https://phabricator.wikimedia.org/P78132 and previous config saved to /var/cache/conftool/dbconfig/20250617-070334-marostegui.json [07:03:39] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:06:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78133 and previous config saved to /var/cache/conftool/dbconfig/20250617-070621-root.json [07:08:12] !log kartik@deploy1003 kartik: Backport for [[gerrit:1152558|Enable the Contribute menu (6th group) (T380930)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:09:28] !log kartik@deploy1003 kartik: Continuing with sync [07:10:05] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1159550 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [07:10:49] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [07:11:23] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [07:11:43] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1159550 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [07:11:48] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain [07:11:59] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [07:12:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10921438 (10ops-monitoring-bot) VM aux-k8s-etcd2003.codfw.wmnet switching disk type to plain [07:12:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2003.codfw.wmnet to plain [07:18:19] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152558|Enable the Contribute menu (6th group) (T380930)]] (duration: 14m 49s) [07:18:23] T380930: Enable the Contribute menu in 6th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T380930 [07:18:32] (03PS1) 10Muehlenhoff: Remove ganeti2023/ganeti2024 as Ganeti servers [puppet] - 10https://gerrit.wikimedia.org/r/1159937 (https://phabricator.wikimedia.org/T396590) [07:18:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P78134 and previous config saved to /var/cache/conftool/dbconfig/20250617-071842-marostegui.json [07:20:11] urbanecm: done! [07:20:23] Tchanders: wanna deploy yourself, or should i? [07:20:38] urbanecm: I'm happy to do it! [07:20:48] I'll do the temp accounts one, but leave the other one [07:20:52] sounds good! 
:) [07:20:56] (03PS1) 10Stevemunene: spark history:pull kerberos host values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159940 (https://phabricator.wikimedia.org/T395412) [07:21:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78135 and previous config saved to /var/cache/conftool/dbconfig/20250617-072127-root.json [07:21:38] (03CR) 10Brouberol: [C:03+1] "spot on" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159940 (https://phabricator.wikimedia.org/T395412) (owner: 10Stevemunene) [07:21:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155683 (https://phabricator.wikimedia.org/T396464) (owner: 10Tchanders) [07:22:31] (03Merged) 10jenkins-bot: temp accounts: Enable temp account creation on three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155683 (https://phabricator.wikimedia.org/T396464) (owner: 10Tchanders) [07:22:48] (03CR) 10Jcrespo: [C:03+1] mariadb backups: Change m1 host [puppet] - 10https://gerrit.wikimedia.org/r/1159731 (https://phabricator.wikimedia.org/T396706) (owner: 10Marostegui) [07:22:54] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1155683|temp accounts: Enable temp account creation on three wikis (T396464)]] [07:22:59] T396464: Temp Accounts: 17 June, 2025 deployment - https://phabricator.wikimedia.org/T396464 [07:25:21] !log tchanders@deploy1003 tchanders: Backport for [[gerrit:1155683|temp accounts: Enable temp account creation on three wikis (T396464)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:25:28] I'll start testing [07:26:06] (03CR) 10Muehlenhoff: [C:03+2] "There is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155720 for the removal of the role" [puppet] - 10https://gerrit.wikimedia.org/r/1159293 (https://phabricator.wikimedia.org/T395557) (owner: 10Muehlenhoff) [07:26:51] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner2003.codfw.wmnet with OS bookworm [07:28:11] (03CR) 10Muehlenhoff: [C:03+2] Add doh7004/durum7004 [puppet] - 10https://gerrit.wikimedia.org/r/1159916 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [07:28:18] Tchanders: lgtm [07:33:11] lgtm too - continuing [07:33:18] !log tchanders@deploy1003 tchanders: Continuing with sync [07:33:33] (03PS1) 10Muehlenhoff: Remove ncredir7002 [puppet] - 10https://gerrit.wikimedia.org/r/1159983 (https://phabricator.wikimedia.org/T394263) [07:33:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P78136 and previous config saved to /var/cache/conftool/dbconfig/20250617-073350-marostegui.json [07:34:02] !log installing python3.11 security updates [07:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:28] (03PS1) 10Tchanders: Assign global IP viewer to stewards, to avoid log spam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159987 (https://phabricator.wikimedia.org/T376315) [07:38:34] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: activate store memcached across the board [puppet] - 10https://gerrit.wikimedia.org/r/1156343 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [07:40:27] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155683|temp accounts: Enable temp account creation on three wikis (T396464)]] (duration: 17m 32s) [07:40:31] T396464: Temp Accounts: 17 June, 2025 deployment - https://phabricator.wikimedia.org/T396464 [07:44:52] !log jelto@cumin1002 
START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2003.codfw.wmnet with reason: host reimage [07:46:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:48:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2003.codfw.wmnet with reason: host reimage [07:48:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T396130)', diff saved to https://phabricator.wikimedia.org/P78137 and previous config saved to /var/cache/conftool/dbconfig/20250617-074857-marostegui.json [07:49:02] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:49:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [07:49:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T396130)', diff saved to https://phabricator.wikimedia.org/P78138 and previous config saved to /var/cache/conftool/dbconfig/20250617-074920-marostegui.json [07:49:25] Tchanders: do you have anything else to deploy, or can i go now? 
[07:50:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:51:20] (03CR) 10Majavah: [C:03+1] Rename cloud-in to cloud-vrf-in (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 (owner: 10Cathal Mooney) [07:51:29] prometheus data refuses to connect for me [07:51:54] ^see federico3 [07:51:55] (03CR) 10Urbanecm: [C:03+2] feat(LevelingUp): Measure the delay between actual and intended notification timestamp [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159927 (https://phabricator.wikimedia.org/T395260) (owner: 10Urbanecm) [07:52:04] (not asking to do anything, just FYI) [07:52:56] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [07:53:04] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:53:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.539s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:53:39] jynus federico3 checking too, I merged a thanos patch [07:53:51] (03Merged) 10jenkins-bot: feat(LevelingUp): Measure the delay between actual and intended notification timestamp [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159927 
(https://phabricator.wikimedia.org/T395260) (owner: 10Urbanecm) [07:54:10] godog: thanks [07:54:28] (03PS1) 10Fabfur: Placeholder [puppet] - 10https://gerrit.wikimedia.org/r/1159995 [07:54:30] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1159927|feat(LevelingUp): Measure the delay between actual and intended notification timestamp (T395260)]] [07:54:31] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:54:35] T395260: Measure the difference between the intended and actual execution of delayed notification jobs - https://phabricator.wikimedia.org/T395260 [07:54:53] godog: thanks yeah seems to be a thanos thing [07:55:09] working for me again now [07:55:42] ok yes that was totally me, I apologise and we're back now [07:55:52] I accidentally too many titan hosts [07:55:52] np, thanks for the quick fix! [07:55:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:56:54] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1159927|feat(LevelingUp): Measure the delay between actual and intended notification timestamp (T395260)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:56:59] is the spike of NELs cominb back too (aka, they were from grafana?) 
[07:57:16] it seems so [07:57:52] !log urbanecm@deploy1003 urbanecm: Continuing with sync [07:58:04] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:58:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.539s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:58:28] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:00:08] (03CR) 10Filippo Giunchedi: [C:03+2] bird: remove check_anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi) [08:04:15] !log installing mariadb security updates (as shipped in Debian, not the wmf-mariadb packages we use for the main mariadb clusters) [08:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:53] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159927|feat(LevelingUp): Measure the delay between actual and intended notification timestamp (T395260)]] (duration: 10m 22s) [08:04:57] T395260: Measure the difference between the intended and actual execution of delayed notification jobs - https://phabricator.wikimedia.org/T395260 [08:05:18] (03CR) 10Ayounsi: [C:03+1] Remove ncredir7002 [puppet] - 10https://gerrit.wikimedia.org/r/1159983 (https://phabricator.wikimedia.org/T394263) (owner: 
10Muehlenhoff) [08:07:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2003.codfw.wmnet with OS bookworm [08:09:13] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner2004.codfw.wmnet with OS bookworm [08:09:51] (03Abandoned) 10Tiziano Fogli: monitoring services: add migration task T374842 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155142 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [08:13:28] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:38] (03CR) 10Cathal Mooney: Rename cloud-in to cloud-vrf-in (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 (owner: 10Cathal Mooney) [08:14:43] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: move blackbox-exporter to log prober errors [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi) [08:14:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T396130)', diff saved to https://phabricator.wikimedia.org/P78139 and previous config saved to /var/cache/conftool/dbconfig/20250617-081443-marostegui.json [08:14:48] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:21:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye [08:21:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10921635 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2007.codfw.wmnet 
with OS bul... [08:27:30] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [08:29:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P78140 and previous config saved to /var/cache/conftool/dbconfig/20250617-082951-marostegui.json [08:31:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and NTT (192.80.17.185) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:33:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [08:34:40] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10921678 (10MoritzMuehlenhoff) [08:35:17] 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX), 13Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10921689 (10klausman) >>! In T335491#10921280, @KartikMistry wrote: > @klausman is there any reason why we c... 
[08:37:43] (03CR) 10Muehlenhoff: [C:03+2] Remove ncredir7002 [puppet] - 10https://gerrit.wikimedia.org/r/1159983 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:40:32] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ncredir7002.magru.wmnet [08:41:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10921733 (10MoritzMuehlenhoff) [08:42:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10921737 (10MoritzMuehlenhoff) [08:42:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage [08:44:51] (03CR) 10Stevemunene: [C:03+2] spark history:pull kerberos host values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159940 (https://phabricator.wikimedia.org/T395412) (owner: 10Stevemunene) [08:44:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P78141 and previous config saved to /var/cache/conftool/dbconfig/20250617-084458-marostegui.json [08:45:17] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [08:45:50] 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX), 13Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10921755 (10klausman) Found it: the secrets were not wired up for staging because I had a brain fart when se... 
[08:46:11] (03PS1) 10Klausman: hiera: Add pseudosecrets for MT Thanos-Swift access also for staging [labs/private] - 10https://gerrit.wikimedia.org/r/1160032 [08:46:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage [08:46:17] (03CR) 10Klausman: [V:03+2 C:03+2] hiera: Add pseudosecrets for MT Thanos-Swift access also for staging [labs/private] - 10https://gerrit.wikimedia.org/r/1160032 (owner: 10Klausman) [08:46:32] (03Merged) 10jenkins-bot: spark history:pull kerberos host values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159940 (https://phabricator.wikimedia.org/T395412) (owner: 10Stevemunene) [08:51:23] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7002.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [08:52:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2004.codfw.wmnet with OS bookworm [08:53:12] (03CR) 10Vgutierrez: [C:03+2] hiera: Issue a separate LE cert for upload cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159510 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [08:53:39] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7002.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [08:53:39] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:53:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7002.magru.wmnet [08:53:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10921797 (10ops-monitoring-bot) 
cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `ncredir7002.magru.wmnet` - ncredir7002.magru.wmnet (**PASS**... [09:00:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T396130)', diff saved to https://phabricator.wikimedia.org/P78142 and previous config saved to /var/cache/conftool/dbconfig/20250617-090006-marostegui.json [09:00:11] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:00:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [09:00:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T396130)', diff saved to https://phabricator.wikimedia.org/P78143 and previous config saved to /var/cache/conftool/dbconfig/20250617-090028-marostegui.json [09:05:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [09:05:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:05:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T395241)', diff saved to https://phabricator.wikimedia.org/P78144 and previous config saved to /var/cache/conftool/dbconfig/20250617-090551-fceratto.json [09:06:17] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [09:06:25] (03CR) 10Elukey: "Left a couple of nits, lemme know it they make sense or not :)" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [09:06:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between 
cr1-eqiad and NTT (192.80.17.185) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:06:53] (03CR) 10Kosta Harlan: [C:03+1] temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155684 (https://phabricator.wikimedia.org/T396465) (owner: 10Tchanders) [09:07:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [09:07:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2007.codfw.wmnet with OS bullseye [09:07:41] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10921878 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2007.codfw.wmnet with OS bullsey... 
[09:13:48] FIRING: [2x] PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:14:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T395241)', diff saved to https://phabricator.wikimedia.org/P78145 and previous config saved to /var/cache/conftool/dbconfig/20250617-091404-fceratto.json [09:24:54] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: enable tracing for rule [puppet] - 10https://gerrit.wikimedia.org/r/1159355 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [09:25:02] (03PS2) 10Filippo Giunchedi: thanos: enable tracing for rule [puppet] - 10https://gerrit.wikimedia.org/r/1159355 (https://phabricator.wikimedia.org/T394318) [09:25:08] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] thanos: enable tracing for rule [puppet] - 10https://gerrit.wikimedia.org/r/1159355 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [09:25:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T396130)', diff saved to https://phabricator.wikimedia.org/P78146 and previous config saved to /var/cache/conftool/dbconfig/20250617-092538-marostegui.json [09:25:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:27:40] (03PS1) 10Vgutierrez: liberica: Spawn daemons using eBPF as root [puppet] - 10https://gerrit.wikimedia.org/r/1160038 (https://phabricator.wikimedia.org/T397053) [09:28:28] FIRING: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:29:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P78147 and previous config saved to /var/cache/conftool/dbconfig/20250617-092912-fceratto.json [09:29:55] (03PS1) 10Brouberol: airflow-analytics-test: enable scheduler -> KDC egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160040 (https://phabricator.wikimedia.org/T369845) [09:30:41] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading clouddbs T394372 [09:30:45] T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372 [09:32:29] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160038 (https://phabricator.wikimedia.org/T397053) (owner: 10Vgutierrez) [09:34:06] (03CR) 10FNegri: [C:03+2] clouddb1017: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154805 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [09:34:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:34:31] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:37:46] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: enforce series limit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [09:37:49] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10922024 (10Jgiannelos) I double checked: * eqiad uses 
` bucket = "tegola-swift-eqiad-v002" ` * codfw uses ` bucket = "te... [09:40:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P78148 and previous config saved to /var/cache/conftool/dbconfig/20250617-094045-marostegui.json [09:44:11] (03PS1) 10Urbanecm: fix: Gauge metrics use `::set` not `::observe` [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1160043 (https://phabricator.wikimedia.org/T397135) [09:44:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P78149 and previous config saved to /var/cache/conftool/dbconfig/20250617-094419-fceratto.json [09:44:20] (03PS1) 10Urbanecm: fix: Gauge metrics use `::set` not `::observe` [extensions/GrowthExperiments] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160044 (https://phabricator.wikimedia.org/T397135) [09:44:32] jouncebot: nowandnext [09:44:32] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [09:44:32] In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1000) [09:44:36] (03CR) 10Aqu: [C:03+1] "TY!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1160040 (https://phabricator.wikimedia.org/T369845) (owner: 10Brouberol) [09:44:39] (03CR) 10Urbanecm: [C:03+2] fix: Gauge metrics use `::set` not `::observe` [extensions/GrowthExperiments] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160044 (https://phabricator.wikimedia.org/T397135) (owner: 10Urbanecm) [09:44:42] (03CR) 10Urbanecm: [C:03+2] fix: Gauge metrics use `::set` not `::observe` [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1160043 (https://phabricator.wikimedia.org/T397135) (owner: 10Urbanecm) [09:44:56] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: enable scheduler -> KDC egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160040 (https://phabricator.wikimedia.org/T369845) (owner: 10Brouberol) [09:45:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1160043 (https://phabricator.wikimedia.org/T397135) (owner: 10Urbanecm) [09:45:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160044 (https://phabricator.wikimedia.org/T397135) (owner: 10Urbanecm) [09:46:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:46:47] (03PS1) 10Hnowlan: changeprop: increase concurrency on pcs_rerender_mobile_html_native [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160045 (https://phabricator.wikimedia.org/T397072) [09:50:01] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - 
https://phabricator.wikimedia.org/T396660#10922087 (10elukey) Firmware versions: ganeti1047 ` # BIOS 'Oem': {'Supermicro': {'@odata.type': '#SmcSoftwareInventoryExtensions.v1_0_0.SoftwareInventory', 'Un... [09:51:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:52:03] (03PS1) 10Filippo Giunchedi: team-o11y: alert on sidecar dropping queries [alerts] - 10https://gerrit.wikimedia.org/r/1160046 (https://phabricator.wikimedia.org/T394318) [09:53:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2232].codfw.wmnet,db[1207,1217,1250].eqiad.wmnet with reason: Primary switchover m1 T396706 [09:53:11] T396706: Switch m1 master db1250 -> db1207 - https://phabricator.wikimedia.org/T396706 [09:53:28] (03Merged) 10jenkins-bot: fix: Gauge metrics use `::set` not `::observe` [extensions/GrowthExperiments] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160044 (https://phabricator.wikimedia.org/T397135) (owner: 10Urbanecm) [09:53:32] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1207 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/1159727 (https://phabricator.wikimedia.org/T396706) (owner: 10Marostegui) [09:53:54] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet [09:55:18] (03PS1) 10Slyngshede: Makefile: force AMD64 as platform [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160047 [09:55:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P78150 and previous config saved to /var/cache/conftool/dbconfig/20250617-095553-marostegui.json [09:56:04] (03CR) 10Filippo Giunchedi: [C:03+2] team-o11y: alert on sidecar dropping queries [alerts] - 
10https://gerrit.wikimedia.org/r/1160046 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [09:57:22] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10922107 (10jijiki) @Jgiannelos thanks! [09:59:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T395241)', diff saved to https://phabricator.wikimedia.org/P78151 and previous config saved to /var/cache/conftool/dbconfig/20250617-095927-fceratto.json [09:59:30] (03Merged) 10jenkins-bot: fix: Gauge metrics use `::set` not `::observe` [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1160043 (https://phabricator.wikimedia.org/T397135) (owner: 10Urbanecm) [10:00:00] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1160043|fix: Gauge metrics use `::set` not `::observe` (T397135)]], [[gerrit:1160044|fix: Gauge metrics use `::set` not `::observe` (T397135)]] [10:00:04] T397135: Error: Call to undefined method Wikimedia\Stats\Metrics\GaugeMetric::observe() - https://phabricator.wikimedia.org/T397135 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1000) [10:02:29] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1160043|fix: Gauge metrics use `::set` not `::observe` (T397135)]], [[gerrit:1160044|fix: Gauge metrics use `::set` not `::observe` (T397135)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:03:06] !log Failover m1 from db1250 to db1207 - T396706 [10:03:10] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1160038 (https://phabricator.wikimedia.org/T397053) (owner: 10Vgutierrez) [10:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:10] T396706: Switch m1 master db1250 -> db1207 - https://phabricator.wikimedia.org/T396706 [10:04:15] Apologies, got interrupted by a real-life emergency. Will test and confirm soon. [10:04:28] (03CR) 10Marostegui: [C:03+2] mariadb backups: Change m1 host [puppet] - 10https://gerrit.wikimedia.org/r/1159731 (https://phabricator.wikimedia.org/T396706) (owner: 10Marostegui) [10:05:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1250.eqiad.wmnet with reason: Maintenance [10:06:59] (03CR) 10Jgiannelos: [C:03+1] changeprop: increase concurrency on pcs_rerender_mobile_html_native [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160045 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:07:08] (03PS4) 10Samwilson: InitialiseSettings: wgTemplateDataEnableDiscovery on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151831 (https://phabricator.wikimedia.org/T377975) [10:07:38] jouncebot: nowandnext [10:07:38] For the next 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1000) [10:07:38] In 1 hour(s) and 52 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1200) [10:07:47] !log marostegui@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1250.eqiad.wmnet [10:08:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10922156 (10Marostegui) [10:08:21] * TheresNoTime sees a deploy is ongoing [10:08:28] 
FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:34] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1250.eqiad.wmnet [10:08:49] (03CR) 10Hnowlan: [C:03+2] changeprop: increase concurrency on pcs_rerender_mobile_html_native [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160045 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:10:12] (03CR) 10Btullis: [C:03+2] Add our legacy archiva instance to kubernetes external_services [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis) [10:10:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10922157 (10Marostegui) I've tried on db1250: ` [10:06:34] marostegui@cumin1002:~$ sudo cookbook sre.hardware.upgrade-firmware -c ssd "db1250*" Acquired lock for key /s... [10:10:26] (03PS1) 10Jgiannelos: Revert "RB sunset: debug spike in changeprop events" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160048 [10:10:33] (03Merged) 10jenkins-bot: changeprop: increase concurrency on pcs_rerender_mobile_html_native [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160045 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:10:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10922161 (10Marostegui) @RobH you can proceed with the above host (db1250) at your own convenience. 
[10:10:58] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading clouddbs T394372 [10:10:59] (03CR) 10Vgutierrez: [C:03+2] liberica: Spawn daemons using eBPF as root [puppet] - 10https://gerrit.wikimedia.org/r/1160038 (https://phabricator.wikimedia.org/T397053) (owner: 10Vgutierrez) [10:11:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T396130)', diff saved to https://phabricator.wikimedia.org/P78152 and previous config saved to /var/cache/conftool/dbconfig/20250617-101100-marostegui.json [10:11:02] T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372 [10:11:06] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:11:12] (03CR) 10Filippo Giunchedi: "Can be abandoned now I think" [puppet] - 10https://gerrit.wikimedia.org/r/1105389 (https://phabricator.wikimedia.org/T368953) (owner: 10Herron) [10:11:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [10:11:17] (03CR) 10Filippo Giunchedi: "Can be abandoned now I think" [puppet] - 10https://gerrit.wikimedia.org/r/1105395 (https://phabricator.wikimedia.org/T368953) (owner: 10Herron) [10:11:17] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet [10:11:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T396130)', diff saved to https://phabricator.wikimedia.org/P78153 and previous config saved to /var/cache/conftool/dbconfig/20250617-101123-marostegui.json [10:11:26] (03PS1) 10Marostegui: db1250: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1160050 (https://phabricator.wikimedia.org/T396648) [10:11:29] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and 
apply it to Citoid - https://phabricator.wikimedia.org/T391852#10922168 (10Mvolz) >>! In T391852#10919212, @elukey wrote: > I am reopening this task since I assumed something about https://wikitech.wikimedia.o... [10:11:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10922173 (10Ladsgroup) That confused me too, you need to run the cookbook only from cumin200x, the file doesn't exist in cumin1002 [10:12:02] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10922175 (10elukey) While reviewing `/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` I noticed that we have a mixture of `schedutil` and `powersave`, it may be worth to test if gan... [10:13:11] !log urbanecm@deploy1003 urbanecm: Continuing with sync [10:13:30] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:13:39] (03CR) 10FNegri: [C:03+2] clouddb1014: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154806 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [10:13:39] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:14:17] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:14:47] (03CR) 10Ayounsi: [C:03+1] Makefile: force AMD64 as platform [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1160047 (owner: 10Slyngshede) [10:14:48] !log marostegui@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1250.eqiad.wmnet [10:14:54] !log marostegui@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1250.eqiad.wmnet [10:15:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:15:37] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply 
[10:15:47] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [10:15:58] !log marostegui@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1250.eqiad.wmnet [10:16:03] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:16:38] !log marostegui@cumin2002 START - Cookbook sre.hosts.reboot-single for host db1250.eqiad.wmnet [10:17:00] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [10:17:37] (03PS1) 10Hnowlan: changeprop: increase memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160052 (https://phabricator.wikimedia.org/T397072) [10:17:50] (03CR) 10Ladsgroup: [C:03+1] Undeploy VipsScaler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [10:18:23] (03PS2) 10Fabfur: cache,haproxy: remove old ipblock map files [puppet] - 10https://gerrit.wikimedia.org/r/1159461 (https://phabricator.wikimedia.org/T396621) [10:19:16] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting A:liberica-canary (T397053) [10:19:21] T397053: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053 [10:19:31] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling A:liberica-canary [10:19:39] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling A:liberica-canary [10:19:42] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin pooling A:liberica-canary [10:19:49] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling A:liberica-canary [10:19:51] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting A:liberica-canary (T397053) [10:20:12] !log urbanecm@deploy1003 Finished scap sync-world: Backport for 
[[gerrit:1160043|fix: Gauge metrics use `::set` not `::observe` (T397135)]], [[gerrit:1160044|fix: Gauge metrics use `::set` not `::observe` (T397135)]] (duration: 20m 12s) [10:20:18] T397135: Error: Call to undefined method Wikimedia\Stats\Metrics\GaugeMetric::observe() - https://phabricator.wikimedia.org/T397135 [10:21:30] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on my maps - https://phabricator.wikimedia.org/T397151 (10Fpisot) 03NEW [10:23:27] (03CR) 10Clément Goubert: [C:03+1] changeprop: increase memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160052 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:24:17] (03PS2) 10Hnowlan: changeprop: increase memory, cpu limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160052 (https://phabricator.wikimedia.org/T397072) [10:24:23] (03PS1) 10Filippo Giunchedi: team-o11y: fix ThanosSidecarDropQueries threshold [alerts] - 10https://gerrit.wikimedia.org/r/1160059 [10:24:38] (03CR) 10Clément Goubert: [C:03+1] changeprop: increase memory, cpu limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160052 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:26:06] (03CR) 10Filippo Giunchedi: [C:03+2] team-o11y: fix ThanosSidecarDropQueries threshold [alerts] - 10https://gerrit.wikimedia.org/r/1160059 (owner: 10Filippo Giunchedi) [10:26:40] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host db1250.eqiad.wmnet [10:26:43] !log marostegui@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts db1250.eqiad.wmnet [10:27:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10922233 (10Marostegui) 05Open→03Resolved Thank you Amir, that worked - db1250 done {P78154} [10:27:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: SSD firmware update for db125[0-4] - 
https://phabricator.wikimedia.org/T396648#10922237 (10Marostegui) [10:27:46] (03CR) 10Marostegui: [C:03+2] db1250: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1160050 (https://phabricator.wikimedia.org/T396648) (owner: 10Marostegui) [10:28:01] (03CR) 10Clément Goubert: "Something's weird, CI isn't getting a diff for prod, and the memory limit doesn't seem overwrote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160052 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:28:28] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet [10:29:55] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet [10:30:22] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on my maps - https://phabricator.wikimedia.org/T397151#10922246 (10Bugreporter) 05Open→03Invalid This is not a valid usecase. See https://switch2osm.org/providers/ instead, or if you just want to use in your personal project, create a reverse proxy... 
[10:30:23] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading clouddbs T394372 [10:30:28] T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372 [10:31:33] 06SRE, 06Infrastructure-Foundations, 10netops: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153 (10cmooney) 03NEW p:05Triage→03Low [10:32:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[1217,1250].eqiad.wmnet with reason: Maintenance [10:32:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1217,1250].eqiad.wmnet with reason: Maintenance [10:33:28] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:34:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:34:35] (03Abandoned) 10Hnowlan: changeprop: increase memory, cpu limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160052 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:34:51] (03CR) 10FNegri: [C:03+2] clouddb1018: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154807 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [10:35:40] (03PS1) 10Filippo Giunchedi: thanos: bump sidecar series limit [puppet] - 10https://gerrit.wikimedia.org/r/1160070 (https://phabricator.wikimedia.org/T394318) [10:35:45] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: 
https://wikitech.wikimedia.org/wiki/HAProxy [10:36:03] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:36:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T396130)', diff saved to https://phabricator.wikimedia.org/P78156 and previous config saved to /var/cache/conftool/dbconfig/20250617-103638-marostegui.json [10:36:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:37:19] (03PS1) 10Cathal Mooney: Mark HTTP(S) traffic from dumps with low-priority QoS mark [puppet] - 10https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) [10:37:24] haproxy alerts expected [10:39:20] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: bump sidecar series limit [puppet] - 10https://gerrit.wikimedia.org/r/1160070 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [10:40:21] (03CR) 10Tiziano Fogli: [C:03+1] thanos: bump sidecar series limit [puppet] - 10https://gerrit.wikimedia.org/r/1160070 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [10:40:32] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [10:42:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:42:52] (03PS2) 10Cathal Mooney: Mark HTTP(S) traffic from dumps with low-priority QoS mark [puppet] - 10https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) [10:43:20] (03CR) 
10Hnowlan: [C:03+2] trafficserver: migrate html<->wikitext transforms out of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1156813 (https://phabricator.wikimedia.org/T396856) (owner: 10Hnowlan) [10:43:29] jouncebot: nowandnext [10:43:29] For the next 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1000) [10:43:29] In 1 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1200) [10:45:11] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [10:47:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:49:20] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on my maps - https://phabricator.wikimedia.org/T397151#10922331 (10Aklapper) 05Invalid→03Declined Per https://wikitech.wikimedia.org/wiki/Maps/External_usage, > maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites hosted by Wikim... 
[10:50:03] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:50:22] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:50:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T382778)', diff saved to https://phabricator.wikimedia.org/P78157 and previous config saved to /var/cache/conftool/dbconfig/20250617-105028-ladsgroup.json [10:50:33] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [10:51:20] !log migrate transform APIs for wikitext<->html out of restbase [10:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P78158 and previous config saved to /var/cache/conftool/dbconfig/20250617-105145-marostegui.json [10:53:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T382778)', diff saved to https://phabricator.wikimedia.org/P78159 and previous config saved to /var/cache/conftool/dbconfig/20250617-105353-ladsgroup.json [10:54:17] (03PS1) 10Elukey: profile::pyrra: improve the istio SLOs template [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) [10:54:18] (03PS1) 10Elukey: profile::pyrra: fix Citoid's SLO targets [puppet] - 10https://gerrit.wikimedia.org/r/1160076 (https://phabricator.wikimedia.org/T391852) [10:54:55] (03CR) 10Ayounsi: [C:03+1] Mark HTTP(S) traffic from dumps with low-priority QoS mark [puppet] - 10https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [10:56:55] (03PS11) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest 
image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [10:57:14] (03PS1) 10Filippo Giunchedi: thanos: temp disable sidecar series limit [puppet] - 10https://gerrit.wikimedia.org/r/1160078 (https://phabricator.wikimedia.org/T394318) [10:58:09] (03PS2) 10Elukey: profile::pyrra: improve the istio SLOs template [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) [10:58:09] (03PS2) 10Elukey: profile::pyrra: fix Citoid's SLO targets [puppet] - 10https://gerrit.wikimedia.org/r/1160076 (https://phabricator.wikimedia.org/T391852) [10:58:58] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] thanos: temp disable sidecar series limit [puppet] - 10https://gerrit.wikimedia.org/r/1160078 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [10:59:04] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5981/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [10:59:31] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:36] !log upload liberica 0.20 to apt.wm.o (bookworm-wikimedia) - T397053 [10:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:41] T397053: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053 [11:00:09] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet [11:00:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - 
https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:00:26] (03PS3) 10Elukey: profile::pyrra: improve the istio SLOs template [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) [11:00:27] (03PS3) 10Elukey: profile::pyrra: fix Citoid's SLO targets [puppet] - 10https://gerrit.wikimedia.org/r/1160076 (https://phabricator.wikimedia.org/T391852) [11:00:30] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-canary (T397053) [11:00:33] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling A:liberica-canary [11:00:41] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling A:liberica-canary [11:00:51] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin pooling A:liberica-canary [11:00:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling A:liberica-canary [11:01:00] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-canary (T397053) [11:01:50] (03PS12) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [11:02:17] (03CR) 10Muehlenhoff: [C:03+2] Don't auto-restart atftpd on Bookworm and later [puppet] - 10https://gerrit.wikimedia.org/r/1156648 (owner: 10Muehlenhoff) [11:02:26] (03CR) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [11:03:46] (03PS1) 10Majavah: hieradata: Specify IPv4 addresses of Cloud VPS DNS recursors directly [puppet] - 10https://gerrit.wikimedia.org/r/1160081 
(https://phabricator.wikimedia.org/T396448) [11:03:48] (03PS1) 10Marostegui: mariadb: Move db1250 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/1160082 (https://phabricator.wikimedia.org/T397152) [11:04:22] (03PS2) 10Effie Mouzeli: kubernetes: create mediawiki_experimental profile [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) [11:04:29] (03CR) 10Effie Mouzeli: kubernetes: create mediawiki_experimental profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [11:04:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5983/console" [puppet] - 10https://gerrit.wikimedia.org/r/1160081 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [11:04:53] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [11:05:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.06%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:06:07] jouncebot: nowandnext [11:06:07] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [11:06:07] In 0 hour(s) and 53 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1200) [11:06:47] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1250 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/1160082 (https://phabricator.wikimedia.org/T397152) (owner: 10Marostegui) [11:06:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P78160 and previous 
config saved to /var/cache/conftool/dbconfig/20250617-110652-marostegui.json [11:07:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151831 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [11:08:18] (03Merged) 10jenkins-bot: InitialiseSettings: wgTemplateDataEnableDiscovery on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151831 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [11:08:23] (03CR) 10Btullis: "Have you tested a restart of the hadoop-hdfs-zkfc.service service on an-master100[3-4] since increasing the cluster size?" [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [11:08:26] (03PS1) 10Majavah: hieradata: Fix codfw1dev IPv6 DNS recursor address [puppet] - 10https://gerrit.wikimedia.org/r/1160083 (https://phabricator.wikimedia.org/T396448) [11:08:27] (03PS1) 10Majavah: hieradata: Add eqiad1 DNS IPv6 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/1160084 (https://phabricator.wikimedia.org/T396448) [11:08:28] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:37] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: add wikikube-worker-exp(1001|2001) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:08:41] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1151831|InitialiseSettings: wgTemplateDataEnableDiscovery on more wikis (T377975)]] [11:08:45] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [11:09:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P78161 and previous config saved to /var/cache/conftool/dbconfig/20250617-110900-ladsgroup.json [11:09:31] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:54] !log samtar@deploy1003 samwilson, samtar: Backport for [[gerrit:1151831|InitialiseSettings: wgTemplateDataEnableDiscovery on more wikis (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:11:06] (03CR) 10Btullis: "This is not quite ready yet, as the helm-lint still shows:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159926 (https://phabricator.wikimedia.org/T395412) (owner: 10Stevemunene) [11:12:02] (03CR) 10Btullis: [C:03+1] replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135049 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [11:13:14] !log samtar@deploy1003 samwilson, samtar: Continuing with sync [11:15:33] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes: create mediawiki_experimental profile [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [11:16:36] (03PS13) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [11:17:06] (03CR) 10Stevemunene: "That's because the lint cannot access the values from puppet." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1159926 (https://phabricator.wikimedia.org/T395412) (owner: 10Stevemunene) [11:17:38] (03PS1) 10Marostegui: db1217: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160088 (https://phabricator.wikimedia.org/T394371) [11:19:26] (03CR) 10Btullis: [C:03+1] "Oh, OK." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159926 (https://phabricator.wikimedia.org/T395412) (owner: 10Stevemunene) [11:20:17] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151831|InitialiseSettings: wgTemplateDataEnableDiscovery on more wikis (T377975)]] (duration: 11m 36s) [11:20:22] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [11:20:45] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:20:55] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:20:57] (03CR) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [11:20:58] (03PS1) 10Hnowlan: changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) [11:21:07] All haproxy alerts are expected [11:21:23] (03PS2) 10Hnowlan: changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) [11:21:39] (03CR) 10Effie Mouzeli: [C:03+2] "Chatted with Scott about this yesterday, we are good to go" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 
10Effie Mouzeli) [11:21:43] PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:22:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T396130)', diff saved to https://phabricator.wikimedia.org/P78162 and previous config saved to /var/cache/conftool/dbconfig/20250617-112200-marostegui.json [11:22:01] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:22:05] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:22:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [11:22:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T396130)', diff saved to https://phabricator.wikimedia.org/P78163 and previous config saved to /var/cache/conftool/dbconfig/20250617-112222-marostegui.json [11:22:30] (03CR) 10Marostegui: [C:03+2] db1217: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1160088 (https://phabricator.wikimedia.org/T394371) (owner: 10Marostegui) [11:22:45] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:23:13] PROBLEM - haproxy failover on dbproxy1029 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:23:56] (03PS2) 10Jgiannelos: Revert "RB sunset: debug spike in changeprop events" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160048 [11:24:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P78164 and previous config saved to /var/cache/conftool/dbconfig/20250617-112408-ladsgroup.json 
[11:25:45] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:55] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:25:56] (03PS3) 10Hnowlan: changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) [11:26:43] RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:27:01] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:27:13] RECOVERY - haproxy failover on dbproxy1029 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:27:45] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:27:58] (03CR) 10Jgiannelos: [C:03+1] changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [11:28:38] (03CR) 10Hnowlan: [C:03+1] Revert "RB sunset: debug spike in changeprop events" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160048 (owner: 10Jgiannelos) [11:31:20] (03CR) 10Hnowlan: [C:03+2] Revert "RB sunset: debug spike in changeprop events" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160048 (owner: 10Jgiannelos) [11:33:23] (03Merged) 10jenkins-bot: Revert "RB sunset: debug spike in changeprop events" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160048 (owner: 10Jgiannelos) [11:33:55] (03CR) 10Hnowlan: [C:03+2] changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [11:34:02] (03CR) 10CI reject: [V:04-1] changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [11:34:31] FIRING: [2x] SystemdUnitFailed: prometheus-puppet-agent-stats.service on cuminunpriv1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:44] (03PS4) 10Hnowlan: changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) [11:38:00] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10922553 (10jijiki) [11:38:06] (03CR) 10Hnowlan: [C:03+2] changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [11:38:21] !log jiji@cumin1002 START - Cookbook sre.ganeti.makevm for new host wikikube-worker-exp1001.eqiad.wmnet [11:38:23] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [11:38:30] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1160096 (https://phabricator.wikimedia.org/T397163) [11:39:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T382778)', diff saved to https://phabricator.wikimedia.org/P78165 and previous config saved to /var/cache/conftool/dbconfig/20250617-113915-ladsgroup.json [11:39:20] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [11:39:31] !log ladsgroup@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:39:51] (03Merged) 10jenkins-bot: changeprop, mobileapps: bump pcs job concurrency, mobileapps replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160089 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [11:40:21] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1160101 (https://phabricator.wikimedia.org/T397164) [11:40:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:40:34] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [11:40:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T382778)', diff saved to https://phabricator.wikimedia.org/P78166 and previous config saved to /var/cache/conftool/dbconfig/20250617-114037-ladsgroup.json [11:40:43] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:41:41] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:41:45] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:43:58] jiji@cumin1002 makevm (PID 3425476) is awaiting input [11:43:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T382778)', diff saved to https://phabricator.wikimedia.org/P78167 and previous config saved to /var/cache/conftool/dbconfig/20250617-114357-ladsgroup.json [11:45:40] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [11:45:50] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [11:46:07] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [11:46:34] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [11:46:55] !log 
hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:47:06] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:47:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:47:15] 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX), 13Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10922627 (10KartikMistry) >>! In T335491#10921755, @klausman wrote: > Found it: the secrets were not wired u... [11:47:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T396130)', diff saved to https://phabricator.wikimedia.org/P78168 and previous config saved to /var/cache/conftool/dbconfig/20250617-114750-marostegui.json [11:47:55] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:48:28] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye [11:48:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10922632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2006.codfw.wmnet with OS bul... 
[11:52:29] (03PS1) 10Majavah: natlog: Persist logs to /srv [puppet] - 10https://gerrit.wikimedia.org/r/1160104 [11:53:09] (03PS2) 10Majavah: natlog: Persist logs to /srv [puppet] - 10https://gerrit.wikimedia.org/r/1160104 (https://phabricator.wikimedia.org/T273734) [11:53:25] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5985/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160104 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [11:56:02] (03CR) 10Stevemunene: zookeeper: remove an-conf100[1-3] from the cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [11:56:13] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet [11:59:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P78169 and previous config saved to /var/cache/conftool/dbconfig/20250617-115905-ladsgroup.json [11:59:58] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1200) [12:00:41] (03CR) 10Stevemunene: [C:03+2] superset:pull kerberos server values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159926 (https://phabricator.wikimedia.org/T395412) (owner: 10Stevemunene) [12:02:33] (03Merged) 10jenkins-bot: superset:pull kerberos server values from global values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159926 (https://phabricator.wikimedia.org/T395412) (owner: 10Stevemunene) [12:02:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P78170 and 
previous config saved to /var/cache/conftool/dbconfig/20250617-120257-marostegui.json [12:03:28] FIRING: [2x] SystemdUnitFailed: prometheus-puppet-agent-stats.service on cuminunpriv1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:03] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [12:06:58] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [12:08:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage [12:09:31] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [12:09:45] jouncebot: nowandnext [12:09:45] For the next 0 hour(s) and 50 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1200) [12:09:45] In 0 hour(s) and 50 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1300) [12:10:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage [12:11:09] (03PS2) 10Samtar: IS: Enable `wgTemplateDataEnableDiscovery` for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155665 (https://phabricator.wikimedia.org/T377975) [12:12:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155665 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [12:13:05] (03PS1) 10Brouberol: airflow: derive AIRFLOW_APPOWNER from the user principal in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160115 
(https://phabricator.wikimedia.org/T394297) [12:13:39] (03Merged) 10jenkins-bot: IS: Enable `wgTemplateDataEnableDiscovery` for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155665 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [12:14:04] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1155665|IS: Enable `wgTemplateDataEnableDiscovery` for mediawikiwiki (T377975)]] [12:14:09] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [12:14:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P78171 and previous config saved to /var/cache/conftool/dbconfig/20250617-121412-ladsgroup.json [12:14:46] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete package after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1160116 [12:15:36] (03PS1) 10Majavah: natlog: Use a separate journald namespace with no storage [puppet] - 10https://gerrit.wikimedia.org/r/1160117 (https://phabricator.wikimedia.org/T273734) [12:16:15] !log samtar@deploy1003 samtar: Backport for [[gerrit:1155665|IS: Enable `wgTemplateDataEnableDiscovery` for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:16:16] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [12:16:27] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5986/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160117 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [12:17:46] !log samtar@deploy1003 samtar: Continuing with sync [12:17:55] (03PS1) 10Jelto: gitlab-runner: upgrade default image to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1160119 (https://phabricator.wikimedia.org/T384595) [12:17:56] (03PS1) 10Jelto: gitlab-runner: upgrade default image to bookworm on Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/1160120 (https://phabricator.wikimedia.org/T384595) [12:18:00] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [12:18:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P78172 and previous config saved to /var/cache/conftool/dbconfig/20250617-121805-marostegui.json [12:20:54] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2023.codfw.wmnet with reason: remove for decom [12:21:19] (03PS1) 10Filippo Giunchedi: thanos: remove old istio_sli_ recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1160122 (https://phabricator.wikimedia.org/T394318) [12:21:26] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet [12:21:59] (03CR) 10FNegri: [C:03+1] hieradata: Specify IPv4 addresses of Cloud VPS DNS recursors directly [puppet] - 10https://gerrit.wikimedia.org/r/1160081 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [12:22:22] (03CR) 10Filippo Giunchedi: "I did a dashboard audit and these are the only two, seemingly outdated, dashboards I 
could find:" [puppet] - 10https://gerrit.wikimedia.org/r/1160122 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [12:22:54] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Specify IPv4 addresses of Cloud VPS DNS recursors directly [puppet] - 10https://gerrit.wikimedia.org/r/1160081 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [12:23:01] (03CR) 10FNegri: [C:03+1] hieradata: Fix codfw1dev IPv6 DNS recursor address [puppet] - 10https://gerrit.wikimedia.org/r/1160083 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [12:23:03] (03PS1) 10Sbisson: CX3 Build 1.0.0+20250616 [extensions/ContentTranslation] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160123 (https://phabricator.wikimedia.org/T374695) [12:23:05] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:23:08] jouncebot: nowandnext [12:23:08] For the next 0 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1200) [12:23:09] In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1300) [12:23:45] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:24:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ContentTranslation] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160123 (https://phabricator.wikimedia.org/T374695) (owner: 10Sbisson) [12:24:13] (03CR) 10Majavah: [C:03+2] hieradata: Fix codfw1dev IPv6 DNS recursor address [puppet] - 10https://gerrit.wikimedia.org/r/1160083 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [12:24:22] (03CR) 10FNegri: [C:03+1] hieradata: Add 
eqiad1 DNS IPv6 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/1160084 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [12:24:47] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155665|IS: Enable `wgTemplateDataEnableDiscovery` for mediawikiwiki (T377975)]] (duration: 10m 42s) [12:24:51] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [12:25:47] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [12:25:56] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [12:26:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [12:26:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2006.codfw.wmnet with OS bullseye [12:26:20] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10922787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2006.codfw.wmnet with OS bullsey... 
[12:26:40] (03PS1) 10Effie Mouzeli: scap: allow k8s hosts to sync mediawiki/release [puppet] - 10https://gerrit.wikimedia.org/r/1160124 [12:27:58] (03PS2) 10Effie Mouzeli: scap: allow k8s hosts to sync mediawiki/release [puppet] - 10https://gerrit.wikimedia.org/r/1160124 [12:28:22] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160124 (owner: 10Effie Mouzeli) [12:28:35] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet [12:29:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T382778)', diff saved to https://phabricator.wikimedia.org/P78173 and previous config saved to /var/cache/conftool/dbconfig/20250617-122919-ladsgroup.json [12:29:23] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [12:29:35] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1177.eqiad.wmnet with reason: Maintenance [12:29:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T382778)', diff saved to https://phabricator.wikimedia.org/P78174 and previous config saved to /var/cache/conftool/dbconfig/20250617-122942-ladsgroup.json [12:30:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [12:32:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T382778)', diff saved to https://phabricator.wikimedia.org/P78175 and previous config saved to /var/cache/conftool/dbconfig/20250617-123241-ladsgroup.json [12:32:43] ^ expected, a host is being drained [12:32:50] (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Remove more obsolete package after bullseye->bookworm update [puppet] - 
10https://gerrit.wikimedia.org/r/1160116 (owner: 10Muehlenhoff) [12:33:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T396130)', diff saved to https://phabricator.wikimedia.org/P78176 and previous config saved to /var/cache/conftool/dbconfig/20250617-123312-marostegui.json [12:33:17] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:33:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [12:33:34] (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Remove more obsolete package after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1160116 (owner: 10Muehlenhoff) [12:33:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T396130)', diff saved to https://phabricator.wikimedia.org/P78177 and previous config saved to /var/cache/conftool/dbconfig/20250617-123334-marostegui.json [12:34:39] jouncebot: nowandnext [12:34:39] For the next 0 hour(s) and 25 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1200) [12:34:39] In 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1300) [12:34:46] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM wikikube-worker-exp1001.eqiad.wmnet - jiji@cumin1002" [12:34:51] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM wikikube-worker-exp1001.eqiad.wmnet - jiji@cumin1002" [12:34:51] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:34:51] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache 
wikikube-worker-exp1001.eqiad.wmnet on all recursors [12:34:54] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker-exp1001.eqiad.wmnet on all recursors [12:35:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5987/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160084 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [12:35:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [12:35:22] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM wikikube-worker-exp1001.eqiad.wmnet - jiji@cumin1002" [12:35:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM wikikube-worker-exp1001.eqiad.wmnet - jiji@cumin1002" [12:37:30] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Add eqiad1 DNS IPv6 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/1160084 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [12:39:18] (03PS1) 10Samtar: InitialiseSettings: Enable TemplateDiscovery for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160126 (https://phabricator.wikimedia.org/T377975) [12:40:47] jiji@cumin1002 makevm (PID 3425476) is awaiting input [12:42:19] (03PS1) 10C. 
Scott Ananian: stats: Add buckets based on wikitext size; fix increment bug [extensions/Linter] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160127 (https://phabricator.wikimedia.org/T393400) [12:43:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Linter] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1160127 (https://phabricator.wikimedia.org/T393400) (owner: 10C. Scott Ananian) [12:45:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10922960 (10MatthewVernon) @Jhancock.wm thanos-be2006 Just Worked; thanos-be2007 still had a disk with an old EFI setup on it - I knew which one fro... [12:45:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10922962 (10Jclark-ctr) [12:47:15] !log taavi@cumin1003 START - Cookbook sre.dns.netbox [12:47:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P78178 and previous config saved to /var/cache/conftool/dbconfig/20250617-124748-ladsgroup.json [12:50:10] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker-exp1001.eqiad.wmnet with OS bookworm [12:50:24] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10923010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host wikikube-worker-exp1001.eqiad.wmnet with OS bookworm [12:50:38] !log taavi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add eqiad1 auth v6 VIPs - taavi@cumin1003" [12:50:43] !log 
taavi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add eqiad1 auth v6 VIPs - taavi@cumin1003" [12:50:43] !log taavi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:54:12] (03PS1) 10KartikMistry: Enable the Contribute menu on new Wikipedias automatically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160128 (https://phabricator.wikimedia.org/T395031) [12:55:38] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum5002 [puppet] - 10https://gerrit.wikimedia.org/r/1159493 (owner: 10Ssingh) [12:55:43] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum6002 [puppet] - 10https://gerrit.wikimedia.org/r/1159495 (owner: 10Ssingh) [12:56:07] (03PS1) 10Vgutierrez: liberica: Disable ProtectKernelTunables for daemons using eBPF [puppet] - 10https://gerrit.wikimedia.org/r/1160130 (https://phabricator.wikimedia.org/T397053) [12:56:37] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm [12:56:38] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum5002.eqsin.wmnet with OS bookworm [12:58:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T396130)', diff saved to https://phabricator.wikimedia.org/P78179 and previous config saved to /var/cache/conftool/dbconfig/20250617-125847-marostegui.json [12:58:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:59:38] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10923088 (10Jclark-ctr) a:03Jclark-ctr [13:00:04] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160130 (https://phabricator.wikimedia.org/T397053) (owner: 10Vgutierrez) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It 
is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1300). [13:00:05] tgr, stephanebisson, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:13] o/ [13:00:17] o/ [13:00:34] let’s hold for a second [13:01:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:02:39] (PS2) Effie Mouzeli: debug.json: add mw-experimental hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1154070 (https://phabricator.wikimedia.org/T276994) [13:02:41] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker-exp1001.eqiad.wmnet with reason: host reimage [13:02:55] o/ [13:02:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P78180 and previous config saved to /var/cache/conftool/dbconfig/20250617-130256-ladsgroup.json [13:03:16] Lucas_WMDE what's going on? [13:03:24] Lucas_WMDE: do I have time to throw https://gerrit.wikimedia.org/r/c/operations/puppet/+/1154069 in the mix? [13:03:35] let’s start with tgr’s config patch [13:03:38] tgr: want to self-service? 
[13:03:45] (CR) Jelto: "looks mostly good to me, some comments in-line" [cookbooks] - https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: Arnaudb) [13:03:52] (Abandoned) Slyngshede: Makefile: force AMD64 as platform [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1160047 (owner: Slyngshede) [13:04:24] Lucas_WMDE: I will queue as patiently as a brit [13:04:38] :D [13:06:08] !bash Lucas_WMDE: I will queue as patiently as a brit [13:06:09] Lucas_WMDE: Stored quip at https://bash.toolforge.org/quip/n1D_fZcBffdvpiTrPXbb [13:06:10] FIRING: [6x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:06:33] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker-exp1001.eqiad.wmnet with reason: host reimage [13:07:28] (CR) Muehlenhoff: scap: allow k8s hosts to sync mediawiki/release (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1160124 (owner: Effie Mouzeli) [13:08:02] (CR) Ssingh: [C:+1] liberica: Disable ProtectKernelTunables for daemons using eBPF [puppet] - https://gerrit.wikimedia.org/r/1160130 (https://phabricator.wikimedia.org/T397053) (owner: Vgutierrez) [13:08:05] (PS1) Volans: CHANGELOG: add changelogs for release v2.0.0 [software/pywmflib] - https://gerrit.wikimedia.org/r/1160131 [13:08:09] Lucas_WMDE: sure (sorry was AFK) [13:08:35] np [13:09:20] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:22] (PS1) Brouberol: airflow-dev: increase the memory limits of the webserver in devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1160132 (https://phabricator.wikimedia.org/T394297) [13:09:25] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) (owner: Gergő Tisza) [13:10:14] (Merged) jenkins-bot: Use GetSecurityLogContext hook for goodpass/badpass logging [mediawiki-config] - https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) (owner: Gergő Tisza) [13:10:22] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:magru and not P{cp7002*} and A:cp - 9.2.10 upgrade (T390912) [13:10:27] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [13:10:37] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1153626|Use GetSecurityLogContext hook for goodpass/badpass logging (T395204)]] [13:10:41] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [13:10:53] effie: seems fine to deploy that in parallel? [13:10:55] Please add me to the queue ^_^ *drinks tea patiently* [13:11:05] (CR) Fabfur: [C:+1] "Every day I learn a new systemd hardening option..." 
[puppet] - https://gerrit.wikimedia.org/r/1160130 (https://phabricator.wikimedia.org/T397053) (owner: Vgutierrez) [13:11:06] (PS3) Effie Mouzeli: scap: allow k8s hosts to sync mediawiki/release [puppet] - https://gerrit.wikimedia.org/r/1160124 [13:11:11] (PS1) Muehlenhoff: Also switch cumin2002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1160133 (https://phabricator.wikimedia.org/T389380) [13:11:45] tgr: I was hoping for a kind soul to +2 it and deploy it with the rest :p [13:12:07] (CR) Effie Mouzeli: scap: allow k8s hosts to sync mediawiki/release (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1160124 (owner: Effie Mouzeli) [13:12:10] (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160124 (owner: Effie Mouzeli) [13:12:26] it's a puppet patch so not sure any of the deployers can do that [13:12:47] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1160124 (owner: Effie Mouzeli) [13:12:47] !log tgr@deploy1003 tgr: Backport for [[gerrit:1153626|Use GetSecurityLogContext hook for goodpass/badpass logging (T395204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:13:33] tgr: sorry, I meant this patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1154070 [13:13:48] Lucas_WMDE: ^ [13:13:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P78182 and previous config saved to /var/cache/conftool/dbconfig/20250617-131354-marostegui.json [13:13:58] (PS2) Vgutierrez: liberica: Add /sys/fs/bpf to ReadWritePaths [puppet] - https://gerrit.wikimedia.org/r/1160130 (https://phabricator.wikimedia.org/T397053) [13:13:59] E_TOO_MANY_PATCHES [13:14:03] FIRING: [2x] PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:14:20] (CR) Lucas Werkmeister (WMDE): debug.json: add mw-experimental hosts (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1154070 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [13:14:39] (CR) Ssingh: [C:+1] liberica: Add /sys/fs/bpf to ReadWritePaths [puppet] - https://gerrit.wikimedia.org/r/1160130 (https://phabricator.wikimedia.org/T397053) (owner: Vgutierrez) [13:14:44] (PS3) Effie Mouzeli: debug.json: add mw-experimental hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1154070 (https://phabricator.wikimedia.org/T276994) [13:15:11] (CR) Vgutierrez: [C:+2] liberica: Add /sys/fs/bpf to ReadWritePaths [puppet] - https://gerrit.wikimedia.org/r/1160130 (https://phabricator.wikimedia.org/T397053) (owner: Vgutierrez) [13:15:17] ops-eqiad, SRE, DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10923171 (MoritzMuehlenhoff) That >>! In T396660#10922175, @elukey wrote: > While reviewing `/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` I noticed that we have a mixture of... 
[13:17:06] Lucas_WMDE: <3, sorted [13:17:33] !log tgr@deploy1003 Sync cancelled. [13:17:42] oh no [13:18:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T382778)', diff saved to https://phabricator.wikimedia.org/P78183 and previous config saved to /var/cache/conftool/dbconfig/20250617-131803-ladsgroup.json [13:18:08] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [13:18:19] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:18:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T382778)', diff saved to https://phabricator.wikimedia.org/P78184 and previous config saved to /var/cache/conftool/dbconfig/20250617-131824-ladsgroup.json [13:18:38] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [13:18:41] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting A:liberica-canary (T397053) [13:18:45] T397053: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053 [13:18:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 22:00:00 on 10 hosts with reason: Maintenance [13:18:55] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling A:liberica-canary [13:19:01] I'll make a quick fix [13:19:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling A:liberica-canary [13:19:06] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin pooling A:liberica-canary [13:19:11] so much for config changes being quick, sorry [13:19:13] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling A:liberica-canary [13:19:15] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting A:liberica-canary 
(T397053) [13:19:22] (CR) Lucas Werkmeister (WMDE): debug.json: add mw-experimental hosts (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1154070 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [13:19:24] (PS1) Majavah: dynamicproxy: Support IPv6-enabled recursors [puppet] - https://gerrit.wikimedia.org/r/1160135 (https://phabricator.wikimedia.org/T396448) [13:19:24] (PS1) Marostegui: db1155: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160137 (https://phabricator.wikimedia.org/T394371) [13:19:26] (PS1) Majavah: P:toolforge: nginx: Support IPv6-enabled recursors [puppet] - https://gerrit.wikimedia.org/r/1160136 (https://phabricator.wikimedia.org/T396448) [13:19:57] tgr: good luck! [13:20:00] (CR) CI reject: [V:-1] dynamicproxy: Support IPv6-enabled recursors [puppet] - https://gerrit.wikimedia.org/r/1160135 (https://phabricator.wikimedia.org/T396448) (owner: Majavah) [13:20:35] (PS1) Gergő Tisza: Fix GetSecurityLogContext hook declaration [mediawiki-config] - https://gerrit.wikimedia.org/r/1160138 (https://phabricator.wikimedia.org/T395204) [13:20:55] (CR) Marostegui: [C:+2] db1155: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160137 (https://phabricator.wikimedia.org/T394371) (owner: Marostegui) [13:21:50] (PS2) Majavah: dynamicproxy: Support IPv6-enabled recursors [puppet] - https://gerrit.wikimedia.org/r/1160135 (https://phabricator.wikimedia.org/T396448) [13:21:51] (PS2) Majavah: P:toolforge: nginx: Support IPv6-enabled recursors [puppet] - https://gerrit.wikimedia.org/r/1160136 (https://phabricator.wikimedia.org/T396448) [13:21:51] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160138 (https://phabricator.wikimedia.org/T395204) (owner: Gergő Tisza) [13:21:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1178 (T382778)', diff saved to https://phabricator.wikimedia.org/P78185 and previous config saved to /var/cache/conftool/dbconfig/20250617-132157-ladsgroup.json [13:22:02] (PS1) Vgutierrez: liberica: Disable ProtectKernelTunables on units using eBPF [puppet] - https://gerrit.wikimedia.org/r/1160141 (https://phabricator.wikimedia.org/T397053) [13:22:24] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1160141 (https://phabricator.wikimedia.org/T397053) (owner: Vgutierrez) [13:22:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [13:22:48] (CR) Ssingh: [C:+1] liberica: Disable ProtectKernelTunables on units using eBPF [puppet] - https://gerrit.wikimedia.org/r/1160141 (https://phabricator.wikimedia.org/T397053) (owner: Vgutierrez) [13:22:49] (Merged) jenkins-bot: Fix GetSecurityLogContext hook declaration [mediawiki-config] - https://gerrit.wikimedia.org/r/1160138 (https://phabricator.wikimedia.org/T395204) (owner: Gergő Tisza) [13:22:53] did we always have the top errors in spiderpig or is that a new feature? [13:22:58] relatively new [13:22:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker-exp1001.eqiad.wmnet with OS bookworm [13:22:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host wikikube-worker-exp1001.eqiad.wmnet [13:23:07] SRE, Infrastructure-Foundations, vm-requests, Patch-For-Review: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10923236 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host wikikube-worker-exp1001.eqiad.wmnet with OS bookworm complete... [13:23:09] nice [13:23:11] T391005, a few days ago [13:23:11] T391005: Add a log view to SpiderPig - https://phabricator.wikimedia.org/T391005 [13:23:12] yeah! 
[13:23:13] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1160138|Fix GetSecurityLogContext hook declaration (T395204)]] [13:23:17] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [13:24:04] (CR) Brouberol: airflow: derive AIRFLOW_APPOWNER from the user principal in devenvs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1160115 (https://phabricator.wikimedia.org/T394297) (owner: Brouberol) [13:24:45] (CR) Lucas Werkmeister (WMDE): debug.json: add mw-experimental hosts (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1154070 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [13:25:26] !log tgr@deploy1003 tgr: Backport for [[gerrit:1160138|Fix GetSecurityLogContext hook declaration (T395204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:25:37] (CR) Vgutierrez: [C:+2] liberica: Disable ProtectKernelTunables on units using eBPF [puppet] - https://gerrit.wikimedia.org/r/1160141 (https://phabricator.wikimedia.org/T397053) (owner: Vgutierrez) [13:27:10] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[13:27:18] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting A:liberica-canary (T397053) [13:27:22] T397053: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053 [13:27:32] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling A:liberica-canary [13:27:39] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling A:liberica-canary [13:27:42] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin pooling A:liberica-canary [13:27:46] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:28:00] !log tgr@deploy1003 tgr: Continuing with sync [13:28:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling A:liberica-canary [13:28:06] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting A:liberica-canary (T397053) [13:28:40] (CR) JMeybohm: scap: allow k8s hosts to sync mediawiki/release (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1160124 (owner: Effie Mouzeli) [13:29:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P78186 and previous config saved to /var/cache/conftool/dbconfig/20250617-132902-marostegui.json [13:30:14] (CR) Effie Mouzeli: scap: allow k8s hosts to sync mediawiki/release (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1160124 (owner: Effie Mouzeli) [13:31:03] (CR) Effie Mouzeli: [C:+2] scap: allow k8s hosts to sync mediawiki/release [puppet] - https://gerrit.wikimedia.org/r/1160124 (owner: Effie Mouzeli) [13:31:25] effie: FYI I added your config change to the deployments page [13:31:30] remains to be seen if we’ll get there during this window [13:32:24] !log vgutierrez@cumin1002 START - Cookbook 
sre.loadbalancer.upgrade restarting A:liberica-canary (T397053) [13:32:28] T397053: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053 [13:32:38] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling A:liberica-canary [13:32:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling A:liberica-canary [13:32:49] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin pooling A:liberica-canary [13:32:52] seems safe enough to just deploy it alongside some other patch [13:32:57] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet [13:32:58] (which I forgot right now, sorry) [13:33:00] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling A:liberica-canary [13:33:02] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting A:liberica-canary (T397053) [13:33:19] (CR) Effie Mouzeli: [C:+2] x-wikimedia-debug-routing: add mw-experimental hosts [puppet] - https://gerrit.wikimedia.org/r/1154069 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [13:33:37] SRE, Data-Platform-SRE, Infrastructure-Foundations, vm-requests: eqiad: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362107#10923318 (Gehel) Open→Declined We'll deploy on dse-k8s instead [13:33:48] SRE, Infrastructure-Foundations, vm-requests, Data-Platform-SRE (2025.06.13 - 2025.07.04): codfw: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362106#10923321 (Gehel) [13:34:10] (PS1) Reedy: composer: Update php platform to 8.1.0 and update composer.lock [mediawiki-config] - https://gerrit.wikimedia.org/r/1160144 [13:34:11] SRE, Infrastructure-Foundations, vm-requests, Data-Platform-SRE (2025.06.13 - 2025.07.04): codfw: 3x VM request for new opensearch cluster - 
https://phabricator.wikimedia.org/T362106#10923323 (Gehel) Open→Declined We'll deploy on dse-k8s instead [13:34:14] SRE, Infrastructure-Foundations, vm-requests, Data-Platform-SRE (2025.06.13 - 2025.07.04): eqiad: 3x VM request for new opensearch cluster - https://phabricator.wikimedia.org/T362107#10923326 (Gehel) [13:34:27] (CR) Klausman: [C:+1] WIP: machinetranslation: Use s3 storage for production [deployment-charts] - https://gerrit.wikimedia.org/r/1159696 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry) [13:35:00] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160138|Fix GetSecurityLogContext hook declaration (T395204)]] (duration: 11m 47s) [13:35:05] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [13:35:15] Lucas_WMDE: thank you very much!! [13:35:38] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage [13:35:58] who's next? [13:36:16] i am [13:36:18] ihurbain is next :) [13:36:47] that caused a big error spike [13:37:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P78187 and previous config saved to /var/cache/conftool/dbconfig/20250617-133704-ladsgroup.json [13:37:10] nooo :( [13:37:17] did I misremember when that patch was merged? 
[13:37:49] the tags on https://phabricator.wikimedia.org/T395204 say wmf.6 [13:38:00] I sure did [13:38:06] sorry, let me just quickly revert [13:38:11] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet [13:38:24] ihurbain: ^ heads up, tgr needs to revert [13:38:29] ack [13:38:37] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage [13:38:53] (PS2) Effie Mouzeli: kubernetes.yaml: switch mw-experimental to debug image [puppet] - https://gerrit.wikimedia.org/r/1159524 (https://phabricator.wikimedia.org/T396767) [13:39:30] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10923362 (Eevans) [13:39:34] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T395685#10923365 (Eevans) →Duplicate dup:T396970 [13:40:27] (CR) Elukey: [C:+1] CHANGELOG: add changelogs for release v2.0.0 [software/pywmflib] - https://gerrit.wikimedia.org/r/1160131 (owner: Volans) [13:41:26] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:41:45] (PS1) Gergő Tisza: Revert "Use GetSecurityLogContext hook for goodpass/badpass logging" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160147 (https://phabricator.wikimedia.org/T395204) [13:41:59] (CR) Effie Mouzeli: [C:+2] kubernetes.yaml: switch mw-experimental to debug image [puppet] - https://gerrit.wikimedia.org/r/1159524 (https://phabricator.wikimedia.org/T396767) (owner: Effie Mouzeli) [13:42:13] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160147 (https://phabricator.wikimedia.org/T395204) (owner: Gergő Tisza) [13:42:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of 
MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:42:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS bookworm [13:42:54] (Merged) jenkins-bot: Revert "Use GetSecurityLogContext hook for goodpass/badpass logging" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160147 (https://phabricator.wikimedia.org/T395204) (owner: Gergő Tisza) [13:43:16] (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v2.0.0 [software/pywmflib] - https://gerrit.wikimedia.org/r/1160131 (owner: Volans) [13:43:20] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1160147|Revert "Use GetSecurityLogContext hook for goodpass/badpass logging" (T395204)]] [13:43:25] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [13:43:46] note to self: don't use testwiki for testing deployments [13:44:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T396130)', diff saved to https://phabricator.wikimedia.org/P78188 and previous config saved to /var/cache/conftool/dbconfig/20250617-134409-marostegui.json [13:44:14] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:44:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance [13:44:27] (PS2) Arnaudb: gerrit: read-only plugin orchestration in failover [cookbooks] - https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) [13:44:27] (CR) Arnaudb: "thanks for the feedback!" 
[cookbooks] - https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: Arnaudb) [13:44:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T396130)', diff saved to https://phabricator.wikimedia.org/P78189 and previous config saved to /var/cache/conftool/dbconfig/20250617-134432-marostegui.json [13:44:43] oh, right, because it’s early in the train [13:45:14] (CR) Elukey: [V:+1 C:+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5989/console" [puppet] - https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: Elukey) [13:45:25] tgr: FWIW if you put Depends-On on the config change then scap backport should check whether the depended-on changes are in all the train branches, I believe [13:45:32] !log tgr@deploy1003 tgr: Backport for [[gerrit:1160147|Revert "Use GetSecurityLogContext hook for goodpass/badpass logging" (T395204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:45:44] instead of having to check `git branch -r --contains` manually or something like that [13:46:04] SRE-SLO, Observability-Metrics, SRE Observability (FY2024/2025-Q4): liftwing SLO performance issues - https://phabricator.wikimedia.org/T387350#10923392 (lmata) [13:46:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:46:28] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm [13:46:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:46:45] !log tgr@deploy1003 tgr: Continuing with sync [13:47:39] (PS2) Reedy: composer: Update php platform to 8.1.0 and update composer.lock [mediawiki-config] - https://gerrit.wikimedia.org/r/1160144 [13:47:57] apparently that did not work: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1153626 [13:48:23] maybe the check happens after the sync-to-testservers part? and I interrupted it there [13:48:28] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:49] (CR) Ssingh: [C:+2] hiera: set do_ech to false for durum7002 [puppet] - https://gerrit.wikimedia.org/r/1159496 (owner: Ssingh) [13:48:50] (CR) Jforrester: [C:+1] composer: Update php platform to 8.1.0 and update composer.lock [mediawiki-config] - https://gerrit.wikimedia.org/r/1160144 (owner: Reedy) [13:48:51] oh, sorry, I didn’t see that [13:48:57] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v2.0.0 [software/pywmflib] - https://gerrit.wikimedia.org/r/1160131 (owner: Volans) [13:49:05] no, I thought it happened before it asks you whether to merge this change or not y/N o_O [13:49:21] weird [13:49:38] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum7002.magru.wmnet with OS bookworm [13:49:54] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:50:30] !log jiji@cumin1002 START - Cookbook sre.ganeti.makevm for new host wikikube-worker-exp2001.codfw.wmnet [13:50:31] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [13:50:32] !log broke login for ~30 min by deploying the wrong patch (T395204) [13:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:40] T395204: MediaWiki should 
log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [13:50:49] (PS1) Volans: Upstream release v2.0.0 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/1160149 [13:51:37] (PS3) Reedy: composer: Various updates [mediawiki-config] - https://gerrit.wikimedia.org/r/1160144 [13:51:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:52:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P78190 and previous config saved to /var/cache/conftool/dbconfig/20250617-135211-ladsgroup.json [13:52:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:52:24] (CR) CI reject: [V:-1] composer: Various updates [mediawiki-config] - https://gerrit.wikimedia.org/r/1160144 (owner: Reedy) [13:53:05] is logstash down? [13:53:18] (CR) Jforrester: "Ha, will need actual fixes given new devDeps." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1160144 (owner: Reedy) [13:53:28] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:53:28] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:53:45] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160147|Revert "Use GetSecurityLogContext hook for goodpass/badpass logging" (T395204)]] (duration: 10m 24s) [13:54:22] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Adding wikikube-worker-exp1001 - jiji@cumin1002 - T397051" [13:54:24] logstash seems to be working for me [13:54:26] T397051: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051 [13:54:37] am i on? [13:54:40] I think so yeah [13:54:42] jouncebot: next [13:54:42] In 1 hour(s) and 5 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1500) [13:54:43] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Adding wikikube-worker-exp1001 - jiji@cumin1002 - T397051" [13:54:48] we’ll just run over into the break I guess [13:54:54] letsgoooo [13:55:01] 🤞 [13:55:24] Lucas_WMDE: can you check if the error spike ended, then? 
I just get errors in Logstash for some reason [13:55:26] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM wikikube-worker-exp2001.codfw.wmnet - jiji@cumin1002" [13:55:31] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM wikikube-worker-exp2001.codfw.wmnet - jiji@cumin1002" [13:55:32] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:55:32] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker-exp2001.codfw.wmnet on all recursors [13:55:33] tgr: looking [13:55:35] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker-exp2001.codfw.wmnet on all recursors [13:55:50] tgr: confirmed [13:55:57] thx [13:55:59] last message was @ 13:53:17.016 UTC [13:56:05] (sorry for all the chaos) [13:56:07] (of Error: Call to undefined method MediaWiki\Request\WebRequest::getSecurityLogContext()) [13:56:07] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10923471 (Eevans) @VRiley-WMF apologies, I never looked at the server on Friday after you closed T395685 but —looking at it now— the errors for `/dev/sdc` never went away. I'm not even sure I can ascertain f... 
[13:56:12] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs[7001-7002].magru.wmnet} and A:liberica (T397053) [13:56:16] T397053: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053 [13:56:27] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs[7001-7002].magru.wmnet} and A:liberica [13:56:57] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@90a716a]: T365813 [13:56:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs[7001-7002].magru.wmnet} and A:liberica [13:57:02] T365813: Develop a unified Content Translation (CX) metrics dashboard - https://phabricator.wikimedia.org/T365813 [13:57:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1165 T395989', diff saved to https://phabricator.wikimedia.org/P78191 and previous config saved to /var/cache/conftool/dbconfig/20250617-135706-marostegui.json [13:57:11] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [13:57:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:57:21] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs[7001-7002].magru.wmnet} and A:liberica [13:57:48] (CR) Elukey: "Now that I see the code I am wondering if it makes more sense to just split the slos::istio define into multiple pieces, so that when a ne" [puppet] - https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) (owner: Elukey) [13:57:50] (CR) Tiziano Fogli: [C:+2] "Thank you, @dcaro@wikimedia.org, and sorry for the inconvenience."
[puppet] - https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: Tiziano Fogli) [13:58:05] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs[7001-7002].magru.wmnet} and A:liberica [13:58:07] np 👍 [13:58:08] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@90a716a]: T365813 (duration: 01m 21s) [13:58:11] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs[7001-7002].magru.wmnet} and A:liberica [13:58:13] (PS1) Marostegui: db1165: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160150 (https://phabricator.wikimedia.org/T395989) [13:58:18] (CR) Volans: [C:+2] Upstream release v2.0.0 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/1160149 (owner: Volans) [13:58:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:58:44] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs[7001-7002].magru.wmnet} and A:liberica [13:58:44] (CR) Marostegui: [C:+2] db1165: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1160150 (https://phabricator.wikimedia.org/T395989) (owner: Marostegui) [13:59:05] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs[7001-7002].magru.wmnet} and A:liberica [13:59:09] (PS4) Reedy: composer: Various updates [mediawiki-config] - https://gerrit.wikimedia.org/r/1160144 [13:59:09] (PS1) Reedy: Setup json linting [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 [13:59:49] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs[7001-7002].magru.wmnet} and A:liberica [13:59:51] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs[7001-7002].magru.wmnet} and
A:liberica (T397053) [14:00:06] (CR) CI reject: [V:-1] Setup json linting [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (owner: Reedy) [14:00:08] I guess that was just my IDP session expiring mid-incident [14:01:15] (CR) Jforrester: [C:+1] "Thank you!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) (owner: Arlolra) [14:03:15] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5002.eqsin.wmnet with OS bookworm [14:03:38] jouncebot next [14:03:38] In 0 hour(s) and 56 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1500) [14:03:58] (PS1) Gerrit maintenance bot: mariadb: Promote db1201 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1160155 (https://phabricator.wikimedia.org/T397198) [14:04:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78192 and previous config saved to /var/cache/conftool/dbconfig/20250617-140402-root.json [14:04:03] (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1160156 (https://phabricator.wikimedia.org/T397198) [14:04:04] What's happening now? Can we continue with the deployment? [14:04:14] ihurbain is deploying at the moment [14:04:24] if you’re still around we can do your change after that [14:04:34] Great! let me know [14:04:57] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10923524 (Eevans) >>! In T396970#10923469, @Eevans wrote: > > [ ... ] > > Because they're in a failed state, I can't see the serial numbers anymore, but it looks like `sda` is the first device in the first c...
[14:05:00] (PS1) Gergő Tisza: Reapply "Use GetSecurityLogContext hook for goodpass/badpass logging" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160157 (https://phabricator.wikimedia.org/T395204) [14:05:02] (CR) Bernard Wang: "not really, originally was gonna be deployed alongside emtpy search next tuesday, but was moved a week earlier to help with code clean up " [mediawiki-config] - https://gerrit.wikimedia.org/r/1159542 (owner: Bernard Wang) [14:05:04] probably together with effie’s, if the puppet change was merged in the meantime [14:05:17] and cscott’s patch looks like it would be better to deploy separately imho [14:05:17] (CR) Jforrester: [C:+1] composer: Various updates [mediawiki-config] - https://gerrit.wikimedia.org/r/1160144 (owner: Reedy) [14:05:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159542 (owner: Bernard Wang) [14:05:18] (Merged) jenkins-bot: Upstream release v2.0.0 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/1160149 (owner: Volans) [14:05:23] (CR) Gergő Tisza: [C:-2] "To be deployed next week."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1160157 (https://phabricator.wikimedia.org/T395204) (owner: Gergő Tisza) [14:05:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T396130)', diff saved to https://phabricator.wikimedia.org/P78193 and previous config saved to /var/cache/conftool/dbconfig/20250617-140530-marostegui.json [14:05:35] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:05:55] (CR) Jforrester: Setup json linting (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (owner: Reedy) [14:06:10] FIRING: [6x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:06:54] (CR) Lucas Werkmeister (WMDE): "LGTM except that there’s no `package.json` ;)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (owner: Reedy) [14:07:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T382778)', diff saved to https://phabricator.wikimedia.org/P78194 and previous config saved to /var/cache/conftool/dbconfig/20250617-140718-ladsgroup.json [14:07:25] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [14:07:34] (PS2) Reedy: Setup json linting [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 [14:07:34] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1192.eqiad.wmnet with reason: Maintenance [14:07:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T382778)', diff saved to https://phabricator.wikimedia.org/P78195 and previous config saved to /var/cache/conftool/dbconfig/20250617-140741-ladsgroup.json [14:07:45] (CR) Jforrester: "`" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (owner: Reedy) [14:07:48]
(CR) Lucas Werkmeister (WMDE): Setup json linting (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (owner: Reedy) [14:07:52] Lucas_WMDE: it is merged, but even if it isnt, wont break much :) [14:08:05] heh, I guess so [14:08:09] (CR) Jforrester: Setup json linting (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (owner: Reedy) [14:08:18] worst case is that people can select a backend that will… be broken? or just route to a non-experimental server [14:08:40] !log ihurbain Deployed security patch for T397127 [14:08:50] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [14:09:03] not done yet, there's two branches [14:10:02] (PS1) Effie Mouzeli: wmnet: add mw-exprimental CNAMES [dns] - https://gerrit.wikimedia.org/r/1160158 (https://phabricator.wikimedia.org/T396767) [14:10:19] Lucas_WMDE: I am sure there are plenty of options : [14:10:20] :p [14:10:56] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10923550 (Marostegui) Any update on the status of the delivery? Thanks!
[14:10:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T382778)', diff saved to https://phabricator.wikimedia.org/P78196 and previous config saved to /var/cache/conftool/dbconfig/20250617-141058-ladsgroup.json [14:11:58] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [14:13:48] FIRING: [2x] PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:14:36] (CR) Filippo Giunchedi: "Modulo CI passing, seems reasonable to me" [alerts] - https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) (owner: Btullis) [14:15:05] !log uploaded python3-wmflib_2.0.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia,trixie-wikimedia [14:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:47] (PS1) Muehlenhoff: Fix order in site.pp for durum7003/doh7003 [puppet] - https://gerrit.wikimedia.org/r/1160161 [14:16:21] SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923587 (Martina_sanchez) [14:17:25] (Abandoned) Filippo Giunchedi: prometheus: SystemdUnitFailed as warning for data-persitence [puppet] - https://gerrit.wikimedia.org/r/1066762 (https://phabricator.wikimedia.org/T357333) (owner: Filippo Giunchedi) [14:17:31] (CR) Ssingh: [C:+1] Fix order in site.pp for durum7003/doh7003 [puppet] - https://gerrit.wikimedia.org/r/1160161 (owner: Muehlenhoff) [14:17:38] (CR) Muehlenhoff: [C:+2] Fix order in site.pp for durum7003/doh7003 [puppet] - https://gerrit.wikimedia.org/r/1160161 (owner: Muehlenhoff) [14:17:49] !log ihurbain Deployed security patch for T397127 [14:17:59] (CR) Reedy: "I guess the
primary request here is to catch the linting of the files that CI isn't going to catch otherwise (because they won't be loaded" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (owner: Reedy) [14:18:03] THERE [14:18:21] (PS3) Reedy: Setup json linting [mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (https://phabricator.wikimedia.org/T397191) [14:18:23] i'm done, y'all can continue with the deploy window [14:18:26] SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923601 (Martina_sanchez) [14:18:58] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7002.magru.wmnet with reason: host reimage [14:19:03] stephanebisson: want to self-service? [14:19:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78197 and previous config saved to /var/cache/conftool/dbconfig/20250617-141907-root.json [14:19:11] Yes, on it [14:19:23] ok! [14:19:33] (CR) Jforrester: [C:+1] "Ship it."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1160151 (https://phabricator.wikimedia.org/T397191) (owner: Reedy) [14:19:42] (CR) TrainBranchBot: [C:+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1160123 (https://phabricator.wikimedia.org/T374695) (owner: Sbisson) [14:20:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P78198 and previous config saved to /var/cache/conftool/dbconfig/20250617-142037-marostegui.json [14:22:01] (PS1) Filippo Giunchedi: team-sre: move PrometheusDown to paging [alerts] - https://gerrit.wikimedia.org/r/1160163 (https://phabricator.wikimedia.org/T393365) [14:22:03] (Merged) jenkins-bot: CX3 Build 1.0.0+20250616 [extensions/ContentTranslation] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1160123 (https://phabricator.wikimedia.org/T374695) (owner: Sbisson) [14:22:12] (PS8) CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - https://gerrit.wikimedia.org/r/1154085 [14:22:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.736s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:22:29] SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923614 (Martina_sanchez) [14:22:31] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1160123|CX3 Build 1.0.0+20250616 (T374695 T395415 T396628 T396711 T396716 T396836)]] [14:22:34] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
for 2:00:00 on durum7002.magru.wmnet with reason: host reimage [14:22:43] T374695: Community-defined Translation Collections: Support collections with multiple sub-collections - https://phabricator.wikimedia.org/T374695 [14:22:43] T395415: CX events EventGate validation errors: translation id, source section, target section values should be string - https://phabricator.wikimedia.org/T395415 [14:22:43] T396628: CX: Wrong spacing in quick tutorial step - https://phabricator.wikimedia.org/T396628 [14:22:44] T396711: CX instrumentation should use the latest schema version - https://phabricator.wikimedia.org/T396711 [14:22:44] T396716: CX instrumentation: editor_segment_edit and editor_segment_skip events are not logged - https://phabricator.wikimedia.org/T396716 [14:22:44] T396836: CX mobile editor: Action buttons for edited template provide the wrong options - https://phabricator.wikimedia.org/T396836 [14:23:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on parsoidtest1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:23:57] (CR) Scott French: wmnet: add mw-exprimental CNAMES (1 comment) [dns] - https://gerrit.wikimedia.org/r/1160158 (https://phabricator.wikimedia.org/T396767) (owner: Effie Mouzeli) [14:24:14] (PS1) Effie Mouzeli: mediawiki_experimental: allow users to lock code update [puppet] - https://gerrit.wikimedia.org/r/1160164 [14:24:43] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1160123|CX3 Build 1.0.0+20250616 (T374695 T395415 T396628 T396711 T396716 T396836)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:25:13] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host durum7004.magru.wmnet [14:25:15] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [14:25:36] SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923669 (Reedy) I'm not sure you're asking in the right place... You should be able to do this with a Toolforge account - https://wikitech.wikimedia.org/wiki/Help:Getting_Started https://idm.wikimedia.org... [14:26:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P78199 and previous config saved to /var/cache/conftool/dbconfig/20250617-142605-ladsgroup.json [14:26:28] !log sbisson@deploy1003 sbisson: Continuing with sync [14:26:37] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti1047.eqiad.wmnet to cluster eqiad and group C [14:26:45] (PS1) Marostegui: wmnet: Fix es6 and es7 CNAME [dns] - https://gerrit.wikimedia.org/r/1160165 [14:27:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.736s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:27:52] (CR) Marostegui: [C:+2] wmnet: Fix es6 and es7 CNAME [dns] - https://gerrit.wikimedia.org/r/1160165 (owner: Marostegui) [14:27:59] !log marostegui@dns1006 START - running authdns-update [14:28:51] !log marostegui@dns1006 END - running authdns-update [14:29:15] (PS2) Effie Mouzeli: wmnet: add mw-exprimental CNAMES [dns] - https://gerrit.wikimedia.org/r/1160158 (https://phabricator.wikimedia.org/T396767) [14:29:20] (PS9) CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir
var [puppet] - https://gerrit.wikimedia.org/r/1154085 [14:29:57] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:30:04] SRE, SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923697 (Martina_sanchez) Hi Reedy, Thank you for the quick response. Since this work is part of my thesis dissertation I cannot have it published (at least until I submit and present my thesis.)... [14:30:07] (CR) Clément Goubert: [C:+1] shellbox-syntaxhighlight: pilot bookworm-based httpd image (1 replica) [deployment-charts] - https://gerrit.wikimedia.org/r/1156442 (https://phabricator.wikimedia.org/T378128) (owner: Scott French) [14:30:21] (CR) Clément Goubert: [C:+1] shellbox-syntaxhighlight: migrate to bookworm-based httpd image [deployment-charts] - https://gerrit.wikimedia.org/r/1156443 (https://phabricator.wikimedia.org/T378128) (owner: Scott French) [14:30:23] (CR) Effie Mouzeli: wmnet: add mw-exprimental CNAMES (1 comment) [dns] - https://gerrit.wikimedia.org/r/1160158 (https://phabricator.wikimedia.org/T396767) (owner: Effie Mouzeli) [14:30:24] jouncebot: nowandnext [14:30:24] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [14:30:24] In 0 hour(s) and 29 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1500) [14:30:37] (CR) Elukey: [C:+1] thanos: remove old istio_sli_ recording rules [puppet] - https://gerrit.wikimedia.org/r/1160122 (https://phabricator.wikimedia.org/T394318) (owner: Filippo Giunchedi) [14:30:42] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm [14:30:43] (PS1) Volans: locking: fix unit test missing assert [software/spicerack] - https://gerrit.wikimedia.org/r/1160166 [14:30:49] jmm@cumin1003
makevm (PID 1978911) is awaiting input [14:31:19] hnowlan: backport+config window is still running [14:31:35] how are we doing, sorry lost track of where we are in the queue [14:31:36] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7004.magru.wmnet - jmm@cumin1003" [14:31:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7004.magru.wmnet - jmm@cumin1003" [14:31:41] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:31:41] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache durum7004.magru.wmnet on all recursors [14:31:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7004.magru.wmnet on all recursors [14:32:14] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7004.magru.wmnet - jmm@cumin1003" [14:32:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7004.magru.wmnet - jmm@cumin1003" [14:32:39] SRE, SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923720 (Martina_sanchez) Ultimately, I would just want to access the wikipedia db replicas like I would on Quarry with the ability to connect them to a script that is private to me so that the da... [14:33:01] cscott: once stephanebisson is done I’d deploy yours and effie’s together [14:33:05] (do you want to self-service?)
[14:33:12] (PS2) Effie Mouzeli: mediawiki_experimental: allow users to lock code update [puppet] - https://gerrit.wikimedia.org/r/1160164 [14:33:40] (CR) Herron: [C:+1] thanos: remove old istio_sli_ recording rules [puppet] - https://gerrit.wikimedia.org/r/1160122 (https://phabricator.wikimedia.org/T394318) (owner: Filippo Giunchedi) [14:33:53] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160123|CX3 Build 1.0.0+20250616 (T374695 T395415 T396628 T396711 T396716 T396836)]] (duration: 11m 21s) [14:33:59] (PS10) CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - https://gerrit.wikimedia.org/r/1154085 [14:34:02] T374695: Community-defined Translation Collections: Support collections with multiple sub-collections - https://phabricator.wikimedia.org/T374695 [14:34:03] T395415: CX events EventGate validation errors: translation id, source section, target section values should be string - https://phabricator.wikimedia.org/T395415 [14:34:04] T396628: CX: Wrong spacing in quick tutorial step - https://phabricator.wikimedia.org/T396628 [14:34:04] T396711: CX instrumentation should use the latest schema version - https://phabricator.wikimedia.org/T396711 [14:34:04] T396716: CX instrumentation: editor_segment_edit and editor_segment_skip events are not logged - https://phabricator.wikimedia.org/T396716 [14:34:04] T396836: CX mobile editor: Action buttons for edited template provide the wrong options - https://phabricator.wikimedia.org/T396836 [14:34:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78200 and previous config saved to /var/cache/conftool/dbconfig/20250617-143413-root.json [14:34:26] SRE, SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923742 (Reedy) You should be able to do this in your userspace on Toolforge... [14:34:27] I'm done.
Next [14:35:19] jmm@cumin1003 makevm (PID 1978911) is awaiting input [14:35:43] ok! [14:35:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P78201 and previous config saved to /var/cache/conftool/dbconfig/20250617-143545-marostegui.json [14:35:46] (CR) Herron: [C:+1] profile::pyrra: improve the istio SLOs template [puppet] - https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) (owner: Elukey) [14:36:07] (CR) Scott French: [C:+1] "Thanks, Effie!" [dns] - https://gerrit.wikimedia.org/r/1160158 (https://phabricator.wikimedia.org/T396767) (owner: Effie Mouzeli) [14:36:23] (PS3) Effie Mouzeli: wmnet: add mw-exprimental CNAMES [dns] - https://gerrit.wikimedia.org/r/1160158 (https://phabricator.wikimedia.org/T396767) [14:36:28] (CR) Filippo Giunchedi: [C:+2] thanos: remove old istio_sli_ recording rules [puppet] - https://gerrit.wikimedia.org/r/1160122 (https://phabricator.wikimedia.org/T394318) (owner: Filippo Giunchedi) [14:36:29] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:36:36] tgr: FWIW, https://spiderpig.wikimedia.org/jobs/206 shows [14:36:38] Change '1154069', project 'operations/puppet', branch 'production' not found in any deployed wikiversion. Deployed wikiversions: ['1.45.0-wmf.5', '1.45.0-wmf.6'] [14:36:38] Continue with backport? [14:36:43] (CR) Herron: [C:+1] "good catch!"
[puppet] - https://gerrit.wikimedia.org/r/1160076 (https://phabricator.wikimedia.org/T391852) (owner: Elukey) [14:36:44] so somewhere the Depends-On logic is still working… o_O [14:37:09] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Linter] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1160127 (https://phabricator.wikimedia.org/T393400) (owner: C. Scott Ananian) [14:37:13] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154070 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [14:37:22] (the log display is currently broken due to T397097, that’s probably unrelated) [14:37:22] T397097: SpiderPig live job log view is broken - https://phabricator.wikimedia.org/T397097 [14:38:02] (Merged) jenkins-bot: debug.json: add mw-experimental hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1154070 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [14:38:18] (CR) Effie Mouzeli: [C:+2] wmnet: add mw-exprimental CNAMES [dns] - https://gerrit.wikimedia.org/r/1160158 (https://phabricator.wikimedia.org/T396767) (owner: Effie Mouzeli) [14:38:57] !log jiji@dns1004 START - running authdns-update [14:39:00] (CR) Herron: [C:+1] team-sre: move PrometheusDown to paging [alerts] - https://gerrit.wikimedia.org/r/1160163 (https://phabricator.wikimedia.org/T393365) (owner: Filippo Giunchedi) [14:39:07] jmm@cumin1003 makevm (PID 1978911) is awaiting input [14:39:14] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host durum7004.magru.wmnet with OS bookworm [14:40:00] !log jiji@dns1004 END - running authdns-update [14:40:29] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:41:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after
maintenance db1192', diff saved to https://phabricator.wikimedia.org/P78202 and previous config saved to /var/cache/conftool/dbconfig/20250617-144112-ladsgroup.json [14:41:28] Lucas_WMDE: Sorry about the log problem. I will deploy a new release of scap after your deployment which will revert that change. [14:41:34] SRE, SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923839 (Bugreporter) You can use https://paws.wmcloud.org/ which you can also run your script inside a terminal. [14:42:19] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7002.magru.wmnet with OS bookworm [14:42:19] (PS11) CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - https://gerrit.wikimedia.org/r/1154085 [14:42:29] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:42:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:magru and not P{cp7002*} and A:cp - 9.2.10 upgrade (T390912) [14:42:44] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [14:43:11] !log dancy@deploy1003 Installing scap version "4.176.0" for 2 host(s) [14:43:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:43:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:43:52] o_O spiderpig / scap failed [14:43:57] (PS13) CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - https://gerrit.wikimedia.org/r/1153334 [14:44:00] Could not find a suitable TLS CA certificate bundle [14:44:08] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync [14:44:13] dancy: related? are you deploying a new scap version already by any chance?
[14:44:18] (https://spiderpig.wikimedia.org/jobs/206) [14:44:20] Lucas_WMDE: That was probably me. Stand by for a minute and then I'll ask you to retry [14:44:23] ack [14:44:43] zuul says there’s 1 min ETA for the core backport’s CI anyway :P [14:44:45] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [14:44:53] (CR) Tiziano Fogli: [C:+1] "LGTM. Should we maybe also add a critical alert for the PoP instances?" [alerts] - https://gerrit.wikimedia.org/r/1160163 (https://phabricator.wikimedia.org/T393365) (owner: Filippo Giunchedi) [14:45:02] !log dancy@deploy1003 Installation of scap version "4.176.0" completed for 2 hosts [14:45:39] (Merged) jenkins-bot: stats: Add buckets based on wikitext size; fix increment bug [extensions/Linter] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1160127 (https://phabricator.wikimedia.org/T393400) (owner: C. Scott Ananian) [14:45:59] SRE, Infrastructure-Foundations: Deal with archival of Buster on Debian mirrors - https://phabricator.wikimedia.org/T397209 (MoritzMuehlenhoff) NEW [14:46:00] Lucas_WMDE: Ready for retry. [14:46:05] SRE, Infrastructure-Foundations: Deal with archival of Buster on Debian mirrors - https://phabricator.wikimedia.org/T397209#10923885 (MoritzMuehlenhoff) p:Triage→High [14:46:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:46:14] started https://spiderpig.wikimedia.org/jobs/207 [14:46:26] OK. Live job log is back. Sorry for the disruption. [14:46:31] yay [14:46:34] thanks!
[14:46:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:46:45] for a second I was worried because the console was still blank during the “not found in any deployed wikiversion” prompt [14:46:48] but now it’s showing up fine [14:46:49] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]], [[gerrit:1154070|debug.json: add mw-experimental hosts (T276994)]] [14:46:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:46:54] T393400: Create metric to measure Parsoid speed on small/medium/large pages - https://phabricator.wikimedia.org/T393400 [14:46:54] T276994: Provide an mwdebug functionality on kubernetes (mw-experimental) - https://phabricator.wikimedia.org/T276994 [14:47:29] 06SRE, 10SRE-Access-Requests: Access to Wikipedia DB Replicas (SSH) - https://phabricator.wikimedia.org/T397200#10923896 (10herron) 05Open→03Invalid >>! In T397200#10923720, @Martina_sanchez wrote: > Do you have any pointers as to how I could achieve this? Hi @Martina_sanchez please see https://wikite... 
[14:48:15] (CR) Volans: [C:+2] "trivial, self-merging" [software/spicerack] - https://gerrit.wikimedia.org/r/1160166 (owner: Volans) [14:48:29] (PS12) CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - https://gerrit.wikimedia.org/r/1154085 [14:48:37] (PS1) Muehlenhoff: No longer use mirrors.debian.org on Buster hosts [puppet] - https://gerrit.wikimedia.org/r/1160171 [14:49:02] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, jiji, cscott: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]], [[gerrit:1154070|debug.json: add mw-experimental hosts (T276994)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:49:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78203 and previous config saved to /var/cache/conftool/dbconfig/20250617-144918-root.json [14:49:39] (PS1) Effie Mouzeli: otel: add tolerations for mw-experimental hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1160173 (https://phabricator.wikimedia.org/T396767) [14:50:06] Lucas_WMDE: thanks! [14:50:15] (CR) Tiziano Fogli: "I've added comments inline. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse) [14:50:28] cscott: can you test the change on mwdebug?
[14:50:28] Lucas_WMDE: it might take a little while to see if metrics are coming in from the test servers [14:50:31] ack [14:50:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T396130)', diff saved to https://phabricator.wikimedia.org/P78204 and previous config saved to /var/cache/conftool/dbconfig/20250617-145052-marostegui.json [14:50:58] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:51:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2202.codfw.wmnet with reason: Maintenance [14:51:19] (Abandoned) Herron: thanos-store: increase store cache size to 24GB [puppet] - https://gerrit.wikimedia.org/r/1105395 (https://phabricator.wikimedia.org/T368953) (owner: Herron) [14:51:42] effie: I can see the experimental hosts in the extension already o_O [14:51:56] does the extension get the debug.json file itself from a WikimediaDebug server? [14:52:16] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:ulsfo and A:cp - 9.2.10 upgrade (T390912) [14:52:21] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [14:52:25] (Abandoned) Herron: thanos-store: manage and increase chunk-pool-size setting [puppet] - https://gerrit.wikimedia.org/r/1105389 (https://phabricator.wikimedia.org/T368953) (owner: Herron) [14:54:39] Lucas_WMDE: we did something a couple of years ago, so adding/removing hosts wouldn't need to redeploy the extension [14:54:43] Lucas_WMDE: good to continue, no errors seen [14:55:44] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, jiji, cscott: Continuing with sync [14:55:46] nice, thanks!
[14:56:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T382778)', diff saved to https://phabricator.wikimedia.org/P78205 and previous config saved to /var/cache/conftool/dbconfig/20250617-145619-ladsgroup.json [14:56:24] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [14:56:28] (03CR) 10Ahmon Dancy: [C:03+1] "Seems reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/1160119 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [14:56:35] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1203.eqiad.wmnet with reason: Maintenance [14:56:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T382778)', diff saved to https://phabricator.wikimedia.org/P78206 and previous config saved to /var/cache/conftool/dbconfig/20250617-145642-ladsgroup.json [14:58:20] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] "Sounds good, will be done in a followup change" [alerts] - 10https://gerrit.wikimedia.org/r/1160163 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [14:58:39] (03Merged) 10jenkins-bot: locking: fix unit test missing assert [software/spicerack] - 10https://gerrit.wikimedia.org/r/1160166 (owner: 10Volans) [14:59:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T382778)', diff saved to https://phabricator.wikimedia.org/P78207 and previous config saved to /var/cache/conftool/dbconfig/20250617-145958-ladsgroup.json [15:00:05] jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1500). 
[15:01:39] I’m still deploying, very sorry [15:01:46] didn’t notice the next window was coming up /o\ [15:01:50] should be done soon [15:02:48] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]], [[gerrit:1154070|debug.json: add mw-experimental hosts (T276994)]] (duration: 15m 59s) [15:02:55] T393400: Create metric to measure Parsoid speed on small/medium/large pages - https://phabricator.wikimedia.org/T393400 [15:02:56] T276994: Provide an mwdebug functionality on kubernetes (mw-experimental) - https://phabricator.wikimedia.org/T276994 [15:02:59] !log extra-long UTC afternoon backport+config window done [15:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] * Lucas_WMDE done deploying [15:03:28] Lucas_WMDE: thanks! [15:04:01] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7004.magru.wmnet with reason: host reimage [15:05:07] (PS1) Filippo Giunchedi: team-sre: check PoPs for PrometheusDown [alerts] - https://gerrit.wikimedia.org/r/1160177 (https://phabricator.wikimedia.org/T393365) [15:06:27] SRE, Infrastructure-Foundations, vm-requests, Patch-For-Review: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10924029 (jijiki) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7004.magru.wmnet with reason: host reimage [15:08:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2212.codfw.wmnet with reason: Maintenance [15:08:47] (PS1) Zabe: filtered_tables: Add new
categorylinks columns [puppet] - 10https://gerrit.wikimedia.org/r/1160178 (https://phabricator.wikimedia.org/T299951) [15:10:10] (03PS4) 10Seanleong-wmde: Create feature flags for resolving Wikibase item labels on Watchlist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [15:10:45] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10924062 (10xcollazo) CC @BTullis [15:11:16] (03CR) 10Xcollazo: [C:03+1] Mark HTTP(S) traffic from dumps with low-priority QoS mark [puppet] - 10https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [15:13:03] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki_experimental: allow users to lock code update [puppet] - 10https://gerrit.wikimedia.org/r/1160164 (owner: 10Effie Mouzeli) [15:14:23] (03PS6) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) [15:15:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P78208 and previous config saved to /var/cache/conftool/dbconfig/20250617-151505-ladsgroup.json [15:15:39] (03Abandoned) 10Ryan Kemper: wdqs: add SLIs for main & scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1148976 (owner: 10Ryan Kemper) [15:16:00] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10924112 (10xcollazo) @cmooney: +1 to the change. Can you please share the link to this dashboard? 
[15:16:23] (03PS1) 10Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:18] (03PS7) 10Effie Mouzeli: site.pp: make wikikube-worker-exp1001 a k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) [15:19:42] (03CR) 10Scott French: "Thanks for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156442 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [15:19:52] (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: pilot bookworm-based httpd image (1 replica) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156442 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [15:21:34] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: pilot bookworm-based httpd image (1 replica) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156442 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [15:21:48] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10924152 (10cmooney) >>! In T397153#10924112, @xcollazo wrote: > @cmooney: +1 to the change. > > Can you please share the link to this dashboard?... 
[15:22:37] (PS2) Mimurawil: Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) [15:22:44] !log dancy@deploy1003 Installing scap version "4.177.0" for 2 host(s) [15:23:22] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:23:43] (CR) Mimurawil: Configure instrument for CheckUser - UserInfoCard (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: Mimurawil) [15:23:46] jmm@cumin1003 addnode (PID 1979025) is awaiting input [15:23:48] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:24:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7004.magru.wmnet with OS bookworm [15:24:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7004.magru.wmnet [15:24:15] (PS2) Vgutierrez: hiera: Issue a separate GTS cert for upload cache cluster [puppet] - https://gerrit.wikimedia.org/r/1159511 (https://phabricator.wikimedia.org/T394484) [15:24:32] !log dancy@deploy1003 Installation of scap version "4.177.0" completed for 2 hosts [15:25:14] !log dancy@deploy1003 Started scap sync-world: Testing T396166 [15:25:18] T396166: Are `php_fpm`/`php_version` inside `scap.cfg` used anymore?
- https://phabricator.wikimedia.org/T396166 [15:25:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2216.codfw.wmnet with reason: Maintenance [15:25:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T396130)', diff saved to https://phabricator.wikimedia.org/P78209 and previous config saved to /var/cache/conftool/dbconfig/20250617-152555-marostegui.json [15:26:00] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:26:39] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1047.eqiad.wmnet to cluster eqiad and group C [15:28:00] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-backup2001.codfw.wmnet with reason: Maintenance and reboot [15:28:51] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:28:57] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:29:00] !log dancy@deploy1003 Finished scap sync-world: Testing T396166 (duration: 03m 46s) [15:30:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P78210 and previous config saved to /var/cache/conftool/dbconfig/20250617-153013-ladsgroup.json [15:33:23] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1175 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [15:33:55] !log stopping puppet on A:wikikube-worker and A:eqiad [15:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:47] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: make wikikube-worker-exp1001 a k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) 
[15:42:36] (03PS3) 10Ebernhardson: Bump ltr plugin to 1.5.4-wmf1-os1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1143156 (https://phabricator.wikimedia.org/T317599) [15:43:13] (03PS4) 10Ebernhardson: Update plugins for extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1143156 (https://phabricator.wikimedia.org/T317599) [15:43:20] (03CR) 10Ssingh: [C:03+1] "let's merge this, given that the first two choices remain the same and whatever ambiguity is there is between the other ones. It is also h" [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:44:15] (03CR) 10Ssingh: add rest of South America (except Falkland Islands) to geo-maps (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:44:51] (03Abandoned) 10Ssingh: hiera: set do_ech to false for durum7003 [puppet] - 10https://gerrit.wikimedia.org/r/1159497 (owner: 10Ssingh) [15:45:01] (03CR) 10Ssingh: "reimaged to insetup; part of routed ganeti." 
[puppet] - 10https://gerrit.wikimedia.org/r/1159497 (owner: 10Ssingh) [15:45:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T382778)', diff saved to https://phabricator.wikimedia.org/P78211 and previous config saved to /var/cache/conftool/dbconfig/20250617-154520-ladsgroup.json [15:45:25] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [15:45:36] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1209.eqiad.wmnet with reason: Maintenance [15:45:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T382778)', diff saved to https://phabricator.wikimedia.org/P78212 and previous config saved to /var/cache/conftool/dbconfig/20250617-154542-ladsgroup.json [15:46:31] (03PS14) 10CDobbins: add rest of South America (except CO, EC, VE, & FK) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [15:47:02] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:47:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T396130)', diff saved to https://phabricator.wikimedia.org/P78213 and previous config saved to /var/cache/conftool/dbconfig/20250617-154709-marostegui.json [15:47:13] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:47:29] (03PS15) 10CDobbins: add CO, EC, and VE to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [15:47:35] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:47:48] (03CR) 10CDobbins: add CO, EC, and VE to geo-maps (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:48:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T382778)', diff saved to https://phabricator.wikimedia.org/P78214 and previous config saved to 
/var/cache/conftool/dbconfig/20250617-154850-ladsgroup.json [15:49:27] (03CR) 10Ssingh: [C:03+1] add CO, EC, and VE to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:50:11] (03CR) 10CDobbins: add CO, EC, and VE to geo-maps (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:50:39] (03CR) 10CDobbins: [C:03+2] add CO, EC, and VE to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:53:32] !log cdobbins@dns1004 START - running authdns-update [15:54:50] !log cdobbins@dns1004 END - running authdns-update [15:56:47] (03PS1) 10Effie Mouzeli: regex.yaml: make wikikube-worker-exp1001 an mw-experimental host [puppet] - 10https://gerrit.wikimedia.org/r/1160187 [16:00:00] (03CR) 10Hnowlan: [C:03+1] regex.yaml: make wikikube-worker-exp1001 an mw-experimental host [puppet] - 10https://gerrit.wikimedia.org/r/1160187 (owner: 10Effie Mouzeli) [16:00:05] jhathaway and moritzm: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1600) [16:00:06] No Gerrit patches in the queue for this window AFAICS. 
[16:00:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:00:40] (03CR) 10Hnowlan: [C:03+1] regex.yaml: make wikikube-worker-exp1001 an mw-experimental host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1160187 (owner: 10Effie Mouzeli) [16:00:49] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:01:23] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54082 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:01:26] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-drmrs and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:01:30] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [16:01:39] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:01:57] (03PS2) 10Effie Mouzeli: regex.yaml: make wikikube-worker-exp1001 an mw-experimental host [puppet] - 10https://gerrit.wikimedia.org/r/1160187 [16:02:09] (03CR) 10Effie Mouzeli: regex.yaml: make wikikube-worker-exp1001 an mw-experimental host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1160187 (owner: 10Effie Mouzeli) [16:02:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P78215 and previous config saved to /var/cache/conftool/dbconfig/20250617-160216-marostegui.json [16:02:44] (03CR) 10Hnowlan: [C:03+1] regex.yaml: make wikikube-worker-exp1001 an mw-experimental host [puppet] - 10https://gerrit.wikimedia.org/r/1160187 (owner: 10Effie Mouzeli) [16:02:46] (03CR) 10Effie Mouzeli: [C:03+2] 
regex.yaml: make wikikube-worker-exp1001 an mw-experimental host [puppet] - 10https://gerrit.wikimedia.org/r/1160187 (owner: 10Effie Mouzeli) [16:03:16] (03CR) 10BCornwall: [C:03+2] Revert^2 "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159592 (owner: 10BCornwall) [16:03:28] FIRING: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:35] (03CR) 10Scott French: [C:03+1] regex.yaml: make wikikube-worker-exp1001 an mw-experimental host [puppet] - 10https://gerrit.wikimedia.org/r/1160187 (owner: 10Effie Mouzeli) [16:03:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P78216 and previous config saved to /var/cache/conftool/dbconfig/20250617-160357-ladsgroup.json [16:04:24] (03CR) 10Vgutierrez: [C:03+2] hiera: Issue a separate GTS cert for upload cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159511 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:05:04] (03PS4) 10Elukey: profile::pyrra: improve the istio SLOs template [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) [16:05:04] (03PS4) 10Elukey: profile::pyrra: fix Citoid's SLO targets [puppet] - 10https://gerrit.wikimedia.org/r/1160076 (https://phabricator.wikimedia.org/T391852) [16:05:05] (03PS2) 10Elukey: profile::pyrra: add SLO ratio for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) [16:05:36] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:05:54] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:07:44] (03PS1) 10D3r1ck01: Revert^2 
"JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1160190 [16:08:07] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5999/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [16:08:41] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bullseye [16:08:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10924494 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1016.eqiad.wmnet with OS bullseye [16:09:57] (03CR) 10Elukey: [V:03+1] "@kherron@wikimedia.org hi! Fixed a little issue with PCC, namely the need for "= undef" in the optional params. Now PCC shows a little cha" [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [16:10:53] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6000/co" [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [16:11:05] (03Abandoned) 10D3r1ck01: Revert^2 "JCCache: Use WANObjectCache::getWithSetCallback() instead of set/get" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1160190 (owner: 10D3r1ck01) [16:11:17] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T396703#10924499 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF drive has been replaced [16:12:14] (03CR) 10Dzahn: [C:03+2] miscweb: delete miscweb::rsync profile [puppet] - 10https://gerrit.wikimedia.org/r/1159550 
(https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [16:12:19] (03CR) 10Elukey: "@kherron@wikimedia.org I totally missed the ratio SLO in https://wikitech.wikimedia.org/wiki/SLO/Citoid#Reconciliation, so I tried to add " [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [16:13:27] PROBLEM - MegaRAID on analytics1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:13:28] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:13:29] ACKNOWLEDGEMENT - MegaRAID on analytics1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T397231 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:13:38] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on analytics1073 - https://phabricator.wikimedia.org/T397231 (10ops-monitoring-bot) 03NEW [16:14:23] (03CR) 10Dzahn: [C:03+1] gitlab-runner: upgrade default image to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1160119 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [16:14:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:45] (03CR) 10Dzahn: [C:03+1] gitlab-runner: upgrade default image to bookworm on Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/1160120 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [16:17:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to 
https://phabricator.wikimedia.org/P78217 and previous config saved to /var/cache/conftool/dbconfig/20250617-161723-marostegui.json [16:18:28] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:33] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:ulsfo and A:cp - 9.2.10 upgrade (T390912) [16:18:37] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [16:19:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P78218 and previous config saved to /var/cache/conftool/dbconfig/20250617-161904-ladsgroup.json [16:22:04] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1016.eqiad.wmnet with reason: host reimage [16:22:05] !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-backup2001.codfw.wmnet: Renew puppet certificate - root@cumin1002 [16:22:39] (03CR) 10Dzahn: [C:03+2] "manually stopped rsyncd on both machines" [puppet] - 10https://gerrit.wikimedia.org/r/1159550 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [16:24:38] !log cdobbins@cumin2002:~$ sudo -i cookbook --dry-run sre.cdn.roll-upgrade-varnish --query 'P{cp30[66-81].esams.wmnet}' --reason 'Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0' --task-id T396581 [16:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:43] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [16:24:54] !log cdobbins@cumin2002:~$ sudo -i cookbook sre.cdn.roll-upgrade-varnish --query 'P{cp30[66-81].esams.wmnet}' --reason 'Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0' 
--task-id T396581 [16:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:06] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp30[66-81].esams.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:25:32] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1016.eqiad.wmnet with reason: host reimage [16:28:28] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:32:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T396130)', diff saved to https://phabricator.wikimedia.org/P78220 and previous config saved to /var/cache/conftool/dbconfig/20250617-163231-marostegui.json [16:32:36] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [16:34:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T382778)', diff saved to https://phabricator.wikimedia.org/P78221 and previous config saved to /var/cache/conftool/dbconfig/20250617-163412-ladsgroup.json [16:34:17] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [16:34:28] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1214.eqiad.wmnet with reason: Maintenance [16:34:33] (03CR) 10Herron: "Oh interesting, thanks for the patch! 
As I understand the success ratio SLO counts non 2xx responses as errors, and the existing requests" [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [16:34:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T382778)', diff saved to https://phabricator.wikimedia.org/P78222 and previous config saved to /var/cache/conftool/dbconfig/20250617-163434-ladsgroup.json [16:35:12] (03PS2) 10Tchanders: Assign global IP viewer to stewards, to avoid log spam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159987 (https://phabricator.wikimedia.org/T397224) [16:35:32] (03CR) 10Tchanders: [C:04-2] "Needs community approval" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159987 (https://phabricator.wikimedia.org/T397224) (owner: 10Tchanders) [16:36:07] (03PS3) 10Tchanders: Revert "Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159580 (https://phabricator.wikimedia.org/T397224) [16:36:29] (03CR) 10Tchanders: [C:04-2] "Needs community approval" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159580 (https://phabricator.wikimedia.org/T397224) (owner: 10Tchanders) [16:36:55] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10924650 (10xcollazo) Now that I think more about this: I don't know where in puppet, but I am aware that we throttle any individual download to 3-6... [16:37:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T382778)', diff saved to https://phabricator.wikimedia.org/P78223 and previous config saved to /var/cache/conftool/dbconfig/20250617-163746-ladsgroup.json [16:38:32] (03CR) 10Herron: [C:03+1] "thanks! 
please feel free to ping me if you'd like a second set of eyes while rolling this out" [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [16:39:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:11] (03CR) 10Herron: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1160075 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [16:44:25] (03CR) 10Brennen Bearnes: [C:03+1] gitlab-runner: upgrade default image to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1160119 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [16:45:51] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1160071 (https://phabricator.wikimedia.org/T397153) (owner: 10Cathal Mooney) [16:47:22] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [16:47:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10924709 (10herron) >>! In T397004#10920139, @AndyRussG_volunteer wrote: > - I created a new SSH key for `andrew.green@extern.wikimedia.de` and added it in Gerrit, as per [[... 
[16:47:48] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM wikikube-worker-exp2001.codfw.wmnet - jiji@cumin1002" [16:49:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10924723 (10herron) [16:50:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10924726 (10AndyRussG_volunteer) >>! In T397004#10924709, @herron wrote: >>>! In T397004#10920139, @AndyRussG_volunteer wrote: >> - I created a new SSH key for `andrew.green@... [16:50:52] jiji@cumin1002 makevm (PID 3563660) is awaiting input [16:51:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10924727 (10herron) Yes please 👍 [16:52:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P78224 and previous config saved to /var/cache/conftool/dbconfig/20250617-165253-ladsgroup.json [16:53:16] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10924743 (10AndyRussG_volunteer) >>! In T397004#10924727, @herron wrote: > Yes please 👍 ok! thanks! it is: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPk5bd6dpChEOm8RMFGriNY5vzI6E... [16:54:29] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [16:54:49] (03PS1) 10BCornwall: hiera: override interface names for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/1160202 (https://phabricator.wikimedia.org/T387145) [16:55:52] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10924754 (10cmooney) >>! 
In T397153#10924650, @xcollazo wrote: > Perhaps this also includes rsync traffic? Yeah the throughput graphs include all... [16:56:19] (03CR) 10Ssingh: [C:03+1] hiera: override interface names for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/1160202 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [16:56:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10924762 (10herron) Thanks! I've just emailed you as well for the out of band verification step [16:57:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:57:32] (03CR) 10BCornwall: [C:03+2] hiera: override interface names for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/1160202 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [16:57:39] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 [16:57:47] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 [16:57:48] (03PS3) 10Mimurawil: Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) [16:58:20] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1700) [17:00:12] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM wikikube-worker-exp2001.codfw.wmnet - jiji@cumin1002" [17:00:41] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: 
Host reimage - brett@cumin2002" [17:02:24] (03PS1) 10Dzahn: Revert "phabricator: comment out scap::target in migration class" [puppet] - 10https://gerrit.wikimedia.org/r/1160204 [17:02:26] (03PS1) 10Jsn.sherman: undeploy enwiki Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160206 (https://phabricator.wikimedia.org/T396250) [17:02:27] (03PS1) 10Jgreen: nsca_frack.cfg.erb break out trino hostgroup, add trino API check [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) [17:02:53] (03PS2) 10Dzahn: Revert "phabricator: comment out scap::target in migration class" [puppet] - 10https://gerrit.wikimedia.org/r/1160204 (https://phabricator.wikimedia.org/T377889) [17:03:10] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephosd200[567] service implementation - https://phabricator.wikimedia.org/T397237 (10Andrew) 03NEW [17:03:46] brett@cumin2002 reimage (PID 1484312) is awaiting input [17:03:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160206 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [17:04:09] jiji@cumin1002 makevm (PID 3563660) is awaiting input [17:05:03] (03CR) 10Dzahn: [C:03+2] Revert "phabricator: comment out scap::target in migration class" [puppet] - 10https://gerrit.wikimedia.org/r/1160204 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [17:05:09] (03CR) 10CI reject: [V:04-1] Revert "phabricator: comment out scap::target in migration class" [puppet] - 10https://gerrit.wikimedia.org/r/1160204 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [17:05:25] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker-exp2001.codfw.wmnet with OS bookworm [17:05:32] 06SRE, 06Infrastructure-Foundations, 10vm-requests: 2 VMs for 
mw-experimental - https://phabricator.wikimedia.org/T397051#10924809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host wikikube-worker-exp2001.codfw.wmnet with OS bookworm [17:05:38] (03CR) 10Scott French: "Pilot looks good. Moving ahead with this. Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156443 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:05:40] (03CR) 10Scott French: [C:03+2] shellbox-syntaxhighlight: migrate to bookworm-based httpd image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156443 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:07:19] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: migrate to bookworm-based httpd image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156443 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:08:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P78225 and previous config saved to /var/cache/conftool/dbconfig/20250617-170800-ladsgroup.json [17:08:16] (03PS3) 10Dzahn: Revert "phabricator: comment out scap::target in migration class" [puppet] - 10https://gerrit.wikimedia.org/r/1160204 (https://phabricator.wikimedia.org/T377889) [17:08:40] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:08:50] (03PS4) 10Dzahn: Revert "phabricator: comment out scap::target in migration class" [puppet] - 10https://gerrit.wikimedia.org/r/1160204 (https://phabricator.wikimedia.org/T377889) [17:09:18] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:10:10] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - brett@cumin2002" [17:10:11] !log brett@cumin2002 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1016.eqiad.wmnet with OS bullseye [17:10:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10924834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1016.eqiad.wmnet with OS bullseye completed: - lvs1016 (**PASS**)... [17:12:14] (03CR) 10Andrew Bogott: [C:03+2] Add radosgw access for members of the new 'object_storage' role. [puppet] - 10https://gerrit.wikimedia.org/r/1155775 (https://phabricator.wikimedia.org/T396594) (owner: 10Andrew Bogott) [17:12:18] (03CR) 10Dzahn: [C:03+2] Revert "phabricator: comment out scap::target in migration class" [puppet] - 10https://gerrit.wikimedia.org/r/1160204 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [17:12:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10924863 (10BCornwall) [17:12:58] (03CR) 10JHathaway: "generally looks good, a couple of thoughts" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [17:15:21] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:15:27] (03CR) 10Kosta Harlan: Configure instrument for CheckUser - UserInfoCard (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [17:17:07] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10924888 (10herron) [17:17:44] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10924891 (10herron) [17:18:17] 
!log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:18:49] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:18:59] (03PS1) 10C. Scott Ananian: stats: Add buckets based on wikitext size; fix increment bug [extensions/Linter] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1160210 (https://phabricator.wikimedia.org/T393400) [17:19:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/Linter] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1160210 (https://phabricator.wikimedia.org/T393400) (owner: 10C. Scott Ananian) [17:19:17] !log migrated shellbox-syntaxhighlight to bookworm-based httpd images - T378128 [17:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:23] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:21:12] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:22:11] (03CR) 10Kosta Harlan: [C:03+1] Configure instrument for CheckUser - UserInfoCard (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [17:22:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10924906 (10BCornwall) [17:23:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T382778)', diff saved to https://phabricator.wikimedia.org/P78226 and previous config saved to /var/cache/conftool/dbconfig/20250617-172308-ladsgroup.json [17:23:13] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [17:23:16] !log 
jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker-exp2001.codfw.wmnet with reason: host reimage [17:23:23] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1226.eqiad.wmnet with reason: Maintenance [17:23:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T382778)', diff saved to https://phabricator.wikimedia.org/P78227 and previous config saved to /var/cache/conftool/dbconfig/20250617-172330-ladsgroup.json [17:25:51] !log homer "cr*-eqiad*" commit "enable BGP on lvs1016" - T387145 [17:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:56] T387145: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145 [17:26:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T382778)', diff saved to https://phabricator.wikimedia.org/P78228 and previous config saved to /var/cache/conftool/dbconfig/20250617-172622-ladsgroup.json [17:26:41] (03PS1) 10Herron: admin: add andyrussg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1160211 (https://phabricator.wikimedia.org/T397004) [17:26:56] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker-exp2001.codfw.wmnet with reason: host reimage [17:27:48] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 [17:27:52] T377889: install a service on phab1005 - https://phabricator.wikimedia.org/T377889 [17:28:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10924947 (10BCornwall) [17:28:06] vriley@cumin1002 provision (PID 3829476) is awaiting input [17:28:11] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (duration: 00m 23s) [17:28:31] (03CR) 10Andrew Bogott: [C:03+2] 
Keystone: apply upstream patch allowing non-uuid project ids in trusts [puppet] - 10https://gerrit.wikimedia.org/r/1158849 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:28:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:31:17] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephosd200[567] service implementation - https://phabricator.wikimedia.org/T397237#10924971 (10dcaro) p:05Triage→03High [17:31:40] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudcephosd200[567] service implementation - https://phabricator.wikimedia.org/T397237#10924978 (10dcaro) [17:31:48] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudcephosd200[567] service implementation - https://phabricator.wikimedia.org/T397237#10924980 (10dcaro) [17:32:20] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:34:16] (03PS4) 10Mimurawil: Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) [17:34:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10925010 (10dcaro) p:05Triage→03High [17:34:43] (03CR) 10Mimurawil: Configure instrument for CheckUser - UserInfoCard (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [17:35:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10925011 
(10KFrancis) Hi all, I'm confirming the NDA is complete. Thanks! [17:36:20] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:37:04] (03PS1) 10Herron: admin: add ldap_only entry for derhexer [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) [17:37:32] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [17:37:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10925030 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... [17:38:31] !log stopping pybal on lvs1017 to move traffic over to lvs1020 - T387145 [17:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:36] T387145: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145 [17:39:23] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10925045 (10herron) >>! In T397099#10925025, @gerritbot wrote: > Change #1160216 had a related patch set uploaded (by Herron; author: Herron): > %%%[operations/puppet@productio... 
[17:39:27] (03CR) 10Herron: "to be reviewed and merged ahead of ldap groupadd to nda" [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) (owner: 10Herron) [17:40:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10925046 (10herron) [17:40:52] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:40:54] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:41:21] ^ brett is working on it, expected [17:41:30] ty! [17:41:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P78229 and previous config saved to /var/cache/conftool/dbconfig/20250617-174130-ladsgroup.json [17:41:41] (03PS1) 10Dzahn: phabricator::migration: add /etc/phabricator/script-vars for scap [puppet] - 10https://gerrit.wikimedia.org/r/1160217 (https://phabricator.wikimedia.org/T377889) [17:41:50] (03PS1) 10Bvibber: Enable JSON transforms for Chart+JsonConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160218 (https://phabricator.wikimedia.org/T388616) [17:41:52] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [17:42:03] brett: should we downtime it I guess? [17:42:08] yep, on it! [17:42:10] a bit split usually [17:42:13] but in this case it is going away [17:42:14] ok [17:42:45] i have a config backport, should i wait or ok to push it? 
[17:42:59] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on lvs1017.eqiad.wmnet with reason: T387145 [17:43:07] bvibber: no reason to wait I think, we have tested this transition in the past so should be ok™ :) [17:43:09] bvibber: You should be fine [17:43:10] (03PS2) 10Herron: admin: add andyrussg to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1160211 (https://phabricator.wikimedia.org/T397004) [17:43:13] cool thx :D [17:43:25] (03PS2) 10Dzahn: phabricator::migration: add /etc/phabricator/script-vars for scap [puppet] - 10https://gerrit.wikimedia.org/r/1160217 (https://phabricator.wikimedia.org/T377889) [17:43:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160218 (https://phabricator.wikimedia.org/T388616) (owner: 10Bvibber) [17:44:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10925056 (10BCornwall) [17:44:43] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker-exp2001.codfw.wmnet with OS bookworm [17:44:43] !log jiji@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host wikikube-worker-exp2001.codfw.wmnet [17:44:48] 06SRE, 06Infrastructure-Foundations, 10vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10925058 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host wikikube-worker-exp2001.codfw.wmnet with OS bookworm completed: - wikikube-worker-e... 
[17:44:52] (03Merged) 10jenkins-bot: Enable JSON transforms for Chart+JsonConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160218 (https://phabricator.wikimedia.org/T388616) (owner: 10Bvibber) [17:45:18] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1160218|Enable JSON transforms for Chart+JsonConfig (T388616)]] [17:45:22] T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616 [17:45:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, but note that the previous access had Kerberos access, so it's worth asking on task if it's also needed for the new position (" [puppet] - 10https://gerrit.wikimedia.org/r/1160211 (https://phabricator.wikimedia.org/T397004) (owner: 10Herron) [17:48:03] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1160218|Enable JSON transforms for Chart+JsonConfig (T388616)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:48:32] (03PS1) 10Btullis: Failover hive and presto to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1160222 (https://phabricator.wikimedia.org/T394499) [17:48:43] (03CR) 10Herron: [C:03+2] "Thanks! Good idea, will mention krb on task" [puppet] - 10https://gerrit.wikimedia.org/r/1160211 (https://phabricator.wikimedia.org/T397004) (owner: 10Herron) [17:48:44] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10925084 (10Dzahn) Hi all, just wanted to say it would be uncommon to have a WMDE engineer that is in LDAP group `nda` but not also in LDAP group `wmde... 
[17:49:05] !log bvibber@deploy1003 bvibber: Continuing with sync [17:52:13] (03CR) 10Btullis: [C:03+2] Failover hive and presto to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1160222 (https://phabricator.wikimedia.org/T394499) (owner: 10Btullis) [17:52:34] !log btullis@dns1004 START - running authdns-update [17:53:29] !log btullis@dns1004 END - running authdns-update [17:53:48] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Degraded RAID on analytics1073 - https://phabricator.wikimedia.org/T397231#10925113 (10BTullis) a:03BTullis [17:54:03] vriley@cumin1002 reimage (PID 3846950) is awaiting input [17:54:42] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10925116 (10AndyRussG_volunteer) >>! In T397004#10925084, @Dzahn wrote: > just wanted to say it would be uncommon to have a WMDE engineer that is in LDA... [17:55:39] (03PS1) 10AOkoth: site: decom doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/1160225 (https://phabricator.wikimedia.org/T392130) [17:56:05] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160218|Enable JSON transforms for Chart+JsonConfig (T388616)]] (duration: 10m 47s) [17:56:09] T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616 [17:56:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P78230 and previous config saved to /var/cache/conftool/dbconfig/20250617-175637-ladsgroup.json [17:58:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10925135 (10herron) Hi @AndyRussG_volunteer you should have just received an email regarding kerberos, and I'll update the account data to reflect krb:... 
[17:58:25] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs1017.eqiad.wmnet [17:58:49] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10925148 (10BTullis) [17:59:18] (03PS1) 10Herron: admin: set krb present for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1160226 (https://phabricator.wikimedia.org/T397004) [17:59:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10925156 (10BCornwall) [17:59:59] (03CR) 10Dwisehaupt: nsca_frack.cfg.erb break out trino hostgroup, add trino API check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [18:00:06] hashar and brennen: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T1800) [18:00:22] (03CR) 10Herron: [C:03+2] admin: set krb present for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1160226 (https://phabricator.wikimedia.org/T397004) (owner: 10Herron) [18:00:24] o/ [18:00:56] running train in a few minutes, post eating-a-sandwich operations. 
[18:01:26] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Degraded RAID on analytics1073 - https://phabricator.wikimedia.org/T397231#10925174 (10BTullis) p:05Triage→03Medium [18:01:39] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#10925175 (10BTullis) p:05Triage→03High [18:01:46] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [18:01:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10925176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... [18:02:13] !log aokoth@cumin1002 START - Cookbook sre.hosts.decommission for hosts doc1003.eqiad.wmnet [18:02:34] (03PS2) 10Jgreen: nsca_frack.cfg.erb break out trino hostgroup, add trino API check [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) [18:02:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1017:enp94s0f0np0 (Equinix, 21989995) {#20220409}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:03:03] (03CR) 10Jgreen: nsca_frack.cfg.erb break out trino hostgroup, add trino API check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [18:03:04] hmmmm [18:03:13] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for AndyRussG - 
https://phabricator.wikimedia.org/T397004#10925179 (10herron) 05Open→03Resolved a:03herron The requested access has been merged and will be fully deployed within 30 minutes. I'll go a... [18:04:29] (03CR) 10Dwisehaupt: [C:03+2] "shipit." [puppet] - 10https://gerrit.wikimedia.org/r/1160205 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [18:06:40] ^ https://netbox.wikimedia.org/extras/changelog/229367/ [18:06:56] !log aokoth@cumin1002 START - Cookbook sre.dns.netbox [18:07:40] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160229 (https://phabricator.wikimedia.org/T392176) [18:07:42] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160229 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [18:08:21] arnoldokoth: if you seen any unexpected diff on the sre.dns.netbox cookbook that you were not expecting [18:08:29] please ping us (brett or myself) [18:08:39] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160229 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [18:09:06] (03CR) 10Dzahn: [C:03+2] phabricator::migration: add /etc/phabricator/script-vars for scap [puppet] - 10https://gerrit.wikimedia.org/r/1160217 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [18:09:12] netbox caches are syncing now for my decomm of lvs1017 [18:09:31] sukhe: Ack. [18:09:39] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10925199 (10AndyRussG_volunteer) Yayyyy works great! Thanks so much, @herron, @KFrancis, @WMDE-leszek, @Dzahn! 
:D [18:10:59] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [18:11:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10925201 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS b... [18:11:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T382778)', diff saved to https://phabricator.wikimedia.org/P78231 and previous config saved to /var/cache/conftool/dbconfig/20250617-181144-ladsgroup.json [18:11:49] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [18:12:00] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [18:12:03] !log brett@cumin2002 START - Cookbook sre.dns.netbox [18:12:32] aokoth@cumin1002 decommission (PID 3874428) is awaiting input [18:12:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1017:enp94s0f0np0 (Equinix, 21989995) {#20220409}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:13:14] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2152.codfw.wmnet with reason: Maintenance [18:13:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T382778)', diff saved to https://phabricator.wikimedia.org/P78232 and previous config saved to /var/cache/conftool/dbconfig/20250617-181321-ladsgroup.json [18:15:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 
(T382778)', diff saved to https://phabricator.wikimedia.org/P78233 and previous config saved to /var/cache/conftool/dbconfig/20250617-181552-ladsgroup.json [18:15:54] !log aokoth@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002" [18:16:13] !log brett@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:16:14] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts lvs1017.eqiad.wmnet [18:16:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10925216 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `lvs1017.eqiad.wmnet` - lvs1017.eqiad.wmnet (**PASS**) - Downtimed hos... [18:16:53] !log aokoth@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002" [18:16:53] !log aokoth@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:16:54] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doc1003.eqiad.wmnet [18:17:36] (03CR) 10AOkoth: "Decom cookbook is complete so I'll merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1160225 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [18:18:02] !log brett@cumin2002 START - Cookbook sre.dns.netbox [18:18:10] (03CR) 10AOkoth: [C:03+2] site: decom doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/1160225 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [18:18:29] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.6 refs T392176 [18:18:34] T392176: 1.45.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T392176 [18:20:38] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:24:13] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) [18:24:17] T377889: install a service on phab1005 - https://phabricator.wikimedia.org/T377889 [18:24:19] (03PS11) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [18:24:31] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) (duration: 00m 18s) [18:24:54] (03PS1) 10Cwhite: logstash: relocate ecs-default indexes to hdd nodes after 8 weeks [puppet] - 10https://gerrit.wikimedia.org/r/1160230 (https://phabricator.wikimedia.org/T390215) [18:26:33] (03PS3) 10BCornwall: Promote lvs1016 over lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) [18:26:48] (03CR) 10CI reject: [V:04-1] Promote lvs1016 over lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [18:27:31] vriley@cumin1002 reimage (PID 3882225) is awaiting input [18:28:00] (03CR) 10Btullis: [C:03+1] airflow-dev: increase the memory limits of the webserver in devenvs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1160132 (https://phabricator.wikimedia.org/T394297) (owner: 10Brouberol) [18:29:29] (03PS4) 10BCornwall: Promote lvs1016 over lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) [18:30:20] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6003/co" [puppet] - 10https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [18:31:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P78234 and previous config saved to /var/cache/conftool/dbconfig/20250617-183059-ladsgroup.json [18:33:11] (03PS12) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [18:36:53] (03CR) 10AOkoth: miscweb: add os-reports update mechanism (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:37:00] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10925247 (10DerHexer) >>! In T397099#10925045, @herron wrote: >>>! In T397099#10925025, @gerritbot wrote: >> Change #1160216 had a related patch set uploaded (by Herron; author... 
[18:39:03] (03PS1) 10Dzahn: phabricator::migration: add missing sudo_defaults file for scap [puppet] - 10https://gerrit.wikimedia.org/r/1160231 (https://phabricator.wikimedia.org/T377889) [18:39:32] (03CR) 10CI reject: [V:04-1] phabricator::migration: add missing sudo_defaults file for scap [puppet] - 10https://gerrit.wikimedia.org/r/1160231 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [18:41:01] (03PS2) 10Dzahn: phabricator::migration: add missing sudo_defaults file for scap [puppet] - 10https://gerrit.wikimedia.org/r/1160231 (https://phabricator.wikimedia.org/T377889) [18:42:12] (03CR) 10Ssingh: [C:03+1] "Let's do it :)" [puppet] - 10https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [18:43:20] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10925293 (10Dzahn) >>! In T397099#10920594, @Astinson wrote: > DerHexer is a long-trusted Steward that wants access to some of the data that is available through Central Notice... [18:45:32] (03CR) 10Dzahn: [C:03+2] phabricator::migration: add missing sudo_defaults file for scap [puppet] - 10https://gerrit.wikimedia.org/r/1160231 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [18:45:43] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10925306 (10DerHexer) >>! In T397099#10925292, @Dzahn wrote: >>>! In T397099#10920594, @Astinson wrote: >> DerHexer is a long-trusted Steward that wants access to some of the d... 
[18:46:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P78235 and previous config saved to /var/cache/conftool/dbconfig/20250617-184606-ladsgroup.json [18:47:33] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10925312 (10Dzahn) @DerHexer Yea, I definitely believe you and I can also confirm of course you are a long-trusted volunteer and steward. It's just that WMF has not just one ty... [18:54:12] (03CR) 10BCornwall: [V:03+1 C:03+2] Promote lvs1016 over lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [18:54:34] (03CR) 10BCornwall: [V:03+2 C:03+2] "Suspected issues with pcc, we're going forward anyway" [puppet] - 10https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [18:58:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10925320 (10BCornwall) [18:59:17] !log Restarting pybal on lvs1016, setting it to primary - T387145 [18:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:22] T387145: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145 [19:01:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T382778)', diff saved to https://phabricator.wikimedia.org/P78236 and previous config saved to /var/cache/conftool/dbconfig/20250617-190113-ladsgroup.json [19:01:18] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [19:01:29] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2154.codfw.wmnet with reason: Maintenance [19:01:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T382778)', diff saved to 
https://phabricator.wikimedia.org/P78237 and previous config saved to /var/cache/conftool/dbconfig/20250617-190136-ladsgroup.json [19:01:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10925328 (10BCornwall) [19:04:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T382778)', diff saved to https://phabricator.wikimedia.org/P78238 and previous config saved to /var/cache/conftool/dbconfig/20250617-190453-ladsgroup.json [19:05:59] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160232 [19:06:25] (03CR) 10Btullis: [C:03+2] Allow blunderbuss to contact archiva [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159579 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis) [19:08:03] (03Merged) 10jenkins-bot: Allow blunderbuss to contact archiva [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159579 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis) [19:12:36] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) [19:12:40] T377889: install a service on phab1005 - https://phabricator.wikimedia.org/T377889 [19:12:46] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) (duration: 00m 10s) [19:13:16] brennen: looks promising. yay [19:13:44] way closer [19:14:14] ok [19:15:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10925361 (10BCornwall) [19:15:18] https://phabricator.wikimedia.org/P78239 [19:15:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10925367 (10BCornwall) @VRiley-WMF Okay! 
We've reimaged lvs1016 as the new primary and have lvs1020 as secondary. lvs1017 has been decommissioned and is ready to be removed/ser... [19:17:39] (03CR) 10Cwhite: [C:03+2] logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite) [19:19:55] (03PS1) 10Dzahn: zuul::main: add rsyslog logging config snippet [puppet] - 10https://gerrit.wikimedia.org/r/1160234 (https://phabricator.wikimedia.org/T395938) [19:20:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P78240 and previous config saved to /var/cache/conftool/dbconfig/20250617-192001-ladsgroup.json [19:24:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10925414 (10BCornwall) [19:25:12] (03PS2) 10Kimberly Sarabia: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 (https://phabricator.wikimedia.org/T393944) (owner: 10Bernard Wang) [19:26:26] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Adding wikikube-worker-exp2001 - jiji@cumin1002 - T397051" [19:26:31] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Adding wikikube-worker-exp2001 - jiji@cumin1002 - T397051" [19:26:34] T397051: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051 [19:31:41] (03CR) 10Elukey: "Thanks! 
IIUC they explicitly wanted two separate SLOs for this, I wasn't part of the conversation but they deeply care about ratio and the" [puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [19:33:33] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:02] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10925446 (10KFrancis) Hi all, there is an NDA on file, but it was for Wikimania a couple years ago. We'd need a new one for LDAP access. @DerHexer, please send me your mailin... [19:35:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P78241 and previous config saved to /var/cache/conftool/dbconfig/20250617-193508-ladsgroup.json [19:35:31] (03CR) 10Dzahn: [C:03+2] "https://wikitech.wikimedia.org/wiki/Rsyslog" [puppet] - 10https://gerrit.wikimedia.org/r/1160234 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:36:35] (03CR) 10LorenMora: [C:03+1] Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 (https://phabricator.wikimedia.org/T393944) (owner: 10Bernard Wang) [19:37:49] (03CR) 10Dzahn: [C:03+2] "Will most certainly need adjustment. Just adding it to get out of the way how to even configure it and what we are using." 
[puppet] - 10https://gerrit.wikimedia.org/r/1160234 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:39:28] (03PS3) 10Bernard Wang: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 (https://phabricator.wikimedia.org/T395634) [19:39:46] (03PS4) 10Bernard Wang: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 (https://phabricator.wikimedia.org/T393944) [19:39:48] (03PS1) 10Effie Mouzeli: mediawiki_experimental: $kubernetes_release dir fix [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) [19:42:00] (03CR) 10CI reject: [V:04-1] mediawiki_experimental: $kubernetes_release dir fix [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [19:44:26] (03CR) 10Herron: [C:03+1] "sure yes lets try it. what do you think of a name like success-ratio or 200-ratio to help clarify?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1160180 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) [19:48:41] (03PS2) 10Effie Mouzeli: mediawiki_experimental: $kubernetes_release dir fix [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) [19:48:56] (03PS3) 10Effie Mouzeli: mediawiki_experimental: $kubernetes_release dir fix [puppet] - 10https://gerrit.wikimedia.org/r/1160236 (https://phabricator.wikimedia.org/T396767) [19:50:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T382778)', diff saved to https://phabricator.wikimedia.org/P78242 and previous config saved to /var/cache/conftool/dbconfig/20250617-195017-ladsgroup.json [19:50:22] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2163.codfw.wmnet with reason: Maintenance [19:50:23] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [19:50:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T382778)', diff saved to https://phabricator.wikimedia.org/P78243 and previous config saved to /var/cache/conftool/dbconfig/20250617-195029-ladsgroup.json [19:51:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:53:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T382778)', diff saved to https://phabricator.wikimedia.org/P78244 and previous config saved to /var/cache/conftool/dbconfig/20250617-195347-ladsgroup.json [19:54:00] (03PS1) 10Effie Mouzeli: site.pp: make wikikube-worker-exp2001 a k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1160238 (https://phabricator.wikimedia.org/T276994) [19:54:05] (03CR) 10Kimberly Sarabia: "Done" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1159542 (https://phabricator.wikimedia.org/T393944) (owner: 10Bernard Wang) [19:55:19] (03PS1) 10Cwhite: logstash: add tests to normalize_labels [puppet] - 10https://gerrit.wikimedia.org/r/1160239 (https://phabricator.wikimedia.org/T368956) [19:55:21] (03PS1) 10Cwhite: logstash: add tests to dot_expander [puppet] - 10https://gerrit.wikimedia.org/r/1160240 (https://phabricator.wikimedia.org/T368956) [19:57:31] (03CR) 10CI reject: [V:04-1] logstash: add tests to dot_expander [puppet] - 10https://gerrit.wikimedia.org/r/1160240 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite) [19:57:34] (03PS15) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [19:57:43] (03CR) 10CI reject: [V:04-1] logstash: add tests to normalize_labels [puppet] - 10https://gerrit.wikimedia.org/r/1160239 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T2000). [20:00:05] ebernhardson, bwang, JSherman, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [20:00:14] here [20:00:18] \o [20:02:16] ebernhardson: it looks like you're first in line; are you self deploying? 
[20:02:24] i suppose i can, sec [20:03:21] (03PS2) 10Cwhite: logstash: add tests to normalize_labels [puppet] - 10https://gerrit.wikimedia.org/r/1160239 (https://phabricator.wikimedia.org/T368956) [20:03:22] (03PS2) 10Cwhite: logstash: add tests to dot_expander [puppet] - 10https://gerrit.wikimedia.org/r/1160240 (https://phabricator.wikimedia.org/T368956) [20:03:28] FIRING: SystemdUnitFailed: wmf_auto_restart_ipmiseld.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155738 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:04:34] (03CR) 10Cwhite: "I intend to begin testing this on beta-logs on 2025-06-23." [puppet] - 10https://gerrit.wikimedia.org/r/1154348 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:04:35] (03Merged) 10jenkins-bot: cirrussearch: return traffic to all DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155738 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:04:58] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1155738|cirrussearch: return traffic to all DCs (T388610)]] [20:05:03] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [20:05:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.02%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [20:06:11] o/ [20:06:29] hello [20:06:47] "syncing to baremetal masters" is where we put on AC/DC and headbang for a bit [20:07:14] haha [20:07:14] !log 
ebernhardson@deploy1003 bking, ebernhardson: Backport for [[gerrit:1155738|cirrussearch: return traffic to all DCs (T388610)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:36] i'm here for 1159542 (bwang is out) [20:08:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P78245 and previous config saved to /var/cache/conftool/dbconfig/20250617-200854-ladsgroup.json [20:09:19] !log ebernhardson@deploy1003 Sync cancelled. [20:09:32] mine not working as expected, will have to adjust something. On to the next [20:10:41] ebernhardson: I see you cancelled the sync; do you need to revert too? [20:10:58] JSherman: ahh, for some reason i was expecting this thing to do all the pieces :P sure one sec [20:11:03] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:12:02] (03PS1) 10Ebernhardson: Revert "cirrussearch: return traffic to all DCs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160247 [20:12:09] (03CR) 10Ebernhardson: [C:03+2] Revert "cirrussearch: return traffic to all DCs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160247 (owner: 10Ebernhardson) [20:13:00] (03Merged) 10jenkins-bot: Revert "cirrussearch: return traffic to all DCs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160247 (owner: 10Ebernhardson) [20:14:21] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [20:14:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10925582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host 
an-worker1186.eqiad.wmnet with OS bulls... [20:14:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:25] i haven't reverted before, just so I know the process: cancelling spiderpig does the revert on the actual servers, we just have to merge a revert in gerrit to make sure gerrit HEAD matches what's actually deployed? [20:17:17] cscott: hmm, thats a good question and i'm not completely sure :S In the old system i suppose you would revert and pull to the deployment host (i've done the same here) [20:17:36] but you wouldn't scap out the revert, since the initial deploy was never shipped [20:18:19] Cancelling scap doesn't revert anything for you. [20:18:32] it looks like the patch is still on the testservers as well? [20:18:50] Nod. If you want to undo, merge and deploy a revert commit. [20:19:01] ^ [20:19:14] good to know. [20:19:57] so it sounds like i should run the revert through scap deploy? doing now [20:20:00] spiderpig might want to put up a flashing warning to that effect, I was under the same impression as ebernhardson that it would clean up fully if i cancelled [20:20:33] cscott: We have some open tickets to improve things in this area. Thanks for the feedback! [20:20:37] Any opinions on the slave_sql_lag alert for db2141? Shall I depool as mentioned in the runbook? [20:20:57] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1160247|Revert "cirrussearch: return traffic to all DCs"]] [20:21:58] brett: I see https://sal.toolforge.org/log/qOxGfJcB8tZ8Ohr0-20q [20:22:08] where it was downtimed earlier [20:22:23] perhaps confirm that and just extend the downtime I guess? 
[20:23:20] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1160247|Revert "cirrussearch: return traffic to all DCs"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:24:03] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [20:24:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P78246 and previous config saved to /var/cache/conftool/dbconfig/20250617-202403-ladsgroup.json [20:30:56] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160247|Revert "cirrussearch: return traffic to all DCs"]] (duration: 09m 59s) [20:31:44] ok, now it's done and ready for the next [20:32:01] kimberly_sarabia: i think you're next [20:32:11] ok i'm here [20:32:15] marostegui: I'm going to re-downtime for you [20:32:52] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: marostegui maintenance [20:32:58] kimberly_sarabia: are you self deploying? [20:33:02] no [20:33:13] are there any deployers around? [20:33:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1186.eqiad.wmnet with OS bullseye [20:33:22] mmk, I'm happy to deploy for you. I also have a config change. Can I do them together? [20:33:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10925614 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1186.eqiad.wmnet with OS... [20:33:40] JSherman: yep [20:33:49] thanks so much [20:33:51] mmk, here goes! 
[20:34:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 (https://phabricator.wikimedia.org/T393944) (owner: 10Bernard Wang) [20:34:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160206 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [20:34:31] brett: thank you! [20:35:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.336s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:35:37] (03Merged) 10jenkins-bot: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 (https://phabricator.wikimedia.org/T393944) (owner: 10Bernard Wang) [20:35:42] (03Merged) 10jenkins-bot: undeploy enwiki Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160206 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [20:35:47] ^Seems to be going down again [20:36:04] !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1159542|Enable new mobile search experience everywhere (not including empty search recommendations) (T393944)]], [[gerrit:1160206|undeploy enwiki Patroller Tools surveys (T396250)]] [20:36:10] T393944: Implement mobile updates to Codex TAHS - https://phabricator.wikimedia.org/T393944 [20:36:10] T396250: Deploy remaining Patroller Tools surveys - https://phabricator.wikimedia.org/T396250 [20:38:17] !log jsn@deploy1003 bwang, jsn: Backport for [[gerrit:1159542|Enable new mobile search 
experience everywhere (not including empty search recommendations) (T393944)]], [[gerrit:1160206|undeploy enwiki Patroller Tools surveys (T396250)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:38:59] kimberly_sarabia: please test [20:39:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T382778)', diff saved to https://phabricator.wikimedia.org/P78247 and previous config saved to /var/cache/conftool/dbconfig/20250617-203910-ladsgroup.json [20:39:12] JSherman: ok one moment [20:39:15] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [20:39:26] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2164.codfw.wmnet with reason: Maintenance [20:39:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T382778)', diff saved to https://phabricator.wikimedia.org/P78248 and previous config saved to /var/cache/conftool/dbconfig/20250617-203933-ladsgroup.json [20:39:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:39:56] verified that my enwiki undeploy worked as expected [20:40:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.336s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:40:33] JSherman: LGTM. 
Thank you [20:40:39] !log jsn@deploy1003 bwang, jsn: Continuing with sync [20:42:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T382778)', diff saved to https://phabricator.wikimedia.org/P78249 and previous config saved to /var/cache/conftool/dbconfig/20250617-204250-ladsgroup.json [20:47:29] !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159542|Enable new mobile search experience everywhere (not including empty search recommendations) (T393944)]], [[gerrit:1160206|undeploy enwiki Patroller Tools surveys (T396250)]] (duration: 11m 25s) [20:47:35] T393944: Implement mobile updates to Codex TAHS - https://phabricator.wikimedia.org/T393944 [20:47:35] T396250: Deploy remaining Patroller Tools surveys - https://phabricator.wikimedia.org/T396250 [20:47:54] kimberly_sarabia: all done [20:48:03] cscott: your turn! [20:48:04] JSherman: Thank you! [20:48:15] kimberly_sarabia: np! [20:48:42] cscott: are you good to self deploy, or would you like me to deploy for you? [20:50:26] i can self deploy, spiderpig is fun [20:50:31] :) [20:50:38] ^ 100%! :-) [20:51:06] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]] [20:51:11] T393400: Create metric to measure Parsoid speed on small/medium/large pages - https://phabricator.wikimedia.org/T393400 [20:51:27] that merged super fast, was jenkins working ahead or something? [20:52:32] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Map dumps HTTPS traffic as low-priority for QoS - https://phabricator.wikimedia.org/T397153#10925689 (10xcollazo) >That said we you can see that many of the busiest times - as seen on the Grafana throughput graph - correlate with times when... [20:53:05] oh :( it picked the earlier backport to wmf.6, not the backport to wmf.6 I wanted to do. 
[20:53:16] should i cancel scap or let it run, it's effectively doing nothing I think [20:53:19] !log cscott@deploy1003 cscott: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:54:01] PROBLEM - Host ms-fe1016 is DOWN: PING CRITICAL - Packet loss = 100% [20:54:12] !log cscott@deploy1003 cscott: Continuing with sync [20:54:37] jhancock@cumin2002 reimage (PID 1685032) is awaiting input [20:55:14] i'm letting it do its no-op re-deploy of 1160127, then I'll start 1160210 which is what i meant to do. [20:56:04] I would think that would be fine since the commit is already in the git commit graph [20:56:35] RECOVERY - Host ms-fe1016 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [20:57:02] dancy: could probably give a more authoritative answer to that though [20:57:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P78250 and previous config saved to /var/cache/conftool/dbconfig/20250617-205758-ladsgroup.json [20:58:20] Reading... [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250617T2100) [21:01:10] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160127|stats: Add buckets based on wikitext size; fix increment bug (T393400)]] (duration: 10m 04s) [21:01:15] T393400: Create metric to measure Parsoid speed on small/medium/large pages - https://phabricator.wikimedia.org/T393400 [21:01:45] cscott: The no-op deploy is fine. It would have been okay to cancel too if you know that the prior deployment had run to completion.
[21:02:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/Linter] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1160210 (https://phabricator.wikimedia.org/T393400) (owner: 10C. Scott Ananian) [21:02:31] dancy: yeah, i didn't want to take any chances. i've got enough t-shirts. :) [21:02:39] anyway, "real" backport started now, to wmf.5 [21:02:41] Nod. Perfectly reasonable [21:03:26] (03Merged) 10jenkins-bot: stats: Add buckets based on wikitext size; fix increment bug [extensions/Linter] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1160210 (https://phabricator.wikimedia.org/T393400) (owner: 10C. Scott Ananian) [21:03:52] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1160210|stats: Add buckets based on wikitext size; fix increment bug (T393400)]] [21:06:20] !log cscott@deploy1003 cscott: Backport for [[gerrit:1160210|stats: Add buckets based on wikitext size; fix increment bug (T393400)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:06:25] T393400: Create metric to measure Parsoid speed on small/medium/large pages - https://phabricator.wikimedia.org/T393400 [21:07:22] testing [21:08:57] looks good, continuing [21:09:02] !log cscott@deploy1003 cscott: Continuing with sync [21:11:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.516s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:13:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P78251 and previous config saved to /var/cache/conftool/dbconfig/20250617-211305-ladsgroup.json [21:16:02] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160210|stats: Add buckets based on wikitext size; fix increment bug (T393400)]] (duration: 12m 09s) [21:16:06] T393400: Create metric to measure Parsoid speed on small/medium/large pages - https://phabricator.wikimedia.org/T393400 [21:16:10] ok all done [21:16:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:17:03] dancy: thanks! 
[21:21:14] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:24:36] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-test-coord1002 - jclark@cumin1002" [21:24:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-test-coord1002 - jclark@cumin1002" [21:24:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:25:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:25:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:25:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:25:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10925787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1186.eqiad.wmnet with OS bul... 
[21:26:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:26:57] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) [21:27:01] T377889: install a service on phab1005 - https://phabricator.wikimedia.org/T377889 [21:27:04] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) (duration: 00m 07s) [21:27:16] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:28:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T382778)', diff saved to https://phabricator.wikimedia.org/P78252 and previous config saved to /var/cache/conftool/dbconfig/20250617-212813-ladsgroup.json [21:28:18] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [21:28:29] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2165.codfw.wmnet with reason: Maintenance [21:28:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T382778)', diff saved to https://phabricator.wikimedia.org/P78253 and previous config saved to /var/cache/conftool/dbconfig/20250617-212835-ladsgroup.json [21:29:29] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) [21:29:36] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) (duration: 00m 07s) [21:30:30] jclark@cumin1002 provision (PID 3914324) is awaiting input [21:31:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set 
policy FORCE_RESTART [21:31:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T382778)', diff saved to https://phabricator.wikimedia.org/P78254 and previous config saved to /var/cache/conftool/dbconfig/20250617-213153-ladsgroup.json [21:36:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:36:24] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:36:34] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:37:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:38:16] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:42:26] (03PS1) 10Dzahn: phabricator::migration: include phabricator::httpd, add /srv/phab dir [puppet] - 10https://gerrit.wikimedia.org/r/1160270 (https://phabricator.wikimedia.org/T377889) [21:42:41] (03CR) 10CI reject: [V:04-1] phabricator::migration: include phabricator::httpd, add /srv/phab dir [puppet] - 10https://gerrit.wikimedia.org/r/1160270 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:42:59] (03PS2) 10Dzahn: phabricator::migration: include phabricator::httpd, add /srv/phab dir [puppet] - 10https://gerrit.wikimedia.org/r/1160270 (https://phabricator.wikimedia.org/T377889) [21:44:22] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:44:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART [21:45:19] (03CR) 10Dzahn: [C:03+2] phabricator::migration: include phabricator::httpd, add /srv/phab dir [puppet] - 10https://gerrit.wikimedia.org/r/1160270 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [21:46:02] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:46:24] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:47:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P78255 and previous config saved to /var/cache/conftool/dbconfig/20250617-214659-ladsgroup.json [21:47:01] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:47:24] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:48:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:49:15] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:52:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:55:58] (03PS1) 10Dzahn: phabricator::migration: install PHP [puppet] - 10https://gerrit.wikimedia.org/r/1160280 (https://phabricator.wikimedia.org/T377889) [21:56:27] (03CR) 10CI reject: [V:04-1] phabricator::migration: install PHP [puppet] - 10https://gerrit.wikimedia.org/r/1160280 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) 
[21:58:38] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:59:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host an-test-master1004 [21:59:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:59:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-test-master1004 [22:00:32] (03PS2) 10Dzahn: phabricator::migration: install PHP [puppet] - 10https://gerrit.wikimedia.org/r/1160280 (https://phabricator.wikimedia.org/T377889) [22:01:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:02:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P78256 and previous config saved to /var/cache/conftool/dbconfig/20250617-220207-ladsgroup.json [22:02:51] (03PS3) 10Dzahn: phabricator::migration: install PHP [puppet] - 10https://gerrit.wikimedia.org/r/1160280 (https://phabricator.wikimedia.org/T377889) [22:05:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:05:39] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:06:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:06:15] (03PS4) 10Dzahn: phabricator::migration: install PHP [puppet] - 10https://gerrit.wikimedia.org/r/1160280 (https://phabricator.wikimedia.org/T377889) [22:09:05] !log jhancock@cumin2002 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:12:23] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:12:25] (03CR) 10Dzahn: [C:03+2] phabricator::migration: install PHP [puppet] - 10https://gerrit.wikimedia.org/r/1160280 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [22:12:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:13:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:13:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:14:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:14:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:14:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:14:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:15:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:16:09] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2165 (T382778)', diff saved to https://phabricator.wikimedia.org/P78257 and previous config saved to /var/cache/conftool/dbconfig/20250617-221714-ladsgroup.json [22:17:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:19] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [22:17:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2166.codfw.wmnet with reason: Maintenance [22:17:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T382778)', diff saved to https://phabricator.wikimedia.org/P78258 and previous config saved to /var/cache/conftool/dbconfig/20250617-221737-ladsgroup.json [22:18:17] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T382778)', diff saved to https://phabricator.wikimedia.org/P78259 and previous config saved to /var/cache/conftool/dbconfig/20250617-222053-ladsgroup.json [22:21:03] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:21:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:21:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:22:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host 
an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:23:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-master1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:23:37] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:24:08] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:25:36] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) [22:25:41] T377889: install a service on phab1005 - https://phabricator.wikimedia.org/T377889 [22:25:43] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) (duration: 00m 07s) [22:26:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10925977 (10Jhancock.wm) okay awesome. that's helpful. thank you! 
[22:26:26] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10925979 (10Jhancock.wm) 05Open→03Resolved [22:26:59] !log brennen@deploy1003 Started deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) [22:27:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154139 (owner: 10Krinkle) [22:27:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:27:56] (03PS6) 10Krinkle: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) [22:28:17] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:28:19] (03Merged) 10jenkins-bot: multiversion: Remove unused newFromDBName() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154139 (owner: 10Krinkle) [22:28:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-master1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:28:45] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1154139|multiversion: Remove unused newFromDBName()]] [22:30:26] (03PS1) 10Scott French: shellbox: migrate to bookworm-based httpd image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160223 (https://phabricator.wikimedia.org/T378128) [22:30:59] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1154139|multiversion: Remove unused newFromDBName()]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:31:16] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:31:40] (03PS1) 10Andrew Bogott: cinder: add service_user section to config [puppet] - 10https://gerrit.wikimedia.org/r/1160298 (https://phabricator.wikimedia.org/T396739) [22:32:00] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160298 (https://phabricator.wikimedia.org/T396739) (owner: 10Andrew Bogott) [22:32:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-master1003.eqiad.wmnet with OS bullseye [22:32:43] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-master1004.eqiad.wmnet with OS bullseye [22:32:48] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10925992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-test-master1003.eqiad.wmnet with OS bullseye [22:32:50] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10925993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-test-master1004.eqiad.wmnet with OS bullseye [22:33:19] !log krinkle@deploy1003 krinkle: Continuing with sync [22:34:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:34:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudcephosd2005 to codfw - jhancock@cumin2002" [22:34:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudcephosd2005 to codfw - jhancock@cumin2002" [22:34:46] !log jhancock@cumin2002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [22:34:57] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2005 [22:35:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:35:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2005 [22:35:10] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2006 [22:35:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2006 [22:35:22] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2007 [22:35:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2007 [22:35:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:36:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P78260 and previous config saved to /var/cache/conftool/dbconfig/20250617-223601-ladsgroup.json [22:36:05] (03CR) 10Andrew Bogott: [C:03+2] cinder: add service_user section to config [puppet] - 10https://gerrit.wikimedia.org/r/1160298 (https://phabricator.wikimedia.org/T396739) (owner: 10Andrew Bogott) [22:36:12] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-drmrs and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [22:36:17] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [22:36:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:37:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:37:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bullseye [22:37:39] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10925999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-test-coord1002.eqiad.wmnet with OS bullseye [22:37:57] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10926000 (10Jclark-ctr) a:03Jclark-ctr [22:38:51] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10926002 (10Jclark-ctr) [22:40:23] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154139|multiversion: Remove unused newFromDBName()]] (duration: 11m 37s) [22:43:27] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1004.eqiad.wmnet with reason: host reimage [22:43:56] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1003.eqiad.wmnet with reason: host reimage [22:46:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:47:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1004.eqiad.wmnet with reason: host reimage [22:48:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1002.eqiad.wmnet 
with reason: host reimage [22:51:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1003.eqiad.wmnet with reason: host reimage [22:51:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P78261 and previous config saved to /var/cache/conftool/dbconfig/20250617-225108-ladsgroup.json [22:51:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:52:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:53:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:53:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: host reimage [22:56:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:57:48] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on ms-fe1016:9290 - https://phabricator.wikimedia.org/T397261 (10phaultfinder) 03NEW [23:01:53] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:03:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
jclark@cumin1002" [23:03:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-master1004.eqiad.wmnet with OS bullseye [23:03:46] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10926076 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-test-master1004.eqiad.wmnet with OS bullseye completed: - a... [23:04:15] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10926077 (10Jclark-ctr) [23:05:16] (03PS1) 10Dzahn: phabricator::migration: ensure /srv/phab is the correct symlink [puppet] - 10https://gerrit.wikimedia.org/r/1160310 (https://phabricator.wikimedia.org/T377889) [23:05:37] (03CR) 10CI reject: [V:04-1] phabricator::migration: ensure /srv/phab is the correct symlink [puppet] - 10https://gerrit.wikimedia.org/r/1160310 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [23:05:39] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:05:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:05:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-master1003.eqiad.wmnet with OS bullseye [23:06:01] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10926087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-test-master1003.eqiad.wmnet with OS bullseye completed: - a... 
[23:06:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T382778)', diff saved to https://phabricator.wikimedia.org/P78262 and previous config saved to /var/cache/conftool/dbconfig/20250617-230616-ladsgroup.json [23:06:20] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [23:06:22] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10926088 (10Jclark-ctr) [23:06:25] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10926091 (10Jclark-ctr) 05Open→03Resolved [23:06:32] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2167.codfw.wmnet with reason: Maintenance [23:06:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T382778)', diff saved to https://phabricator.wikimedia.org/P78263 and previous config saved to /var/cache/conftool/dbconfig/20250617-230639-ladsgroup.json [23:07:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:07:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:07:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-coord1002.eqiad.wmnet with OS bullseye [23:07:57] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10926095 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-test-coord1002.eqiad.wmnet with OS bullseye completed: - an-tes... 
[23:08:19] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10926096 (10Jclark-ctr) [23:08:50] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10926097 (10Jclark-ctr) 05Open→03Resolved [23:09:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T382778)', diff saved to https://phabricator.wikimedia.org/P78264 and previous config saved to /var/cache/conftool/dbconfig/20250617-230959-ladsgroup.json [23:10:24] (03PS2) 10Dzahn: phabricator::migration: ensure /srv/phab is the correct symlink [puppet] - 10https://gerrit.wikimedia.org/r/1160310 (https://phabricator.wikimedia.org/T377889) [23:16:53] (03PS1) 10Krinkle: multiversion: Remove routing for former `deploymentwiki` in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160318 (https://phabricator.wikimedia.org/T198673) [23:23:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160318 (https://phabricator.wikimedia.org/T198673) (owner: 10Krinkle) [23:23:33] (03PS7) 10Krinkle: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) [23:24:05] (03Merged) 10jenkins-bot: multiversion: Remove routing for former `deploymentwiki` in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1160318 (https://phabricator.wikimedia.org/T198673) (owner: 10Krinkle) [23:24:27] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1160318|multiversion: Remove routing for former `deploymentwiki` in Beta (T198673 T289318)]] [23:24:33] T198673: Remove deployment.wikimedia.beta.wmflabs.org wiki (deploymentwiki) - https://phabricator.wikimedia.org/T198673 [23:24:33] T289318: Move *.beta.wmflabs.org to 
*.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [23:25:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P78265 and previous config saved to /var/cache/conftool/dbconfig/20250617-232506-ladsgroup.json [23:26:42] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1160318|multiversion: Remove routing for former `deploymentwiki` in Beta (T198673 T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:30:03] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f8d7b38]: re-test deploy to phab1005 for T377889 (once more, with feeling) (duration: 63m 03s) [23:30:08] T377889: install a service on phab1005 - https://phabricator.wikimedia.org/T377889 [23:31:31] !log krinkle@deploy1003 krinkle: Continuing with sync [23:31:39] 06SRE, 06collaboration-services, 10observability, 13Patch-For-Review: create a new place for prometheus/alertmanager checks not tied to physical machines - https://phabricator.wikimedia.org/T397264#10926155 (10Dzahn) [23:31:57] (03PS8) 10Krinkle: multiversion: Re-use prod for beta setSiteInfoForWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154140 (https://phabricator.wikimedia.org/T289318) [23:38:27] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160318|multiversion: Remove routing for former `deploymentwiki` in Beta (T198673 T289318)]] (duration: 14m 00s) [23:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1160325 [23:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1160325 (owner: 10TrainBranchBot) [23:38:32] T198673: Remove deployment.wikimedia.beta.wmflabs.org wiki (deploymentwiki) - https://phabricator.wikimedia.org/T198673 [23:38:33] 
T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318 [23:38:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:40:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P78266 and previous config saved to /var/cache/conftool/dbconfig/20250617-234013-ladsgroup.json [23:43:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:49:00] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1160325 (owner: 10TrainBranchBot) [23:55:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T382778)', diff saved to https://phabricator.wikimedia.org/P78267 and previous config saved to /var/cache/conftool/dbconfig/20250617-235521-ladsgroup.json [23:55:26] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [23:55:36] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2181.codfw.wmnet with reason: Maintenance [23:55:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T382778)', diff saved to https://phabricator.wikimedia.org/P78268 and previous config saved to /var/cache/conftool/dbconfig/20250617-235543-ladsgroup.json [23:59:01] 
!log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T382778)', diff saved to https://phabricator.wikimedia.org/P78269 and previous config saved to /var/cache/conftool/dbconfig/20250617-235900-ladsgroup.json