[00:08:25] FIRING: [9x] SystemdUnitFailed: apt-daily.service on lvs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:42:38] Acme-chief, Traffic, SRE, Patch-For-Review: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737#11906132 (Krinkle)
[04:08:40] FIRING: [9x] SystemdUnitFailed: apt-daily.service on lvs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:08:10] FYI, I'm rebooting acmechief2002 for T422596
[06:08:11] T422596: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596
[06:12:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[06:27:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[06:43:25] FIRING: [10x] SystemdUnitFailed: apt-daily-upgrade.service on lvs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:25] FIRING: [11x] SystemdUnitFailed: apt-daily-upgrade.service on lvs2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:09:59] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11906733 (cmooney) >>! In T424683#11885878, @ayounsi wrote: > Nice! > > We can also filter out the `.16386`, `.16384`, `.1...
[11:14:30] Traffic, bot-traffic-requests, SEO: Bing can't search images from Commons, is Wikimedia denying their requests? - https://phabricator.wikimedia.org/T425850#11907073 (AlexisJazz) >>! In T425850#11905943, @Aklapper wrote: > Have you tried to contact Bing why Bing behaves this way? I can imagine Wikime...
[12:41:35] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907442 (VRiley-WMF) Hey @ssingh Is it okay to make this change today?
[12:42:14] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907451 (VRiley-WMF) Also, I do apologize, I was planning on doing this today
[12:55:52] netops, Infrastructure-Foundations, SRE: Nokia SR-Linux: BFD broken with default homer configuration - https://phabricator.wikimedia.org/T425813#11907507 (cmooney) Open→Resolved Patch merged and config pushed to all Nokia devices now.
[13:16:48] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907565 (ssingh) >>! In T421421#11907442, @VRiley-WMF wrote: > Hey @ssingh Is it okay to make this change today? Yes, please, the host is not in service so you can start whenever...
[13:39:17] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907708 (VRiley-WMF) Open→In progress
[13:39:56] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11907712 (Fabfur)
[13:58:07] Traffic, Liberica, Prod-Kubernetes, Data-Platform-SRE (2026-04-24 - 2026-05-15), Kubernetes: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437#11907806 (BTullis) Open→Resolved I believe that this is now all complete.
[13:58:59] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907811 (VRiley-WMF)
[14:13:55] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907908 (VRiley-WMF)
[14:15:46] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11907926 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host lvs1017.eqiad.wmnet with OS bookworm
[14:32:45] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908107 (ssingh) ` Record: 410 Date/Time: 05/10/2026 04:22:34 Source: system Severity: Critical Description: A critical diagnostic event occurred in the memory device at B2. Conta...
[14:54:35] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908254 (Jhancock.wm) i pulled a replacement DIMM and a ssd from our offlined hosts. @ssingh safe to power down the host?
[14:55:33] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908259 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8830f0f1-94da-40cc-8ac8-4aef8e53c8f4) set by sukhe@cumin1003 for 1:00:00 on 1 host(s) and their services with reason: DI...
[14:55:58] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908262 (ssingh) >>! In T425890#11908254, @Jhancock.wm wrote: > i pulled a replacement DIMM and a ssd from our offlined hosts. > @ssingh safe to power down the host? @Jhancock.wm: Yes, please...
[14:57:21] o/ inspired by LE outage last week, we in wmcs were talking about getting spare google trust certs for toolforge / wmcloud.org just in case. we have separate acme-chief instances for those, and aiui will need to get google cloud credentials for them somehow? any other advice for that?
[14:58:25] taavi: as you probably may already know, it's completely free to use but you do need a GCP account
[14:58:44] for at least the prod infra, this is under the GCP account of Learning and Development
[14:58:53] sukhe: that much I knew, the rest I don't (which is why I'm here :P)
[14:59:03] you can ask Belinda to reach out to us (or talk to me about the process on getting access)
[14:59:47] essentially all that is required is a valid credit card, and that currently is under the L&D account
[15:00:19] is it just the initial setup that needs access to the google console, or would someone from us need access continuously?
[15:06:11] Traffic, OKR-Work, Test Kitchen (Experiment Platform Sprint 23): Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11908324 (KReid-WMF) Open→Resolved
[15:12:44] taavi: just one-time access to generate the acme account (which then goes in profile::acme_chief::accounts and the private key goes on puppetserver) and a valid credit card on file for the duration of the process (which will be taken care of L&D)
[15:13:50] s/taken care of/taken care of by
[15:17:34] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11908393 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host lvs1017.eqiad.wmnet with OS bookworm
[15:24:01] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908449 (Jhancock.wm) @ssingh replaced both. not seeing any errors in the idrac logs at this moment. You should be good to rebuild it.
[15:32:13] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908546 (ssingh) @Jhancock.wm: Thanks for the quick turnaround! Host is back and serving traffic, will keep a close watch for a bit before resolving this.
[15:33:36] Traffic, DC-Ops, ops-codfw, SRE: Degraded RAID on lvs2012 - https://phabricator.wikimedia.org/T425890#11908553 (Jhancock.wm)
[15:49:58] Traffic, MediaWiki-Platform-Team (Radar): Error 429 for search queries and images in older browsers - https://phabricator.wikimedia.org/T425763#11908671 (JTweed-WMF)
[16:23:22] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11908935 (RobH) Update from email: * finally got an answer back after escalating both on the ticket, via our dell sg team, and via the accounts payable folks @ dell sg who want to be paid for the m...
[16:34:00] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909017 (VRiley-WMF) @ssingh I have almost gotten it all the way through. However, it doesn't seem to take the reimage. It's seemingly is getting stuck at the raid. I tried to log...
[16:34:22] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909018 (VRiley-WMF) In progress→Open
[16:41:21] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909067 (ssingh) >>! In T421421#11909017, @VRiley-WMF wrote: > @ssingh I have almost gotten it all the way through. However, it doesn't seem to take the reimage. It's seemingly is...
[16:44:31] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11909075 (cmooney) >>! In T424611#11886556, @ayounsi wrote: > I suggest `core1` instead of `corebgp` but that lgtm! Yep that works :) > For v4 I'd have thought a /31 for a vlan used only between...
[17:15:33] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye
[17:40:10] netops, Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11909420 (cmooney) So anyway, for now I'd propose we add the following vlans for this: ` 341 core1-bw27-esams 342 core1-by27-esams 441 core1-22-ulsfo 442 core1-23-ulsfo 541 core1-603-eqsin 5...
[17:52:23] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909506 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*...
[17:53:19] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909511 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye
[18:25:57] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909629 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*...
[18:26:43] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909630 (ssingh) @VRiley-WMF: We may need to check this host; I can't seem to get it to come back up after a reboot (checked twice). Is there something else missing here? Perhaps...
[18:31:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11909651 (cmooney) >>! In T424683#11906733, @cmooney wrote: >>>! In T424683#11885878, @ayounsi wrote: >> Nice! >> >> We ca...
[18:36:41] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909673 (VRiley-WMF) checking, stand by
[18:44:13] Traffic: images are not loading for some users (on the us west coast?) - https://phabricator.wikimedia.org/T425670#11909722 (ssingh) I may be wrong but this was due to a temporary issue we had with upload.wikimedia.org in ulsfo, which matches the time of this report, and also matches the traffic from New Zea...
[18:44:44] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909724 (VRiley-WMF) Yes, it's getting stuck at the same spot I was getting stuck at. It looks like it's looking for a specific RAID.
[18:51:17] Traffic, MediaWiki-Platform-Team (Radar): Error 429 for search queries and images in older browsers - https://phabricator.wikimedia.org/T425763#11909773 (ssingh) Hi @BrokenImages1234, thanks for your report. The error report you indicated, `76af2b0`, does indeed point to the issue on why you are seeing 4...
[18:56:33] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye
[19:04:51] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909801 (Jclark-ctr) @ssingh you are booting with UEFI? the YAML file need to be updated for lvs1017 -partman/standard-efi.cfg -partman/raid1-2dev-efi.cfg
[19:07:20] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909810 (ssingh) >>! In T421421#11909801, @Jclark-ctr wrote: > @ssingh you are booting with UEFI? > > the YAML file need to be updated for lvs1017 > > -partman/standard-efi.cfg...
[19:12:00] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909839 (ssingh) ` Forced UEFI HTTP Boot for next reboot Resetting chassis power status for lvs1017 to ForceRestart Host rebooted via Redfish `
[19:44:45] Traffic, DC-Ops, ops-eqiad, SRE: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11909962 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors: - lvs1017 (**FAIL*...