[00:23:55] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:27:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:34:14] FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:38:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1249076 [00:38:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1249076 (owner: 10TrainBranchBot) [00:49:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:51:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1249076 (owner: 10TrainBranchBot) [01:08:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1249077 [01:08:53] 
(03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1249077 (owner: 10TrainBranchBot) [01:27:57] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1249077 (owner: 10TrainBranchBot) [01:54:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:08:40] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 07m 58s) [02:08:55] FIRING: [6x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:33:55] FIRING: [6x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:45:55] FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:25] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [04:23:09] 06SRE, 06Traffic: Anycast ns[01].wikimedia.org for IPv4 - https://phabricator.wikimedia.org/T366193#11686108 (10cmooney) @ssingh in terms of the IPv6 anycast plans what is the current situation? I notice some patches like [[ https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1238015 | this one ]] have... [04:24:14] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:49:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:46:49] (03PS1) 10Kevin Bazira: ml-services: add embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249081 (https://phabricator.wikimedia.org/T418976) [05:48:38] (03PS2) 10Kevin Bazira: ml-services: add embeddings-staging isvc [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1249081 (https://phabricator.wikimedia.org/T418976) [05:54:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:56:52] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11686138 (10VRiley-WMF) a:03VRiley-WMF [06:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:28:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:30:14] (03CR) 10Muehlenhoff: [C:03+2] Add a pbuilder hook to build against the ICU72 backport [puppet] - 10https://gerrit.wikimedia.org/r/1248797 (https://phabricator.wikimedia.org/T419058) (owner: 10Muehlenhoff) [06:34:14] FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:36:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11686146 (10VRiley-WMF) This server doesn't support anymore. However, we have plenty of 4TB hard drives available to swap out. @BTullis is there an acceptable time to replace this hard drive? Let us kno... 
[06:43:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:41] FIRING: [2x] SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:13] (PS1) Ayounsi: codfw routed ganeti, unset neighbors_list [puppet] - https://gerrit.wikimedia.org/r/1249083
[06:55:11] (PS2) Ayounsi: codfw routed ganeti, unset neighbors_list [puppet] - https://gerrit.wikimedia.org/r/1249083
[06:56:29] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1249083 (owner: Ayounsi)
[06:58:14] (CR) Ayounsi: [V:+1] "PCC lgtm" [puppet] - https://gerrit.wikimedia.org/r/1249083 (owner: Ayounsi)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260308T0800)
[07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T0700).
[07:00:05] Msz2001: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:29] o/
[07:00:37] Starting to deploy
[07:00:45] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1248806 (https://phabricator.wikimedia.org/T419111) (owner: Mszwarc)
[07:00:45] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248821 (https://phabricator.wikimedia.org/T417880) (owner: Mszwarc)
[07:01:38] (Merged) jenkins-bot: Set $wgOATH2FARequiredGroupRemovalPages for interface-admins [mediawiki-config] - https://gerrit.wikimedia.org/r/1248821 (https://phabricator.wikimedia.org/T417880) (owner: Mszwarc)
[07:01:59] (Merged) jenkins-bot: Add a script to send mandatory 2FA Echo notification [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1248806 (https://phabricator.wikimedia.org/T419111) (owner: Mszwarc)
[07:02:23] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1248806|Add a script to send mandatory 2FA Echo notification (T419111)]], [[gerrit:1248821|Set $wgOATH2FARequiredGroupRemovalPages for interface-admins (T417880)]]
[07:02:29] T419111: Send Echo notification to 2FA-less users who are required to have 2FA - https://phabricator.wikimedia.org/T419111
[07:02:29] T417880: Set OATH2FARequiredGroupRemovalPages value for Wikimedia cluster - https://phabricator.wikimedia.org/T417880
[07:22:30] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1248806|Add a script to send mandatory 2FA Echo notification (T419111)]], [[gerrit:1248821|Set $wgOATH2FARequiredGroupRemovalPages for interface-admins (T417880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:22:36] T419111: Send Echo notification to 2FA-less users who are required to have 2FA - https://phabricator.wikimedia.org/T419111 [07:22:36] T417880: Set OATH2FARequiredGroupRemovalPages value for Wikimedia cluster - https://phabricator.wikimedia.org/T417880 [07:23:03] !log mszwarc@deploy2002 mszwarc: Continuing with sync [07:24:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1249083 (owner: 10Ayounsi) [07:28:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1248592 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [07:28:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:28:44] (03CR) 10Muehlenhoff: [C:03+2] netbox: Add ulsfo02 [puppet] - 10https://gerrit.wikimedia.org/r/1248528 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [07:30:11] (03CR) 10Ayounsi: [V:03+1 C:03+2] ulsfo route ganeti use core routers interface IPs [puppet] - 10https://gerrit.wikimedia.org/r/1248592 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [07:30:48] (03CR) 10Ayounsi: [V:03+1 C:03+2] codfw routed ganeti, unset neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/1249083 (owner: 10Ayounsi) [07:32:19] (03CR) 10Ozge: [C:03+2] ml-services: add embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249081 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [07:34:02] (03Merged) 10jenkins-bot: ml-services: add embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249081 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [07:34:17] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - 
https://phabricator.wikimedia.org/T286066#11686188 (10ABran-WMF) [07:37:04] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248806|Add a script to send mandatory 2FA Echo notification (T419111)]], [[gerrit:1248821|Set $wgOATH2FARequiredGroupRemovalPages for interface-admins (T417880)]] (duration: 34m 41s) [07:37:09] T419111: Send Echo notification to 2FA-less users who are required to have 2FA - https://phabricator.wikimedia.org/T419111 [07:37:10] T417880: Set OATH2FARequiredGroupRemovalPages value for Wikimedia cluster - https://phabricator.wikimedia.org/T417880 [07:37:10] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [07:39:45] RECOVERY - Host rpki2003 is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms [07:39:45] RECOVERY - Host netflow2004 is UP: PING OK - Packet loss = 0%, RTA = 32.41 ms [07:41:26] FIRING: RoutinatorValidROAs: Important drop of valid Routinator ROAs on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#Valid_ROAs_decreasing - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorValidROAs [07:42:10] FIRING: [9x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [07:43:55] RESOLVED: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:45:41] FIRING: [2x] SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:22] FIRING: GnmiTargetDown: cr2-eqdfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [07:51:16] (03PS1) 10Ayounsi: ulsfo routed ganeti fix v6 typo in neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/1249092 [07:56:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1249092 (owner: 10Ayounsi) [07:56:26] RESOLVED: RoutinatorValidROAs: Important drop of valid Routinator ROAs on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#Valid_ROAs_decreasing - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorValidROAs [07:57:52] (03CR) 10Ryan Kemper: "Only one blocking change (and I might be misunderstanding)" [puppet] - 10https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [07:59:25] (03PS6) 10Arnaudb: gerrit: remove mod_qos [puppet] - 10https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) [08:01:08] (03CR) 10Arnaudb: "I've added a new PS, with 2 small fixes for mtail: https://w.wiki/JFzq" [puppet] - 10https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: 10Arnaudb) [08:01:14] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: 10Arnaudb) [08:07:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1 [08:07:57] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1 [08:09:44] (03PS1) 10Aqu: dse-k8s-services 
Airflow: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249094 (https://phabricator.wikimedia.org/T415874) [08:16:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo02 and group 1 [08:16:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo02 and group 1 [08:19:43] (03CR) 10Brouberol: [C:03+2] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249094 (https://phabricator.wikimedia.org/T415874) (owner: 10Aqu) [08:21:06] (03Abandoned) 10Ryan Kemper: pyrra: fix wdqs availability SLO config [puppet] - 10https://gerrit.wikimedia.org/r/1235891 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [08:21:09] (03Abandoned) 10Ryan Kemper: pyrra: absent old per-dc wdqs availability configs [puppet] - 10https://gerrit.wikimedia.org/r/1235892 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [08:21:11] (03Abandoned) 10Ryan Kemper: pyrra: remove previously absented wdqs avail SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1235893 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [08:21:19] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[08:22:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [08:23:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [08:23:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [08:23:52] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:24:14] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:24:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [08:25:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [08:25:20] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:25:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [08:25:52] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:25:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo02 and group 1 [08:27:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[08:27:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo02 and group 1 [08:27:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:27:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [08:28:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [08:28:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [08:28:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:29:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [08:29:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [08:30:19] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11686288 (10MoritzMuehlenhoff) [08:30:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [08:30:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [08:31:26] (03CR) 10Ayounsi: [C:03+2] ulsfo routed ganeti fix v6 typo in neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/1249092 (owner: 10Ayounsi) [08:31:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [08:32:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:32:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:35:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [08:35:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [08:36:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [08:37:03] (03PS1) 10Muehlenhoff: thanos::sidecar: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1249196 [08:37:10] FIRING: [2x] GanetiBGPDown: BGP session down between ganeti4005 and cr3-ulsfo - group Ganeti6 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [08:37:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [08:42:10] RESOLVED: [2x] GanetiBGPDown: BGP session down between ganeti4005 and cr3-ulsfo - group Ganeti6 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [08:45:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249196 (owner: 10Muehlenhoff) [08:46:01] (03PS1) 10Muehlenhoff: Add prometheus4003 [puppet] - 10https://gerrit.wikimedia.org/r/1249201 (https://phabricator.wikimedia.org/T418993) [08:46:12] (03PS2) 10Muehlenhoff: Add prometheus4003 [puppet] - 10https://gerrit.wikimedia.org/r/1249201 (https://phabricator.wikimedia.org/T418993) [08:46:46] (03CR) 10CI reject: [V:04-1] Add prometheus4003 [puppet] - 
10https://gerrit.wikimedia.org/r/1249201 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [08:48:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:49:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:50:05] (03PS1) 10Brouberol: Sort and visually align airflow records [dns] - 10https://gerrit.wikimedia.org/r/1249202 (https://phabricator.wikimedia.org/T417213) [08:50:07] (03PS1) 10Brouberol: Provision the airflow-fr-tech internal and public DNS records [dns] - 10https://gerrit.wikimedia.org/r/1249203 (https://phabricator.wikimedia.org/T417213) [08:50:28] (03PS1) 10Kevin Bazira: ml-services: remove unused env vars in embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249204 (https://phabricator.wikimedia.org/T418976) [08:51:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:52:19] (03PS3) 10Muehlenhoff: Add prometheus4003 [puppet] - 10https://gerrit.wikimedia.org/r/1249201 (https://phabricator.wikimedia.org/T418993) [08:56:26] (03PS1) 10Brouberol: deployment_server: sort the services alphabetically [puppet] - 
10https://gerrit.wikimedia.org/r/1249205 (https://phabricator.wikimedia.org/T417213) [08:59:42] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8234/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249205 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [08:59:46] (03CR) 10Muehlenhoff: [C:03+2] Add prometheus4003 [puppet] - 10https://gerrit.wikimedia.org/r/1249201 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:01:02] (03CR) 10Ozge: [C:03+2] ml-services: remove unused env vars in embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249204 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [09:03:16] (03Merged) 10jenkins-bot: ml-services: remove unused env vars in embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249204 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [09:05:03] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[09:06:40] (PS2) Brouberol: deployment_server: sort the services alphabetically [puppet] - https://gerrit.wikimedia.org/r/1249205 (https://phabricator.wikimedia.org/T417213)
[09:06:40] (PS1) Brouberol: deployment_server: deduplicate airflow private files configuration [puppet] - https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213)
[09:06:47] (PS1) Brouberol: deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213)
[09:07:30] (CR) CI reject: [V:-1] deployment_server: deduplicate airflow private files configuration [puppet] - https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol)
[09:07:47] (CR) CI reject: [V:-1] deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol)
[09:08:26] (PS1) Muehlenhoff: Add new durum/ncredir hosts for routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1249208 (https://phabricator.wikimedia.org/T418993)
[09:08:55] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:10:30] ops-magru, SRE, Infrastructure-Foundations, netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11686377 (ayounsi) a:RobH Rob, could you prioritize this? Thanks
[09:13:17] (CR) Ayounsi: [C:+1] Add new durum/ncredir hosts for routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1249208 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[09:14:45] (CR) Jelto: [C:+1] "lgtm, thanks for the cleanup" [puppet] - https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: Arnaudb)
[09:20:48] (PS1) Kevin Bazira: ml-services: add MAX_MODEL_LEN performance optimization env var to embeddings-staging isvc [deployment-charts] - https://gerrit.wikimedia.org/r/1249209 (https://phabricator.wikimedia.org/T418976)
[09:22:59] (CR) Ozge: [C:+2] ml-services: add MAX_MODEL_LEN performance optimization env var to embeddings-staging isvc [deployment-charts] - https://gerrit.wikimedia.org/r/1249209 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[09:25:22] (Merged) jenkins-bot: ml-services: add MAX_MODEL_LEN performance optimization env var to embeddings-staging isvc [deployment-charts] - https://gerrit.wikimedia.org/r/1249209 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[09:29:06] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[09:31:12] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host frdb1008
[09:31:15] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host frdb1008
[09:31:25] (CR) JavierMonton: [C:+1] aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[09:33:49] (CR) Volans: [C:+2] transports: refactor State implementation [software/cumin] - https://gerrit.wikimedia.org/r/1224031 (owner: Volans)
[09:35:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host prometheus4003.ulsfo.wmnet
[09:35:19] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:35:40] SRE, Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11686454 (hashar) For reference see T384804 where I did a breakdown per bots and components they report, I built it based on the [[ https://wm-bot...
[09:38:18] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11686491 (BTullis)
[09:38:58] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11686506 (BTullis) p:Triage→Low
[09:39:04] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11686507 (ayounsi) a:ayounsi→Papaul @papaul, can you try a factory reset of the switch from rack 23? (the one failing the TLS cookbook). I'm also still...
[09:39:17] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11686512 (10BTullis) p:05Triage→03Low [09:40:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:40:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus4003.ulsfo.wmnet [09:40:15] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr1-drmrs:9804) - https://phabricator.wikimedia.org/T416987#11686528 (10ayounsi) 05Open→03Resolved [09:42:50] (03PS2) 10Brouberol: deployment_server: deduplicate airflow private files configuration [puppet] - 10https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) [09:42:50] (03PS2) 10Brouberol: deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) [09:42:51] (03PS1) 10Brouberol: data: add dwisehaupt to the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) [09:42:59] (03CR) 10Arnaudb: [C:03+2] gerrit: remove mod_qos [puppet] - 10https://gerrit.wikimedia.org/r/1248843 (https://phabricator.wikimedia.org/T417615) (owner: 10Arnaudb) [09:43:05] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:43:27] (03CR) 10Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [09:43:31] (03CR) 10CI reject: [V:04-1] deployment_server: deduplicate airflow private files configuration [puppet] - 10https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [09:43:50] (03CR) 10CI reject: [V:04-1] deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] 
- 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [09:44:37] (03PS1) 10Kevin Bazira: ml-services: add MAX_NUM_BATCHED_TOKENS performance optimization env var to embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249213 (https://phabricator.wikimedia.org/T418976) [09:46:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:46:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:46:57] (03Merged) 10jenkins-bot: transports: refactor State implementation [software/cumin] - 10https://gerrit.wikimedia.org/r/1224031 (owner: 10Volans) [09:47:09] (03CR) 10Ozge: [C:03+2] ml-services: add MAX_NUM_BATCHED_TOKENS performance optimization env var to embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249213 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [09:47:35] (03CR) 10Volans: [C:03+2] transports: add shortened method to Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/1224032 (owner: 10Volans) [09:48:20] (03PS3) 10Ayounsi: [WIP] Add depool strategy for rack depool cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1243077 (https://phabricator.wikimedia.org/T327300) [09:49:38] (03Merged) 10jenkins-bot: ml-services: add MAX_NUM_BATCHED_TOKENS performance optimization env var to embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249213 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [09:49:42] (03CR) 10Brouberol: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1249206 
(https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [09:49:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:50:33] (03CR) 10Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [09:50:35] (03CR) 10Brouberol: [C:03+2] aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [09:51:32] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [09:53:50] (03PS1) 10Brouberol: kafka-test: disable the mirrormaker instance [puppet] - 10https://gerrit.wikimedia.org/r/1249215 (https://phabricator.wikimedia.org/T417407) [09:54:37] (03CR) 10Brouberol: [C:03+1] "Nice! 
Thank you" [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [09:54:41] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8235/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249215 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [09:56:38] (03CR) 10JavierMonton: [C:03+1] kafka-test: disable the mirrormaker instance [puppet] - 10https://gerrit.wikimedia.org/r/1249215 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [09:56:40] (03PS3) 10Brouberol: deployment_server: sort the services alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1249205 (https://phabricator.wikimedia.org/T417213) [09:58:40] (03CR) 10N509FZ: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248906 (https://phabricator.wikimedia.org/T419265) (owner: 10SBassett) [09:58:46] (03CR) 10Elukey: [C:03+2] Add the sre.kafka.change-confluent-distro-version cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [09:59:51] (03PS1) 10JavierMonton: stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249217 (https://phabricator.wikimedia.org/T419258) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1000) [10:00:10] (03Merged) 10jenkins-bot: transports: add shortened method to Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/1224032 (owner: 10Volans) [10:00:49] (03CR) 10Volans: [C:03+2] transports: add new API for the execution results [software/cumin] - 10https://gerrit.wikimedia.org/r/1224033 (owner: 10Volans) [10:04:15] (03PS1) 10Brouberol: dse-k8s-eqiad: provision the airflow-fr-tech namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249218 
(https://phabricator.wikimedia.org/T417213) [10:06:06] (03PS1) 10Brouberol: dse-k8s-eqiad: add the airflow-fr-tech ns to the ceph tenant list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249219 (https://phabricator.wikimedia.org/T417213) [10:06:34] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1243077 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [10:09:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:10:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:12:15] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:13:56] (03PS1) 10Arnaudb: gerrit: clean up post mod_qos removal [puppet] - 10https://gerrit.wikimedia.org/r/1249221 (https://phabricator.wikimedia.org/T417615) [10:14:03] 06SRE, 06Infrastructure-Foundations, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11686798 (10jijiki) [10:14:13] (03Merged) 10jenkins-bot: transports: add new API for the execution results [software/cumin] - 10https://gerrit.wikimedia.org/r/1224033 (owner: 10Volans) [10:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:14:48] (03PS1) 10Brouberol: dse-k8s-eqiad: provision the airflow-fr-tech instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249222 (https://phabricator.wikimedia.org/T417213) [10:15:00] (03CR) 10Volans: [C:03+2] tests: fix integration tests error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/1224034 (owner: 10Volans) [10:15:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:15:41] (03CR) 10Brouberol: [C:03+1] Grant members of analytics-product-users access to an-coord hosts [puppet] - 10https://gerrit.wikimedia.org/r/1248771 (https://phabricator.wikimedia.org/T419167) (owner: 10Btullis) [10:15:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:16:22] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1249221 (https://phabricator.wikimedia.org/T417615) (owner: 10Arnaudb) [10:17:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:18:55] (03CR) 10Arnaudb: [C:03+2] gerrit: clean up post mod_qos removal [puppet] - 10https://gerrit.wikimedia.org/r/1249221 (https://phabricator.wikimedia.org/T417615) (owner: 10Arnaudb) [10:22:39] 
10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdg) failed on ms-be2064 - https://phabricator.wikimedia.org/T419394 (10MatthewVernon) 03NEW [10:22:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdg) failed on ms-be2064 - https://phabricator.wikimedia.org/T419394#11686833 (10MatthewVernon) p:05Triage→03High [10:23:39] (03PS1) 10Elukey: cloud: update the PKI's api trusted CA certificate [puppet] - 10https://gerrit.wikimedia.org/r/1249224 (https://phabricator.wikimedia.org/T419099) [10:24:37] (03CR) 10Ayounsi: [C:03+2] [WIP] Add depool strategy for rack depool cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1243077 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [10:25:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:25:28] (03PS1) 10Brouberol: idp: provision the airflow_fr_tech placeholder client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1249225 [10:25:37] (03CR) 10Brouberol: [C:03+2] idp: provision the airflow_fr_tech placeholder client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1249225 (owner: 10Brouberol) [10:25:41] (03CR) 10Brouberol: [V:03+2 C:03+2] idp: provision the airflow_fr_tech placeholder client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1249225 (owner: 10Brouberol) [10:26:03] (03CR) 10Elukey: [C:03+1] kafka-test: disable the mirrormaker instance [puppet] - 10https://gerrit.wikimedia.org/r/1249215 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [10:26:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:27:12] (03Merged) 10jenkins-bot: tests: fix integration tests error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/1224034 (owner: 10Volans) [10:29:40] (03CR) 10Volans: [C:03+2] clustershell: implement the new API for results [software/cumin] - 10https://gerrit.wikimedia.org/r/1224035 (owner: 10Volans) [10:31:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:32:29] (03PS1) 10Brouberol: idp: add the airflow_fr_tech service [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) [10:33:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:33:45] (03CR) 10Brouberol: [V:03+1 C:03+2] kafka-test: disable the mirrormaker instance [puppet] - 10https://gerrit.wikimedia.org/r/1249215 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [10:34:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:35:15] (03PS2) 10Elukey: profile::pyrra: rework wdqs availability SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1248760 
(https://phabricator.wikimedia.org/T393966) [10:35:15] (03PS2) 10Elukey: profile::pyrra: remove old wdqs SLO configs [puppet] - 10https://gerrit.wikimedia.org/r/1248761 (https://phabricator.wikimedia.org/T393966) [10:35:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:37:25] (03CR) 10Brouberol: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [10:37:25] (03CR) 10Elukey: profile::pyrra: rework wdqs availability SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [10:37:30] (03CR) 10Brouberol: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [10:37:31] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248760 (https://phabricator.wikimedia.org/T393966) (owner: 10Elukey) [10:39:02] (03Merged) 10jenkins-bot: clustershell: implement the new API for results [software/cumin] - 10https://gerrit.wikimedia.org/r/1224035 (owner: 10Volans) [10:39:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host prometheus4003.ulsfo.wmnet [10:39:21] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:39:23] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:39:42] (03CR) 10Volans: [C:03+2] tests: add CLI comparison tests [software/cumin] - 10https://gerrit.wikimedia.org/r/1224036 (owner: 10Volans) [10:40:43] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:41:44] (03PS1) 10Effie Mouzeli: api-gateway: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249232 (https://phabricator.wikimedia.org/T412693) [10:43:21] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add records for VM prometheus4003.ulsfo.wmnet - jmm@cumin2002" [10:43:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus4003.ulsfo.wmnet - jmm@cumin2002" [10:43:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:43:27] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache prometheus4003.ulsfo.wmnet on all recursors [10:43:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus4003.ulsfo.wmnet on all recursors [10:43:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:43:49] (03PS1) 10Brouberol: Reduce the allotted memory to the mm container from 8GB to 3GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249235 (https://phabricator.wikimedia.org/T417407) [10:44:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus4003.ulsfo.wmnet - jmm@cumin2002" [10:44:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus4003.ulsfo.wmnet - jmm@cumin2002" [10:45:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus4003.ulsfo.wmnet with OS bookworm [10:45:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus4003.ulsfo.wmnet with OS bookworm [10:45:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus4003.ulsfo.wmnet [10:47:09] (03CR) 10Brouberol: [C:03+2] Reduce the allotted memory to 
the mm container from 8GB to 3GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249235 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [10:49:48] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:49:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:50:05] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:51:40] (03Merged) 10jenkins-bot: tests: add CLI comparison tests [software/cumin] - 10https://gerrit.wikimedia.org/r/1224036 (owner: 10Volans) [10:54:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:56:50] (03PS1) 10Brouberol: kafka: allow ingress traffic to test/jumbo clusters from the aux k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1249240 (https://phabricator.wikimedia.org/T417407) [10:56:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus4003.ulsfo.wmnet with OS bookworm [10:57:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11686982 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host prometheus4003.ulsfo.wmnet with OS bookworm 
[10:57:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus4003.ulsfo.wmnet with OS bookworm [10:57:40] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11686983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host prometheus4003.ulsfo.wmnet with OS bookworm executed with er... [10:58:46] (03CR) 10JMeybohm: [C:03+1] kafka: allow ingress traffic to test/jumbo clusters from the aux k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1249240 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [10:59:10] (03CR) 10Brouberol: [C:03+2] kafka: allow ingress traffic to test/jumbo clusters from the aux k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1249240 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [10:59:10] (03CR) 10Elukey: [C:03+1] kafka: allow ingress traffic to test/jumbo clusters from the aux k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1249240 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:03:15] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [11:03:35] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [11:04:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:05:02] (03PS1) 10Effie Mouzeli: eventgate, eventstreams: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249241 
(https://phabricator.wikimedia.org/T412693) [11:05:11] (03PS1) 10Phuedx: ext.wikimediaEvents: pageVisit -> loggedOutReaderRetention [extensions/WikimediaEvents] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249242 (https://phabricator.wikimedia.org/T419191) [11:05:24] (03PS1) 10Phuedx: JS SDK: Add getExperimentByPrefix() [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249243 (https://phabricator.wikimedia.org/T419191) [11:05:31] (03CR) 10CI reject: [V:04-1] ext.wikimediaEvents: pageVisit -> loggedOutReaderRetention [extensions/WikimediaEvents] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249242 (https://phabricator.wikimedia.org/T419191) (owner: 10Phuedx) [11:05:47] (03CR) 10Federico Ceratto: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1248489 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [11:07:22] (03CR) 10Muehlenhoff: [C:03+2] Add new durum/ncredir hosts for routed Ganeti/ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1249208 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [11:07:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249243 (https://phabricator.wikimedia.org/T419191) (owner: 10Phuedx) [11:07:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249242 (https://phabricator.wikimedia.org/T419191) (owner: 10Phuedx) [11:09:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:11:05] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Stewards-Onboarding-Tool, 07Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#11687071 (10kostajh) Bumping this task -- I think this would be a useful improvement, for example with the inciden... [11:12:39] (03PS1) 10Phuedx: Hooks: Really only add global logging context for pageviews [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249245 [11:12:52] Hello 👋 [11:13:31] I'd like to deploy that patch ^ ASAP as it causes ~9 million event validation errors over the weekend [11:14:24] (03CR) 10Btullis: [C:03+2] Grant members of analytics-product-users access to an-coord hosts [puppet] - 10https://gerrit.wikimedia.org/r/1248771 (https://phabricator.wikimedia.org/T419167) (owner: 10Btullis) [11:15:46] jouncebot: nowandnext [11:15:46] No deployments scheduled for the next 1 hour(s) and 44 minute(s) [11:15:47] In 1 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1300) [11:16:18] phuedx: I think you can probably go ahead in a few minutes if nobody objects [11:16:21] or do you need a deployer? 
[11:17:01] I can deploy it :) [11:18:04] (03CR) 10Phuedx: "Recheck" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249245 (owner: 10Phuedx) [11:19:10] (03CR) 10Volans: [C:03+2] tests: fix shellcheck issues [software/cumin] - 10https://gerrit.wikimedia.org/r/1224037 (owner: 10Volans) [11:20:39] (03PS1) 10Effie Mouzeli: eventrouter: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249247 (https://phabricator.wikimedia.org/T412693) [11:22:11] ok 👍 [11:23:26] (03CR) 10Filippo Giunchedi: [C:03+1] cloud: update the PKI's api trusted CA certificate [puppet] - 10https://gerrit.wikimedia.org/r/1249224 (https://phabricator.wikimedia.org/T419099) (owner: 10Elukey) [11:24:05] (03CR) 10Phuedx: "check codehealth" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249245 (owner: 10Phuedx) [11:24:41] (03PS1) 10Effie Mouzeli: ipoid: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249249 (https://phabricator.wikimedia.org/T412693) [11:26:48] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Stewards-Onboarding-Tool, 07Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#11687275 (10MoritzMuehlenhoff) How many people are we potentially looking at here? 1-2 dozens or a three digit num... 
[11:27:33] (03PS6) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [11:29:28] I'm not quite sure what's going on with that codehealth failure but it looks to be unrelated to the change [11:29:33] > Could not open input file: tests/phpunit/generatePHPUnitConfig.php [11:29:40] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [11:29:44] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [11:29:56] mwext-phan-php83 also succeeded [11:30:09] OK. Deploying. Let's get these validation errors under control [11:30:17] (03PS1) 10Effie Mouzeli: kask: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249250 (https://phabricator.wikimedia.org/T412693) [11:30:34] (03PS7) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [11:30:52] (03Merged) 10jenkins-bot: tests: fix shellcheck issues [software/cumin] - 10https://gerrit.wikimedia.org/r/1224037 (owner: 10Volans) [11:31:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249245 (owner: 10Phuedx) [11:31:19] 06SRE, 06DBA, 10Observability-Alerting: single DB server replag / downtime should not page us anymore - https://phabricator.wikimedia.org/T396816#11687294 (10FCeratto-WMF) a:03FCeratto-WMF [11:31:28] (03PS1) 10Majavah: P:wmcs: kubeadm: preflight_checks: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/1249251 [11:32:19] (03Merged) 10jenkins-bot: Hooks: Really only add global logging context for pageviews [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249245 (owner: 10Phuedx) 
[11:32:40] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1249245|Hooks: Really only add global logging context for pageviews]] [11:34:05] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Stewards-Onboarding-Tool, 07Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#11687312 (10Urbanecm) > How many people are we potentially looking at here? 1-2 dozens or a three digit number? @... [11:34:27] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1249245|Hooks: Really only add global logging context for pageviews]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:34:55] Checking the test servers [11:35:57] (03CR) 10Volans: [C:03+2] clustershell: add command index to output headers [software/cumin] - 10https://gerrit.wikimedia.org/r/1224038 (owner: 10Volans) [11:37:30] (03PS2) 10Effie Mouzeli: eventrouter: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249247 (https://phabricator.wikimedia.org/T412693) [11:37:54] A quick browse of random articles looks OK. 
No obvious errors in the logs [11:38:18] !log phuedx@deploy2002 phuedx: Continuing with sync [11:43:14] (03PS3) 10Reedy: Revert "CommonSettings: Temporarily set $wgOATHUserHandlesTable = true" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239026 (https://phabricator.wikimedia.org/T416544) [11:43:14] (03PS1) 10Reedy: CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249253 [11:44:14] (03PS2) 10Effie Mouzeli: api-gateway: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249232 (https://phabricator.wikimedia.org/T412693) [11:44:22] (03CR) 10CI reject: [V:04-1] api-gateway: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249232 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [11:44:28] (03PS2) 10Effie Mouzeli: ipoid: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249249 (https://phabricator.wikimedia.org/T412693) [11:44:42] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249245|Hooks: Really only add global logging context for pageviews]] (duration: 12m 02s) [11:44:42] (03PS1) 10Elukey: aptrepo: add confluent77 component to install Kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1249254 (https://phabricator.wikimedia.org/T416670) [11:44:53] (03CR) 10Fabfur: varnish: Set custom glb_requests_limit for thumbs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248819 (owner: 10Vgutierrez) [11:45:03] (03PS1) 10Kevin Bazira: ml-services: add VLLM_ROCM_USE_AITER performance optimization env var to embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249255 (https://phabricator.wikimedia.org/T418976) [11:45:41] FIRING: SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:38] FIRING: GnmiTargetDown: cr2-eqdfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [11:46:44] (03PS3) 10Effie Mouzeli: ipoid: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249249 (https://phabricator.wikimedia.org/T412693) [11:47:08] (03PS1) 10Ayounsi: Routed ganeti: net-common: fix if/else/fi logic error [puppet] - 10https://gerrit.wikimedia.org/r/1249257 (https://phabricator.wikimedia.org/T410314) [11:47:24] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249257 (https://phabricator.wikimedia.org/T410314) (owner: 10Ayounsi) [11:48:04] (03Merged) 10jenkins-bot: clustershell: add command index to output headers [software/cumin] - 10https://gerrit.wikimedia.org/r/1224038 (owner: 10Volans) [11:48:11] (03PS2) 10Effie Mouzeli: eventgate, eventstreams: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249241 (https://phabricator.wikimedia.org/T412693) [11:48:59] (03CR) 10Vgutierrez: [V:03+1] varnish: Set custom glb_requests_limit for thumbs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248819 (owner: 10Vgutierrez) [11:49:21] (03Abandoned) 10Effie Mouzeli: api-gateway: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249232 (https://phabricator.wikimedia.org/T412693) (owner: 10Effie Mouzeli) [11:50:55] jouncebot: nowandnext [11:50:55] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [11:50:55] In 1 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1300) [11:51:29] phuedx: Anything else to deploy? :P [11:51:45] Reedy: I mean… I've got a list :D [11:51:57] Reedy: I'm all done. 
It's all yours [11:52:09] (03CR) 10Reedy: [C:03+2] Revert "CommonSettings: Temporarily set $wgOATHUserHandlesTable = true" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239026 (https://phabricator.wikimedia.org/T416544) (owner: 10Reedy) [11:52:33] (03CR) 10Reedy: [C:03+2] CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249253 (owner: 10Reedy) [11:52:42] (03PS2) 10Effie Mouzeli: kask: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249250 (https://phabricator.wikimedia.org/T412693) [11:52:49] (03PS3) 10Effie Mouzeli: eventrouter: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249247 (https://phabricator.wikimedia.org/T412693) [11:52:56] (03PS3) 10Effie Mouzeli: eventgate, eventstreams: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249241 (https://phabricator.wikimedia.org/T412693) [11:53:10] (03Merged) 10jenkins-bot: Revert "CommonSettings: Temporarily set $wgOATHUserHandlesTable = true" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239026 (https://phabricator.wikimedia.org/T416544) (owner: 10Reedy) [11:53:17] (03PS1) 10Effie Mouzeli: api-gateway: add Chart.yaml metadata [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249259 (https://phabricator.wikimedia.org/T412693) [11:53:28] (03Merged) 10jenkins-bot: CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249253 (owner: 10Reedy) [11:53:30] (03CR) 10Volans: [C:03+2] doc: update for the new API [software/cumin] - 10https://gerrit.wikimedia.org/r/1224039 (owner: 10Volans) [11:54:19] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1239026|Revert "CommonSettings: Temporarily set $wgOATHUserHandlesTable = true" (T416544)]], [[gerrit:1249253|CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled]] [11:54:23] T416544: New database table for 
tracking WebAuthn userHandle values (oathauth_user_handles) - https://phabricator.wikimedia.org/T416544 [11:55:30] (03PS1) 10Majavah: kubeadm: Always install conntrack package [puppet] - 10https://gerrit.wikimedia.org/r/1249260 [11:56:10] !log reedy@deploy2002 reedy: Backport for [[gerrit:1239026|Revert "CommonSettings: Temporarily set $wgOATHUserHandlesTable = true" (T416544)]], [[gerrit:1249253|CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:56:37] !log reedy@deploy2002 reedy: Continuing with sync [11:56:50] (03PS1) 10Muehlenhoff: bitu: Fix argument passing to the permission cleaner module [puppet] - 10https://gerrit.wikimedia.org/r/1249261 (https://phabricator.wikimedia.org/T416152) [11:56:59] (03CR) 10Effie Mouzeli: [C:03+2] admin/lib.php: Remove curl_close() [puppet] - 10https://gerrit.wikimedia.org/r/1221219 (https://phabricator.wikimedia.org/T413538) (owner: 10Reedy) [11:57:36] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8236/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249260 (owner: 10Majavah) [12:00:33] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1239026|Revert "CommonSettings: Temporarily set $wgOATHUserHandlesTable = true" (T416544)]], [[gerrit:1249253|CommonSettings: Remove orphaned $wgWebAuthnNewCredsDisabled]] (duration: 06m 13s) [12:00:37] T416544: New database table for tracking WebAuthn userHandle values (oathauth_user_handles) - https://phabricator.wikimedia.org/T416544 [12:00:59] (03CR) 10Brouberol: [C:03+1] aptrepo: add confluent77 component to install Kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1249254 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [12:01:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1249257 (https://phabricator.wikimedia.org/T410314) (owner: 10Ayounsi) [12:01:32] (03CR) 10FNegri: [C:03+1] kubeadm: Always install conntrack package [puppet] - 10https://gerrit.wikimedia.org/r/1249260 (owner: 10Majavah) [12:02:15] (03CR) 10Ayounsi: [C:03+2] Routed ganeti: net-common: fix if/else/fi logic error [puppet] - 10https://gerrit.wikimedia.org/r/1249257 (https://phabricator.wikimedia.org/T410314) (owner: 10Ayounsi) [12:04:30] (03Merged) 10jenkins-bot: doc: update for the new API [software/cumin] - 10https://gerrit.wikimedia.org/r/1224039 (owner: 10Volans) [12:05:23] (03PS4) 10Brouberol: deployment_server: sort the services alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1249205 (https://phabricator.wikimedia.org/T417213) [12:05:23] (03PS3) 10Brouberol: deployment_server: deduplicate airflow private files configuration [puppet] - 10https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) [12:05:23] (03PS3) 10Brouberol: deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) [12:05:23] (03PS2) 10Brouberol: data: add dwisehaupt to the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) [12:05:24] (03PS2) 10Brouberol: idp: add the airflow_fr_tech service [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) [12:06:14] (03CR) 10CI reject: [V:04-1] deployment_server: deduplicate airflow private files configuration [puppet] - 10https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [12:06:30] (03CR) 10CI reject: [V:04-1] deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [12:06:38] (03CR) 10Majavah: 
[V:03+1 C:03+2] kubeadm: Always install conntrack package [puppet] - 10https://gerrit.wikimedia.org/r/1249260 (owner: 10Majavah) [12:07:01] (03CR) 10Muehlenhoff: "The patch adds him to analytics-privatedata-users, not airflow-deployers?" [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [12:09:10] (03PS1) 10Santiago Faci: Disable MetricsPlatform extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249262 (https://phabricator.wikimedia.org/T416865) [12:10:07] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM also lol re: $swap not used since 2021" [puppet] - 10https://gerrit.wikimedia.org/r/1249251 (owner: 10Majavah) [12:10:17] (03CR) 10Majavah: [C:03+2] P:wmcs: kubeadm: preflight_checks: Remove unused variable [puppet] - 10https://gerrit.wikimedia.org/r/1249251 (owner: 10Majavah) [12:10:33] (03PS5) 10Brouberol: deployment_server: sort the services alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1249205 (https://phabricator.wikimedia.org/T417213) [12:10:33] (03PS4) 10Brouberol: deployment_server: deduplicate airflow private files configuration [puppet] - 10https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) [12:10:33] (03PS4) 10Brouberol: deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) [12:10:33] (03PS3) 10Brouberol: data: add dwisehaupt to the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) [12:10:34] (03PS3) 10Brouberol: idp: add the airflow_fr_tech service [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) [12:11:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus4003.ulsfo.wmnet with OS bookworm [12:11:35] (03CR) 10CI reject: [V:04-1] deployment_server: deduplicate airflow private files configuration 
[puppet] - 10https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [12:11:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11687454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host prometheus4003.ulsfo.wmnet with OS bookworm [12:11:50] (03CR) 10CI reject: [V:04-1] deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [12:13:11] (03CR) 10Santiago Faci: "Cool idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [12:14:28] (03PS1) 10Muehlenhoff: Add netflow4003 [puppet] - 10https://gerrit.wikimedia.org/r/1249264 (https://phabricator.wikimedia.org/T418993) [12:14:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:15:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11687464 (10Jclark-ctr) @BTullis parts have arrived. 
please advise when they can be swapped [12:15:36] (03CR) 10Phuedx: [C:03+1] Disable MetricsPlatform extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249262 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [12:15:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249262 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [12:16:50] (03PS5) 10Brouberol: deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) [12:16:50] (03PS4) 10Brouberol: data: add dwisehaupt to the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) [12:16:50] (03PS4) 10Brouberol: idp: add the airflow_fr_tech service [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) [12:16:59] (03Abandoned) 10Brouberol: deployment_server: deduplicate airflow private files configuration [puppet] - 10https://gerrit.wikimedia.org/r/1249206 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [12:17:40] (03CR) 10CI reject: [V:04-1] deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [12:17:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:19:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 
13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11687475 (10ayounsi) [12:22:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:23:13] (03PS6) 10Brouberol: deployment_server: sort the services alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1249205 (https://phabricator.wikimedia.org/T417213) [12:23:13] (03PS6) 10Brouberol: deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) [12:23:13] (03PS5) 10Brouberol: data: add dwisehaupt to the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) [12:23:13] (03PS5) 10Brouberol: idp: add the airflow_fr_tech service [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) [12:24:16] (03PS1) 10Majavah: kubeadm: containerd: Add explicit dependency on package [puppet] - 10https://gerrit.wikimedia.org/r/1249267 [12:27:12] (03CR) 10Fabfur: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1248819 (owner: 10Vgutierrez) [12:27:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:28:04] (03CR) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [12:29:38] !log installing python3.9 security updates [12:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:34:26] (03CR) 10Dpogorzelski: [C:03+1] "I imagine all clusters simply read the defaults so this is redundant" [puppet] - 10https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:35:21] (03CR) 10Dpogorzelski: [C:03+1] httpbb: fix rec-api-ng test [puppet] - 10https://gerrit.wikimedia.org/r/1248481 (owner: 10AikoChou) [12:35:40] (03CR) 10Dpogorzelski: [C:03+1] httpbb: fix ores-legacy test [puppet] - 10https://gerrit.wikimedia.org/r/1248491 (owner: 10AikoChou) [12:37:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:38:14] (03CR) 10Ozge: [C:03+2] ml-services: add VLLM_ROCM_USE_AITER performance optimization env var to embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249255 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [12:38:19] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249273 [12:38:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:40:35] (03Merged) 10jenkins-bot: ml-services: add VLLM_ROCM_USE_AITER performance optimization env var to embeddings-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249255 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [12:43:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:44:32] (03CR) 10Ayounsi: [C:03+1] Add netflow4003 [puppet] - 10https://gerrit.wikimedia.org/r/1249264 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [12:45:40] 06SRE, 06SRE-OnFire, 10Incident Tooling, 10Stewards-Onboarding-Tool, 07Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#11687545 (10MoritzMuehlenhoff) >>! 
In T343377#11687312, @Urbanecm wrote: > Would you mind expanding on this? As fa... [12:46:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:48:22] (03CR) 10Filippo Giunchedi: [C:03+1] kubeadm: containerd: Add explicit dependency on package [puppet] - 10https://gerrit.wikimedia.org/r/1249267 (owner: 10Majavah) [12:49:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:49:29] (03CR) 10Majavah: [C:03+2] kubeadm: containerd: Add explicit dependency on package [puppet] - 10https://gerrit.wikimedia.org/r/1249267 (owner: 10Majavah) [12:53:10] (03PS2) 10Santiago Faci: Removes `MetricsPlatform` configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) [12:53:19] (03PS3) 10Santiago Faci: Remove `MetricsPlatform` configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) [12:53:26] (03PS4) 10Santiago Faci: Remove `MetricsPlatform` configuration from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) [12:53:55] (03CR) 10Santiago Faci: Remove `MetricsPlatform` configuration from production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [12:53:57] 06SRE, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from 
Bullseye - https://phabricator.wikimedia.org/T419212#11687563 (10MoritzMuehlenhoff) [12:54:26] (03PS5) 10Santiago Faci: Remove `MetricsPlatform` configuration from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) [12:54:33] (03PS6) 10Santiago Faci: Remove `MetricsPlatform` configuration from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) [12:55:46] !log installing Kerberos security updates [12:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:44] (03PS1) 10Slyngshede: CAS 7.3.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1249279 (https://phabricator.wikimedia.org/T419419) [12:56:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:57:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1300). [13:00:05] manfredi and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:40] o/ [13:04:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus4003.ulsfo.wmnet with OS bookworm [13:04:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11687634 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host prometheus4003.ulsfo.wmnet with OS bookworm executed with er... [13:05:46] (03PS2) 10Slyngshede: CAS 7.3.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1249279 (https://phabricator.wikimedia.org/T419419) [13:05:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:07:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:08:05] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Send low-volume services via new nodes [puppet] - 10https://gerrit.wikimedia.org/r/1249286 (https://phabricator.wikimedia.org/T419231) [13:08:21] (03Abandoned) 10Slyngshede: Update to 
CAS version 7.3.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1223679 (owner: 10Slyngshede) [13:08:43] (03CR) 10JMeybohm: [C:03+1] Add new rdb101[56] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247977 (https://phabricator.wikimedia.org/T418916) (owner: 10Clément Goubert) [13:09:14] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:09:30] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8238/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249286 (https://phabricator.wikimedia.org/T419231) (owner: 10Majavah) [13:09:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:10:13] (03CR) 10JMeybohm: [C:03+1] Add new wikikube-worker23[57-74] [puppet] - 10https://gerrit.wikimedia.org/r/1247989 (https://phabricator.wikimedia.org/T418925) (owner: 10Clément Goubert) [13:10:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus4003.ulsfo.wmnet with OS bookworm [13:10:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11687657 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host prometheus4003.ulsfo.wmnet with OS bookworm [13:10:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:22] (03CR) 10JMeybohm: [C:03+1] aptrepo: add confluent77 component to install Kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1249254 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [13:12:22] manfredi: yt? [13:12:45] (03CR) 10Jforrester: [C:03+1] Disable MetricsPlatform extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249262 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [13:12:47] (03PS6) 10Daniel Kinzler: rest-gateway: add CORS support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) [13:13:58] (03CR) 10Daniel Kinzler: rest-gateway: add CORS support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [13:14:06] I am around [13:14:44] manfredi: I can deploy your changes. Are you able to verify them? 
[13:14:57] yes ready [13:16:29] (03CR) 10Btullis: [C:03+1] Sort and visually align airflow records [dns] - 10https://gerrit.wikimedia.org/r/1249202 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:16:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248075 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [13:16:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247651 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [13:17:00] (03CR) 10Btullis: [C:03+1] Provision the airflow-fr-tech internal and public DNS records [dns] - 10https://gerrit.wikimedia.org/r/1249203 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:17:42] (03Merged) 10jenkins-bot: Enable confirmemail logstash channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247651 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [13:19:56] (03CR) 10Btullis: [C:03+1] deployment_server: sort the services alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1249205 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:20:12] (03PS1) 10Mszwarc: Hide 2fa-warning Echo category from preferences [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249291 (https://phabricator.wikimedia.org/T419111) [13:20:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249291 (https://phabricator.wikimedia.org/T419111) (owner: 10Mszwarc) [13:20:58] (03CR) 10Btullis: deployment_server: provision the airflow-fr-tech kubeconfigs (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:21:06] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#11687686 (10Ben.buchenau) 05Resolved→03Open Hello guys - follow-up request regarding Kerebos authentication: Can I get a `keytab` file for... [13:21:07] (03CR) 10Klausman: [C:03+1] kubernetes: Don't re-define default admission_plugins [puppet] - 10https://gerrit.wikimedia.org/r/1248812 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [13:21:36] phuedx: I'll have one more thing to deploy (I can do it myself). Could you ping me when you finish? [13:22:05] Msz2001: Will do 👍 [13:24:03] (03CR) 10Btullis: "nit: The change actually adds him to analytics-privatedata-users, but it happens that the membership list for this group is re-used for ai" [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:24:22] (03CR) 10Btullis: [C:03+1] idp: add the airflow_fr_tech service [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:24:42] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: provision the airflow-fr-tech namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249218 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:24:46] (03CR) 10Arnaudb: mailman: add lists to service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1247078 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [13:25:17] (03CR) 10Brouberol: "Oh you're right. I'd be happy to add him to `airflow-deployers` directly, but other fr-tech members are also in `analytics-privatedata`. 
E" [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:25:22] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: add the airflow-fr-tech ns to the ceph tenant list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249219 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:26:08] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: provision the airflow-fr-tech instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249222 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:28:24] (03Merged) 10jenkins-bot: Confirmemail: Log delay between email sent and confirmation [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248075 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [13:28:42] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1248075|Confirmemail: Log delay between email sent and confirmation (T415902)]], [[gerrit:1247651|Enable confirmemail logstash channel (T415902)]] [13:28:45] T415902: Instrument data for tracking email verifications - https://phabricator.wikimedia.org/T415902 [13:30:30] !log phuedx@deploy2002 mmartorana, phuedx: Backport for [[gerrit:1248075|Confirmemail: Log delay between email sent and confirmation (T415902)]], [[gerrit:1247651|Enable confirmemail logstash channel (T415902)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:30:45] (03PS7) 10Brouberol: deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) [13:30:45] (03PS6) 10Brouberol: data: add dwisehaupt to the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) [13:30:45] (03PS6) 10Brouberol: idp: add the airflow_fr_tech service [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) [13:30:46] manfredi: Over to you :) [13:31:30] (03CR) 10Brouberol: deployment_server: provision the airflow-fr-tech kubeconfigs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:31:58] (03CR) 10Ottomata: [C:03+2] stream: mw-page-html-content-change-enrich (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [13:32:00] (03Abandoned) 10Arnaudb: mailman: add lists to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1247078 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [13:33:21] o/ [13:33:25] forgot about daylight confusion time ^^ [13:34:07] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [13:34:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:35:00] (03CR) 10Ottomata: [C:03+1] stream: mediawiki.page_html_content_change [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1249217 (https://phabricator.wikimedia.org/T419258) (owner: 10JavierMonton) [13:35:10] phuedx - verified on mwdebug, logs look good [13:35:53] OK. Continuing [13:35:56] !log phuedx@deploy2002 mmartorana, phuedx: Continuing with sync [13:36:07] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Send low-volume services via new nodes [puppet] - 10https://gerrit.wikimedia.org/r/1249286 (https://phabricator.wikimedia.org/T419231) (owner: 10Majavah) [13:36:44] (03CR) 10Phuedx: "Recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249242 (https://phabricator.wikimedia.org/T419191) (owner: 10Phuedx) [13:36:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:36:56] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::k8s::haproxy: Send low-volume services via new nodes [puppet] - 10https://gerrit.wikimedia.org/r/1249286 (https://phabricator.wikimedia.org/T419231) (owner: 10Majavah) [13:37:46] (03CR) 10Mszwarc: [C:03+2] "Merging ahead of deployment, to speed things up" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249291 (https://phabricator.wikimedia.org/T419111) (owner: 10Mszwarc) [13:39:58] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248075|Confirmemail: Log delay between email sent and confirmation (T415902)]], [[gerrit:1247651|Enable confirmemail logstash channel (T415902)]] (duration: 11m 16s) [13:40:02] T415902: Instrument data for tracking email verifications - https://phabricator.wikimedia.org/T415902 [13:40:10] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 
10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11687791 (10Papaul) @ayounsi yes I can do that. Do we have like some Documentation on how to factory reset the the Nokia switch somewhere or it is just "delete /"... [13:40:31] Waiting for CI on the two experiment patches [13:40:35] I'll do the config changes [13:40:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249262 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [13:41:01] *change [13:41:20] (03CR) 10Mszwarc: "Not yet..." [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249291 (https://phabricator.wikimedia.org/T419111) (owner: 10Mszwarc) [13:41:44] (03PS2) 10Mszwarc: Hide 2fa-warning Echo category from preferences [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249291 (https://phabricator.wikimedia.org/T419111) [13:41:45] (03Merged) 10jenkins-bot: Disable MetricsPlatform extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249262 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [13:41:51] (03PS1) 10Volans: cli: update interactive banner for new API [software/cumin] - 10https://gerrit.wikimedia.org/r/1249297 [13:42:02] (03PS1) 10Bking: wdqs: repurpose wdqs2009 to test blazegraph alternatives [puppet] - 10https://gerrit.wikimedia.org/r/1249298 (https://phabricator.wikimedia.org/T415073) [13:42:04] Ok, sorry, I thought you did those earlier :) I'll wait [13:42:05] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1249262|Disable MetricsPlatform extension (T416865)]] [13:42:10] T416865: Remove references to MetricsPlatform extension - https://phabricator.wikimedia.org/T416865 [13:42:24] (03PS1) 10Anzx: kaiwiki: add logo, stiename, projectnamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249035 
(https://phabricator.wikimedia.org/T414237) [13:42:50] phuedx thx everything is working on my side [13:43:26] (03CR) 10Bking: [C:03+2] "self-merging, as these are non-production hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1249298 (https://phabricator.wikimedia.org/T415073) (owner: 10Bking) [13:43:33] Msz2001: Did you want to go next? I've got two changes left that I can do in one deployment? [13:43:42] This config change will be quick :) [13:43:44] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11687802 (10ABran-WMF) 05Open→03In progress thanks @Fabfur for the answers. We'll keep the configuration as it is now, with mailma... [13:43:54] !log phuedx@deploy2002 phuedx, sfaci: Backport for [[gerrit:1249262|Disable MetricsPlatform extension (T416865)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:44:07] It's not urgent, Mine can go with yours if you prefer to send them all in a single batch [13:44:22] (03CR) 10AikoChou: [C:03+1] pyrra(ML): fix updated revertrisk metric name [puppet] - 10https://gerrit.wikimedia.org/r/1248818 (https://phabricator.wikimedia.org/T419235) (owner: 10Dpogorzelski) [13:44:22] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11687820 (10ABran-WMF) [13:44:24] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [13:44:40] (03CR) 10Gergő Tisza: rest-gateway: add CORS support (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [13:46:10] I've checked enwiki, dewiki, and frwiki. No errors in the browser console. 
The logs look clear [13:46:15] !log phuedx@deploy2002 phuedx, sfaci: Continuing with sync [13:47:33] (03CR) 10Elukey: [C:03+2] aptrepo: add confluent77 component to install Kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1249254 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [13:48:55] (03PS1) 10Elukey: Revert "role::kafka::test::broker: upgrade Kafka to 3.5" [puppet] - 10https://gerrit.wikimedia.org/r/1249300 [13:49:17] (03CR) 10CI reject: [V:04-1] Revert "role::kafka::test::broker: upgrade Kafka to 3.5" [puppet] - 10https://gerrit.wikimedia.org/r/1249300 (owner: 10Elukey) [13:49:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin2003.codfw.wmnet [13:49:39] phuedx - I'll be afk for 5-10 minutes. I can deploy my patch later, it's not urgent [13:50:07] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249262|Disable MetricsPlatform extension (T416865)]] (duration: 08m 02s) [13:50:11] T416865: Remove references to MetricsPlatform extension - https://phabricator.wikimedia.org/T416865 [13:50:44] (03PS2) 10Elukey: Revert "role::kafka::test::broker: upgrade Kafka to 3.5" [puppet] - 10https://gerrit.wikimedia.org/r/1249300 [13:54:02] (03CR) 10Elukey: [C:03+2] Revert "role::kafka::test::broker: upgrade Kafka to 3.5" [puppet] - 10https://gerrit.wikimedia.org/r/1249300 (owner: 10Elukey) [13:54:24] There was a spike in DBConnectionErrors during the deployment but I don't see how it could have been related [13:54:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2009.codfw.wmnet with OS bullseye [13:55:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2003.codfw.wmnet [13:55:41] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Figure out plan for mailman IP situation - https://phabricator.wikimedia.org/T278495#11687872 (10ABran-WMF) I've been able to confirm headers are sent downstream: ` 2026-03-09T13:52:58 331643 
2620:0:861:3:208:80:154:81 proxy-server/200 9227 GET htt... [13:56:43] The logs have settled and the error has gone away [13:57:00] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11687873 (10ABran-WMF) I've been able to confirm headers are sent downstream: ` 2026-03-09T13:52:58 331643 2620:0:861:3:208:80:1... [13:57:39] Last two… [13:57:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249243 (https://phabricator.wikimedia.org/T419191) (owner: 10Phuedx) [13:57:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249242 (https://phabricator.wikimedia.org/T419191) (owner: 10Phuedx) [13:58:18] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11687890 (10ABran-WMF) [14:00:19] I'm back [14:01:56] (03CR) 10JMeybohm: "One thing to keep in mind is that with the current configuration all k8s resource will be named `aqs-http-gateway-...` (see CI output) whi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248148 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:03:39] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus4003.ulsfo.wmnet with OS bookworm [14:03:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11687921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host prometheus4003.ulsfo.wmnet with OS bookworm executed with er... 
[14:04:04] Msz2001: Ah [14:05:31] (03Merged) 10jenkins-bot: JS SDK: Add getExperimentByPrefix() [extensions/TestKitchen] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249243 (https://phabricator.wikimedia.org/T419191) (owner: 10Phuedx) [14:05:33] (03Merged) 10jenkins-bot: ext.wikimediaEvents: pageVisit -> loggedOutReaderRetention [extensions/WikimediaEvents] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249242 (https://phabricator.wikimedia.org/T419191) (owner: 10Phuedx) [14:05:53] No problem, take as much time as your patches need :) [14:05:54] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1249243|JS SDK: Add getExperimentByPrefix() (T419191)]], [[gerrit:1249242|ext.wikimediaEvents: pageVisit -> loggedOutReaderRetention (T419191)]] [14:06:01] T419191: Start new A/A test for reader retention for March 2026 - https://phabricator.wikimedia.org/T419191 [14:06:18] (03PS1) 10Ottomata: flink - bump to 1.20.3 to pick up fix for FLINK-36457 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249312 (https://phabricator.wikimedia.org/T408918) [14:07:46] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1249243|JS SDK: Add getExperimentByPrefix() (T419191)]], [[gerrit:1249242|ext.wikimediaEvents: pageVisit -> loggedOutReaderRetention (T419191)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:08:00] (03PS3) 10Vgutierrez: varnish: Set custom glb_requests_limit for thumbs [puppet] - 10https://gerrit.wikimedia.org/r/1248819 [14:08:05] (03PS4) 10Jasmine: install_server: use UEFI for new control plane nodes wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861) [14:10:36] (03PS1) 10Jelto: gerrit: remove bad_browser from apache config [puppet] - 10https://gerrit.wikimedia.org/r/1249314 (https://phabricator.wikimedia.org/T417263) [14:10:42] (03CR) 10Fabfur: [C:03+1] varnish: Set custom glb_requests_limit for thumbs [puppet] - 10https://gerrit.wikimedia.org/r/1248819 (owner: 10Vgutierrez) [14:11:27] (03PS2) 10Ottomata: flink - bump to 1.20.3 to pick up fix for FLINK-36457 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249312 (https://phabricator.wikimedia.org/T408918) [14:11:32] I've confirmed that the new TestKitchen method is available and that the instrument in the WikimediaEvents extension is loading and using it [14:11:34] Continuing [14:11:43] !log phuedx@deploy2002 phuedx: Continuing with sync [14:12:03] (03PS2) 10Arnaudb: mailman: prepare mailman-web behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1249299 (https://phabricator.wikimedia.org/T286066) [14:12:03] (03CR) 10Arnaudb: "that first patch will decouple the fqdn lists.wm.o from the MX record used by mailman." [dns] - 10https://gerrit.wikimedia.org/r/1249299 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [14:13:23] (03CR) 10Arnaudb: [C:03+1] "lgtm, thanks for the cleanup! 
🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1249314 (https://phabricator.wikimedia.org/T417263) (owner: 10Jelto) [14:14:11] jouncebot nowandnext [14:14:11] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [14:14:11] In 0 hour(s) and 15 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1430) [14:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:14:54] (03PS2) 10Arnaudb: mailman: move mailman-web behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1249310 (https://phabricator.wikimedia.org/T286066) [14:14:54] (03CR) 10Arnaudb: "that follow up change will effectively move lists behind CDN. Lets wait a bit between merges, so we're able to handle any eventual issues " [dns] - 10https://gerrit.wikimedia.org/r/1249310 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [14:15:17] Msz2001: I think we've got space for yours. There's a new experiment starting in 15 but I can't see there being overlap between your change and that [14:15:34] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249243|JS SDK: Add getExperimentByPrefix() (T419191)]], [[gerrit:1249242|ext.wikimediaEvents: pageVisit -> loggedOutReaderRetention (T419191)]] (duration: 09m 39s) [14:15:37] T419191: Start new A/A test for reader retention for March 2026 - https://phabricator.wikimedia.org/T419191 [14:15:38] Okay, I can deploy then [14:15:51] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage [14:16:00] Msz2001: The logs look fine. 
Over to you [14:16:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249291 (https://phabricator.wikimedia.org/T419111) (owner: 10Mszwarc) [14:16:25] (03CR) 10AOkoth: [C:03+2] catalog: add wmf-navigator to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1247608 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [14:16:27] (03CR) 10JMeybohm: [C:03+1] install_server: use UEFI for new control plane nodes wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [14:16:48] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11687994 (10ABran-WMF) The move will be done in 2 steps: 1. update the MX record to use lists1004.wm.o as MX 2. update lists.wm.o to u... 
[14:18:36] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11688006 (10ABran-WMF) [14:20:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage [14:21:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:22:47] !log fceratto@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis urwikisource in section s5 [14:22:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:24:18] (03PS1) 10Jasmine: wikikube: add wikikube-ctrl2006 [puppet] - 10https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596) [14:25:20] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis urwikisource in section s5 [14:25:51] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11688018 (10VRiley-WMF) Hey @Dzahn for some reason the server had reverted back with it's IP address. However, I hav... 
[14:28:08] (03CR) 10Btullis: [C:03+1] deployment_server: provision the airflow-fr-tech kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1249207 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [14:28:11] (03CR) 10Herron: [C:03+1] "I'm cool with either approach, either straight to task now or revisit it later. Whichever you prefer" [alerts] - 10https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [14:28:15] (03PS1) 10Jelto: aptrepo: Update GitLab 3F01618A51312F3F gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1249323 [14:28:54] (03PS1) 10Blake: switchdc: update set-readonly comment [cookbooks] - 10https://gerrit.wikimedia.org/r/1249322 (https://phabricator.wikimedia.org/T418133) [14:29:18] (03Merged) 10jenkins-bot: Hide 2fa-warning Echo category from preferences [extensions/WikimediaMessages] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1249291 (https://phabricator.wikimedia.org/T419111) (owner: 10Mszwarc) [14:29:37] (03CR) 10Btullis: "I'd say let's go with adding him to `analytics-privatedata-users` (as you currently are) but just explain this in the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [14:29:40] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1249291|Hide 2fa-warning Echo category from preferences (T419111)]] [14:29:43] T419111: Send Echo notification to 2FA-less users who are required to have 2FA - https://phabricator.wikimedia.org/T419111 [14:30:04] !log fceratto@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis urwikisource in section s5 [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1430) [14:30:32] (03CR) 10Herron: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1249196 (owner: 10Muehlenhoff) [14:31:23] (03PS7) 10Brouberol: data: add dwisehaupt to the analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) [14:31:23] (03PS7) 10Brouberol: idp: add the airflow_fr_tech service [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) [14:31:26] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1249291|Hide 2fa-warning Echo category from preferences (T419111)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:31:52] !log mszwarc@deploy2002 mszwarc: Continuing with sync [14:32:43] (03CR) 10Clément Goubert: [C:03+2] Add new rdb101[56] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247977 (https://phabricator.wikimedia.org/T418916) (owner: 10Clément Goubert) [14:32:51] (03CR) 10Clément Goubert: [C:03+2] Add new wikikube-worker23[57-74] [puppet] - 10https://gerrit.wikimedia.org/r/1247989 (https://phabricator.wikimedia.org/T418925) (owner: 10Clément Goubert) [14:34:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install wikikube-worker23[57-74] - https://phabricator.wikimedia.org/T418925#11688049 (10Clement_Goubert) a:05Clement_Goubert→03None [14:34:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11688053 (10Clement_Goubert) a:05Clement_Goubert→03None [14:34:40] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis urwikisource in section s5 [14:35:22] !log fceratto@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis kaiwiki in section s5 [14:35:47] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249291|Hide 2fa-warning Echo category from preferences (T419111)]] 
(duration: 06m 07s) [14:35:50] T419111: Send Echo notification to 2FA-less users who are required to have 2FA - https://phabricator.wikimedia.org/T419111 [14:36:07] (for the record, finished deploying) [14:36:18] (03CR) 10Clément Goubert: [C:03+1] wikikube: add wikikube-ctrl2006 [puppet] - 10https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596) (owner: 10Jasmine) [14:37:04] (03CR) 10Clément Goubert: [C:03+1] install_server: use UEFI for new control plane nodes wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [14:37:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:37:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:38:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdg) failed on ms-be2064 - https://phabricator.wikimedia.org/T419394#11688066 (10Jhancock.wm) @MatthewVernon disk replaced [14:39:31] fceratto@cumin1003 sanitize-wiki (PID 1900267) is awaiting input [14:39:37] (03CR) 10Vgutierrez: [C:03+2] varnish: Set custom glb_requests_limit for thumbs [puppet] - 10https://gerrit.wikimedia.org/r/1248819 (owner: 10Vgutierrez) [14:42:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: 
cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:44:21] (03CR) 10JMeybohm: [C:03+1] "LGTM, but please make sure to condense the node blocks later on" [puppet] - 10https://gerrit.wikimedia.org/r/1249321 (https://phabricator.wikimedia.org/T406596) (owner: 10Jasmine) [14:44:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:45:12] !log elukey@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. 
[14:47:22] fceratto@cumin1003 sanitize-wiki (PID 1900267) is awaiting input [14:49:47] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2009.codfw.wmnet with OS bullseye [14:49:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:50:12] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2009.codfw.wmnet with OS bookworm [14:50:40] 06SRE, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11688136 (10Clement_Goubert) >>! In T419212#11681013, @MLechvien-WMF wrote: > @JMeybohm as discussed today: > > - Redis hosts upgrade should be... [14:51:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11688137 (10MoritzMuehlenhoff) p:05Triage→03High [14:51:09] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11688138 (10CDanis) >>! In T418377#11678861, @HSwan-WMF wrote: > Hey Chris, > > Perhaps we can talk live about this. I'm concerned about you mentioning that there will be no version of the li... 
[14:52:04] elukey@cumin1003 change-confluent-distro-version (PID 1909431) is awaiting input [14:52:24] (03PS2) 10Ebenezer Rao: fixed typo of the word initial in the test_init.py [software/cumin] - 10https://gerrit.wikimedia.org/r/1224801 (https://phabricator.wikimedia.org/T201491) [14:53:41] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EMcFarland - https://phabricator.wikimedia.org/T419145#11688145 (10EMcFarland-WMF) Hi, I wanted to add the context that I already have deployment access, which was granted to me via this task. https://phabricator.wikimedia.org/T... [14:53:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [14:54:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdg) failed on ms-be2064 - https://phabricator.wikimedia.org/T419394#11688148 (10MatthewVernon) Brilliant, thanks! Replacement is back in service and refilling now. 
[14:57:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:59:29] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11688170 (10MoritzMuehlenhoff) p:05Triage→03High [14:59:35] elukey@cumin1003 change-confluent-distro-version (PID 1909431) is awaiting input [15:03:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-jumbo1001.eqiad.wmnet [15:05:48] (03PS1) 10Gkyziridis: ml-services: Deploy the newest version of rr-multulingual model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249330 (https://phabricator.wikimedia.org/T415892) [15:07:17] (03CR) 10JavierMonton: [V:03+1] flink - bump to 1.20.3 to pick up fix for FLINK-36457 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249312 (https://phabricator.wikimedia.org/T408918) (owner: 10Ottomata) [15:07:31] (03CR) 10Dpogorzelski: [C:03+2] httpbb: fix ores-legacy test [puppet] - 10https://gerrit.wikimedia.org/r/1248491 (owner: 10AikoChou) [15:07:37] (03CR) 10Dpogorzelski: [C:03+2] httpbb: fix rec-api-ng test [puppet] - 10https://gerrit.wikimedia.org/r/1248481 (owner: 10AikoChou) [15:07:51] (03PS1) 10Herron: udp2log: switch to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249332 (https://phabricator.wikimedia.org/T417002) [15:08:24] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage [15:09:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-jumbo1001.eqiad.wmnet [15:10:35] 06SRE, 06Traffic: Image Rate 
Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11688217 (10derenrich) Amazing. Thank you so much. [15:12:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2009.codfw.wmnet with reason: host reimage [15:13:44] 06SRE, 10ServiceOps-Upgrades-Hardware, 06ServiceOps new (Next quarter): Migrate the Serviceops roles away from Bullseye - https://phabricator.wikimedia.org/T419212#11688224 (10JMeybohm) >>! In T419212#11688136, @Clement_Goubert wrote: >>>! In T419212#11681013, @MLechvien-WMF wrote: >> @JMeybohm as discussed... [15:14:26] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249332 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [15:14:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2004.codfw.wmnet [15:15:40] (03PS1) 10Kevin Bazira: ml-services: update embeddings-staging image to one that installs the hipblaslt-dev headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249333 (https://phabricator.wikimedia.org/T418976) [15:17:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:17:48] (03CR) 10Elukey: [C:03+1] cli: update interactive banner for new API [software/cumin] - 10https://gerrit.wikimedia.org/r/1249297 (owner: 10Volans) [15:18:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2004.codfw.wmnet [15:18:55] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:19:10] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/1224801 (https://phabricator.wikimedia.org/T201491) (owner: 10Ebenezer Rao) [15:20:28] (03CR) 10Ozge: [C:03+2] ml-services: update embeddings-staging image to one that installs the hipblaslt-dev headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249333 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [15:21:06] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11688263 (10HSwan-WMF) >>! In T418377#11688138, @CDanis wrote: > Sorry for being unclear -- there's no version of the //bot// ratelimits that could yield an acceptable UX for this. Which is so... [15:21:49] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy the newest version of rr-multulingual model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249330 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:22:22] (03Merged) 10jenkins-bot: ml-services: update embeddings-staging image to one that installs the hipblaslt-dev headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249333 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [15:24:02] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [15:26:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. [15:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1530). 
[15:30:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus4003.ulsfo.wmnet with OS bookworm [15:30:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11688324 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host prometheus4003.ulsfo.wmnet with OS bookworm [15:32:40] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy the newest version of rr-multulingual model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249330 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:34:14] (03CR) 10Btullis: [C:03+1] data: add dwisehaupt to the analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [15:34:46] (03Merged) 10jenkins-bot: ml-services: Deploy the newest version of rr-multulingual model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249330 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:35:49] (03CR) 10Btullis: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [15:36:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1005.wikimedia.org [15:36:50] (03CR) 10Elukey: [C:03+2] cloud: update the PKI's api trusted CA certificate [puppet] - 10https://gerrit.wikimedia.org/r/1249224 (https://phabricator.wikimedia.org/T419099) (owner: 10Elukey) [15:37:18] (03CR) 10Volans: [C:03+2] "Thanks for fix!" [software/cumin] - 10https://gerrit.wikimedia.org/r/1224801 (https://phabricator.wikimedia.org/T201491) (owner: 10Ebenezer Rao) [15:37:40] !log gkyziridis@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:39:25] (03CR) 10Dpogorzelski: [C:03+2] pyrra(ML): fix updated revertrisk metric name [puppet] - 10https://gerrit.wikimedia.org/r/1248818 (https://phabricator.wikimedia.org/T419235) (owner: 10Dpogorzelski) [15:40:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1005.wikimedia.org [15:45:41] FIRING: SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:46:38] FIRING: GnmiTargetDown: cr2-eqdfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:47:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1249212 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [15:48:36] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11688418 (10RobH) I'll work on this now. [15:48:51] (03Merged) 10jenkins-bot: fixed typo of the word initial in the test_init.py [software/cumin] - 10https://gerrit.wikimedia.org/r/1224801 (https://phabricator.wikimedia.org/T201491) (owner: 10Ebenezer Rao) [15:49:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1249323 (owner: 10Jelto) [15:52:17] (03CR) 10Elukey: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1249323 (owner: 10Jelto) [15:52:26] (03CR) 10Volans: [C:03+2] cli: update interactive banner for new API [software/cumin] - 10https://gerrit.wikimedia.org/r/1249297 (owner: 10Volans) [15:53:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus4003.ulsfo.wmnet with reason: host reimage [15:58:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus4003.ulsfo.wmnet with reason: host reimage [16:00:09] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11688523 (10RobH) CS1253254 filed, listed myself, Arzhel, Cathal, and Papaul on the CC list. > Account: WIKIMEDIA > Contact: Robert McMahon Halsell > D... [16:01:38] (03CR) 10Muehlenhoff: [C:03+2] Add netflow4003 [puppet] - 10https://gerrit.wikimedia.org/r/1249264 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [16:05:16] (03Merged) 10jenkins-bot: cli: update interactive banner for new API [software/cumin] - 10https://gerrit.wikimedia.org/r/1249297 (owner: 10Volans) [16:06:44] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11688558 (10Gehel) [16:06:49] (03PS1) 10Elukey: profile::kafka::broker: allow to specify all 7.x confluent distros [puppet] - 10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) [16:07:32] (03PS2) 10Elukey: profile::kafka::broker: allow to specify all 7.x confluent distros [puppet] - 10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) [16:07:46] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - 
https://phabricator.wikimedia.org/T348935#11688568 (10Gehel) This requires SLI/SLO to be defined by DPE SRE for the OpenSearch on... [16:08:55] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:29] (03PS3) 10Elukey: profile::kafka::broker: allow to specify all 7.x confluent distros [puppet] - 10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) [16:09:53] (03CR) 10Slyngshede: [C:03+1] bitu: Fix argument passing to the permission cleaner module [puppet] - 10https://gerrit.wikimedia.org/r/1249261 (https://phabricator.wikimedia.org/T416152) (owner: 10Muehlenhoff) [16:10:00] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [16:11:47] (03CR) 10Muehlenhoff: [C:03+2] bitu: Fix argument passing to the permission cleaner module [puppet] - 10https://gerrit.wikimedia.org/r/1249261 (https://phabricator.wikimedia.org/T416152) (owner: 10Muehlenhoff) [16:13:19] (03CR) 10Elukey: profile::kafka::broker: allow to specify all 7.x confluent distros (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [16:13:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:14:10] (03PS4) 10Elukey: profile::kafka::broker: allow to specify all 7.x confluent distros [puppet] - 
10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) [16:14:56] (03PS5) 10Elukey: profile::kafka::broker: allow to specify all 7.x confluent distros [puppet] - 10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) [16:15:17] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249344 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [16:15:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:16:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus4003.ulsfo.wmnet with OS bookworm [16:16:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11688618 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host prometheus4003.ulsfo.wmnet with OS bookworm completed: - pro... [16:17:03] (03PS1) 10Kevin Bazira: ml-services: update embeddings-staging image to one that installs the hipsolver-dev headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249345 (https://phabricator.wikimedia.org/T418976) [16:17:10] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, and 2 others: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11688621 (10elukey) 05Open→03Resolved a:03elukey So far it seems that the new settings work nicely, so I am inclined to close this task as finally... 
[16:17:57] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 2 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11688626 (10elukey) 05Open→03Resolved a:03elukey The new settings seems working really nicely eve... [16:20:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:22:59] (03CR) 10Ozge: [C:03+2] ml-services: update embeddings-staging image to one that installs the hipsolver-dev headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249345 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [16:23:20] (03PS1) 10Muehlenhoff: Switch prometheus4003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1249347 (https://phabricator.wikimedia.org/T419430) [16:23:23] (03PS1) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1249348 (https://phabricator.wikimedia.org/T419265) [16:23:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:25:03] (03Merged) 10jenkins-bot: ml-services: update embeddings-staging image to one that installs the hipsolver-dev headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249345 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin 
Bazira) [16:25:04] (03PS1) 10Tiziano Fogli: alertmanager/o11y: add route to handle alerts with severity=task [puppet] - 10https://gerrit.wikimedia.org/r/1249349 (https://phabricator.wikimedia.org/T415317) [16:26:15] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [16:26:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow4003.ulsfo.wmnet [16:26:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [16:27:33] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1249229 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [16:30:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow4003.ulsfo.wmnet - jmm@cumin2002" [16:30:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow4003.ulsfo.wmnet - jmm@cumin2002" [16:30:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:30:22] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow4003.ulsfo.wmnet on all recursors [16:30:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow4003.ulsfo.wmnet on all recursors [16:30:55] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow4003.ulsfo.wmnet - jmm@cumin2002" [16:31:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow4003.ulsfo.wmnet - jmm@cumin2002" [16:31:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host netflow4003.ulsfo.wmnet with OS bookworm [16:33:55] FIRING: [2x] JobUnavailable: Reduced 
availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:37:00] !log installing gnupg security updates [16:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:38:55] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:39:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:41:59] (03CR) 10Gergő Tisza: rest-gateway: add CORS support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [16:44:03] jouncebot: 
nowandnext [16:44:03] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [16:44:03] In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1700) [16:44:03] In 0 hour(s) and 15 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1700) [16:44:57] (03CR) 10Tiziano Fogli: "This is a preliminary patch to allow sending alerts with priority task to o11y." [puppet] - 10https://gerrit.wikimedia.org/r/1249349 (https://phabricator.wikimedia.org/T415317) (owner: 10Tiziano Fogli) [16:48:14] Deploying a private code change [16:49:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:50:15] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11688894 (10MoritzMuehlenhoff) [16:53:22] (03PS3) 10Tiziano Fogli: prometheus: add cardinality explosion alerts [alerts] - 10https://gerrit.wikimedia.org/r/1248866 (https://phabricator.wikimedia.org/T415317) [16:53:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow4003.ulsfo.wmnet with reason: host reimage [16:55:51] (03CR) 10Ottomata: "Local build works fine!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249312 (https://phabricator.wikimedia.org/T408918) (owner: 10Ottomata) [16:56:53] (03CR) 10Btullis: [C:03+1] flink - bump to 1.20.3 to pick up fix for FLINK-36457 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249312 (https://phabricator.wikimedia.org/T408918) (owner: 10Ottomata) [16:58:09] (03CR) 10Btullis: [C:03+2] flink - bump to 1.20.3 to pick up fix for FLINK-36457 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249312 (https://phabricator.wikimedia.org/T408918) (owner: 10Ottomata) [16:58:13] (03CR) 10Btullis: [V:03+2 C:03+2] flink - bump to 1.20.3 to pick up fix for FLINK-36457 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249312 (https://phabricator.wikimedia.org/T408918) (owner: 10Ottomata) [16:59:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow4003.ulsfo.wmnet with reason: host reimage [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1700) [17:00:05] herron and swfrench-wmf: A patch you scheduled for MediaWiki infrastructure (UTC late) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:05] ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1700). 
nyaa~ [17:00:21] o/ [17:00:25] hey [17:00:43] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis kaiwiki in section s5 [17:01:06] herron: mine is going to require a bit of testing, so please go ahead with yours if it's straightforward [17:01:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [17:03:24] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint1003.wikimedia.org with OS trixie [17:03:30] swfrench-wmf: no worries, I haven't deployed in a few and need to brush up anyway [17:03:38] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11688966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint... [17:06:01] (Finished the private code deploy a few mins ago for people reading back) [17:06:25] (03CR) 10Volans: [C:03+1] "Formally LGTM, nit in the commit message. I'll leave to the ones working on the related tasks to validate the list itself." [puppet] - 10https://gerrit.wikimedia.org/r/1249348 (https://phabricator.wikimedia.org/T419265) (owner: 10SBassett) [17:07:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [17:07:42] (03CR) 10Dzahn: [C:03+1] mailman: prepare mailman-web behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1249299 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [17:08:00] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:09:03] herron: apologies, does that mean I should go ahead? 
also, let me know if you need help [17:09:14] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:10:08] swfrench-wmf: thanks just need to get a config change out [17:10:45] (03CR) 10Dzahn: [C:03+1] mailman: move mailman-web behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1249310 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [17:11:15] (03CR) 10Dzahn: [C:03+1] gerrit: remove bad_browser from apache config [puppet] - 10https://gerrit.wikimedia.org/r/1249314 (https://phabricator.wikimedia.org/T417263) (owner: 10Jelto) [17:13:16] (03CR) 10Herron: [C:03+2] udp2log: switch to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249332 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [17:13:48] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:14:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:14:08] (03Merged) 10jenkins-bot: udp2log: switch to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249332 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [17:15:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow4003.ulsfo.wmnet with OS bookworm [17:15:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow4003.ulsfo.wmnet [17:15:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:16:29] (03PS15) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [17:18:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdg) failed on ms-be2064 - https://phabricator.wikimedia.org/T419394#11689022 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:19:12] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:23:35] !log herron@deploy2002 Started scap sync-world: Backport for [[gerrit:1249332|udp2log: switch to new hosts (T417002)]] [17:23:38] T417002: Upgrade Observability mwlog hosts to trixie - https://phabricator.wikimedia.org/T417002 [17:25:23] !log herron@deploy2002 herron: Backport for [[gerrit:1249332|udp2log: switch to new hosts (T417002)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:25:27] PROBLEM - SSH on dse-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:25:45] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:28:06] FIRING: [3x] ProbeDown: Service dse-k8s-ctrl1001:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:29:33] FIRING: KubernetesAPINotScrapable: k8s-dse@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [17:29:33] FIRING: [2x] KubernetesCalicoDown: dse-k8s-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:30:32] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [17:30:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:31:33] PROBLEM - SSH on dse-k8s-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:31:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:31:55] (03CR) 10Daniel Kinzler: [C:04-1] rest-gateway: add CORS support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [17:32:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:33:06] FIRING: [4x] ProbeDown: Service dse-k8s-ctrl1001:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:23] RECOVERY - SSH on dse-k8s-ctrl1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:33:25] RECOVERY - SSH on dse-k8s-ctrl1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:34:29] !log contint1003.mgmt - racadm serveraction powercycle T418544 - not reacting [17:34:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:33] T418544: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544 [17:34:33] RESOLVED: KubernetesAPINotScrapable: k8s-dse@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [17:34:33] FIRING: [2x] KubernetesCalicoDown: dse-k8s-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:37:40] RESOLVED: [6x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:38:06] RESOLVED: [4x] ProbeDown: Service dse-k8s-ctrl1001:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:11] !log contint1003 - unable to get uptime Caused by: Cumin execution failed (exit_code=2) [101/240] - attempted manual powercycle - Initializing Firmware Interfaces... blank screen T418544 [17:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:10] (03PS1) 10AKhatun: stream: mediawiki.page_edit_type_simple [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) [17:39:33] FIRING: [14x] KubernetesCalicoDown: dse-k8s-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:39:54] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:40:35] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:41:16] FIRING: [2x] SLOMetricAbsent: xlab-combined-latency-success-v1 - https://slo.wikimedia.org/?search=xlab-combined-latency-success-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [17:42:24] !log herron@deploy2002 Sync cancelled. [17:42:42] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:44:33] FIRING: [18x] KubernetesCalicoDown: dse-k8s-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:46:09] (03PS1) 10Herron: Revert "udp2log: switch to new hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249361 [17:47:17] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:47:54] (03PS1) 10Herron: networkpolicy: allow udp 8420 towards new mwlog hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249362 (https://phabricator.wikimedia.org/T417002) [17:48:02] (03CR) 10Herron: [C:03+2] Revert "udp2log: switch to new hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249361 (owner: 10Herron) [17:48:53] (03Merged) 10jenkins-bot: Revert "udp2log: switch to new hosts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249361 (owner: 10Herron) [17:49:53] (03PS1) 10Aaron Schulz: Remove redundant math spec file from wwwportal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249363 (https://phabricator.wikimedia.org/T418188) [17:51:49] (03CR) 10Scott French: [C:03+1] networkpolicy: allow udp 8420 towards new mwlog hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249362 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [17:51:50] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: 
eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11689200 (10Dzahn) Hi @VRiley-WMF Thanks! Now I can reach the DRAC mgmt console. I started the reimage cookbook aga... [17:54:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249363 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [17:54:34] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:57:18] jouncebot: notandnext [17:57:23] jouncebot: nowandnext [17:57:23] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T1700) [17:57:23] In 2 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T2000) [18:01:36] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [18:02:16] !log herron@deploy2002 Started scap sync-world: Backport for [[gerrit:1249361|Revert "udp2log: switch to new hosts"]] [18:04:05] !log herron@deploy2002 herron: Backport for [[gerrit:1249361|Revert "udp2log: switch to new hosts"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:05:30] !log herron@deploy2002 Sync cancelled. 
[18:05:33] (03CR) 10Jforrester: "recheck" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244816 (https://phabricator.wikimedia.org/T414621) (owner: 10Jforrester) [18:05:57] (03PS1) 10Jsn.sherman: PersonalDashbaord: enable CTA for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) [18:06:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet [18:06:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) (owner: 10Jsn.sherman) [18:06:54] (03PS1) 10Herron: udp2log: switch to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249365 [18:06:55] (03PS2) 10Jsn.sherman: PersonalDashboard: enable CTA for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) [18:07:16] (03PS2) 10Herron: udp2log: switch to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249365 [18:08:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249364 (https://phabricator.wikimedia.org/T418613) (owner: 10Jsn.sherman) [18:10:50] FYI, the work planned for the infra window is running over today. ETA at least another 30-40m of work is pending. 
[18:11:24] (03CR) 10Scott French: [C:03+2] mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:14:00] (03Merged) 10jenkins-bot: mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248889 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:14:22] (03PS2) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1249348 (https://phabricator.wikimedia.org/T419265) [18:15:29] (03PS3) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1249348 (https://phabricator.wikimedia.org/T419265) [18:16:08] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:16:34] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:17:25] (03CR) 10Jforrester: [C:03+1] Remove `MetricsPlatform` configuration from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) (owner: 10Santiago Faci) [18:18:29] (03PS1) 10AKhatun: stream: mediawiki.page_edit_type_simple.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) [18:18:49] (03PS2) 10Jforrester: plugins/wm-pcc: Switch commands from experimental to new puppet [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1244816 (https://phabricator.wikimedia.org/T414621) [18:18:49] (03PS1) 10Jforrester: build: Upgrade eslint-config-wikimedia from 0.23.0 to 0.32.3 and 
make pass [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1249368 [18:20:01] (03CR) 10CI reject: [V:04-1] stream: mediawiki.page_edit_type_simple.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:20:11] (03PS5) 10Anzx: lift IP cap for womens month editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249316 (https://phabricator.wikimedia.org/T419109) [18:20:11] (03CR) 10CI reject: [V:04-1] build: Upgrade eslint-config-wikimedia from 0.23.0 to 0.32.3 and make pass [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1249368 (owner: 10Jforrester) [18:20:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249316 (https://phabricator.wikimedia.org/T419109) (owner: 10Anzx) [18:21:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249035 (https://phabricator.wikimedia.org/T414237) (owner: 10Anzx) [18:21:57] (03PS1) 10DDesouza: Pre-deploy participant recruitment survey on ptwiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249370 (https://phabricator.wikimedia.org/T419275) [18:23:31] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:23:39] (03PS2) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248148 (https://phabricator.wikimedia.org/T414112) [18:23:41] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host contint1003.wikimedia.org with OS trixie [18:23:51] 10ops-eqiad, 06SRE, 06collaboration-services, 
10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11689371 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint1003... [18:23:55] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:26:17] RESOLVED: [2x] SLOMetricAbsent: xlab-combined-latency-success-v1 - https://slo.wikimedia.org/?search=xlab-combined-latency-success-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [18:27:16] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [18:27:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [18:29:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249370 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [18:29:42] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [18:29:47] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [18:32:03] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:32:25] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:33:07] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [18:33:24] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [18:34:37] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [18:34:49] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [18:35:42] 10ops-codfw, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2332:9290 - https://phabricator.wikimedia.org/T419462 (10phaultfinder) 03NEW [18:37:18] (03PS1) 
10Scott French: Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249373 (https://phabricator.wikimedia.org/T364245) [18:39:02] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [18:39:33] RESOLVED: [18x] KubernetesCalicoDown: dse-k8s-ctrl1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:39:47] (03PS2) 10AKhatun: stream: mediawiki.page_edit_type_simple [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249360 (https://phabricator.wikimedia.org/T351225) [18:40:38] (03CR) 10Scott French: [C:03+2] Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249373 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:41:40] (03CR) 10Ottomata: stream: mediawiki.page_edit_type_simple.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:42:15] (03PS1) 10Mforns: Revert "dse-k8s-services Airflow: Bump image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249376 [18:42:43] (03PS1) 10Btullis: Revert "dse-k8s-services Airflow: Bump image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249377 [18:43:05] (03Merged) 10jenkins-bot: Revert "mw-debug: Pilot new drain configuration" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249373 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [18:43:22] (03CR) 10Btullis: [C:03+2] Revert "dse-k8s-services Airflow: Bump image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249376 (owner: 10Mforns) [18:43:39] (03Abandoned) 10Btullis: Revert "dse-k8s-services Airflow: Bump image" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1249377 (owner: 10Btullis) [18:44:24] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:44:50] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:45:37] herron: all clear on my end if you'd like to move forward with your change. [18:45:40] (03PS1) 10Majavah: P:wmcs: maintain_dbusers: Restart service after configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/1249378 [18:45:48] 10ops-codfw, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2333:9290 - https://phabricator.wikimedia.org/T419465 (10phaultfinder) 03NEW [18:45:53] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [18:46:00] (03Merged) 10jenkins-bot: Revert "dse-k8s-services Airflow: Bump image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249376 (owner: 10Mforns) [18:46:01] swfrench-wmf: thanks! 
ok proceeding [18:46:24] (03CR) 10Herron: [C:03+2] networkpolicy: allow udp 8420 towards new mwlog hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249362 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [18:47:08] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8239/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249378 (owner: 10Majavah) [18:47:33] (03CR) 10CI reject: [V:04-1] P:wmcs: maintain_dbusers: Restart service after configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/1249378 (owner: 10Majavah) [18:48:41] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8240/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249378 (owner: 10Majavah) [18:49:03] (03PS1) 10Majavah: definitions: Add port for x1 on the wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/1249379 (https://phabricator.wikimedia.org/T407485) [18:49:07] (03CR) 10AKhatun: stream: mediawiki.page_edit_type_simple.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:49:12] (03PS2) 10Majavah: P:wmcs: maintain_dbusers: Restart service after configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/1249378 [18:49:18] (03Merged) 10jenkins-bot: networkpolicy: allow udp 8420 towards new mwlog hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249362 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [18:49:51] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [18:50:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [18:51:08] herron: cool, your networkpolicy change is merged. lemme confirm the diff is live. 
[18:51:17] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8241/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249378 (owner: 10Majavah) [18:52:29] (03CR) 10Majavah: [V:03+1] "(In case you're wondering, T407485#11689496 is the current surprise in this area.)" [puppet] - 10https://gerrit.wikimedia.org/r/1249378 (owner: 10Majavah) [18:53:28] herron: you're good to backport your mediawiki-config patch, or we can roll this out first. either should work. [18:53:48] (03CR) 10Ottomata: stream: mediawiki.page_edit_type_simple.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [18:54:12] (03PS7) 10Santiago Faci: Remove `MetricsPlatform` configuration from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) [18:54:18] swfrench-wmf: great, sure I'll start the backport now [18:54:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by herron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249365 (owner: 10Herron) [18:55:12] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp7016.magru.wmnet with OS trixie [18:55:30] (03Merged) 10jenkins-bot: udp2log: switch to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249365 (owner: 10Herron) [18:55:43] (03PS1) 10Lerickson: Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 [18:55:47] !log herron@deploy2002 Started scap sync-world: Backport for [[gerrit:1249365|udp2log: switch to new hosts]] [18:57:10] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7001.magru.wmnet with OS trixie [18:57:26] herron: FYI, while you're getting up through testservers, you might see me run some helmfile applies in the background preemptively. 
these are for some weird edge cases that scap's internal helmfile'ing won't pick up. [18:57:45] !log herron@deploy2002 herron: Backport for [[gerrit:1249365|udp2log: switch to new hosts]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:58:34] swfrench-wmf: alright thanks. testing look good but should I wait for you to finish that before syncing? [18:59:25] herron: great. yeah, maybe give me a second to do these first. ETA 2 minutes. [18:59:44] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [18:59:51] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [18:59:56] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [18:59:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [19:00:06] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [19:00:13] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:01:04] herron: all yours [19:01:05] (03PS2) 10Lerickson: Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 [19:01:12] (03PS3) 10Lerickson: Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) [19:01:14] swfrench-wmf: ok, here goes [19:01:24] !log herron@deploy2002 herron: Continuing with sync [19:01:30] (03CR) 10Lerickson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:01:39] (03CR) 10CDobbins: [C:03+2] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [19:01:53] RECOVERY - Druid historical on 
an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:02:04] (03CR) 10CI reject: [V:04-1] Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:03:28] (03PS1) 10Ebernhardson: semanticsearch: Increase heap by 1gb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249382 (https://phabricator.wikimedia.org/T414623) [19:03:43] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:03:54] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [19:05:25] !log herron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249365|udp2log: switch to new hosts]] (duration: 09m 38s) [19:07:24] thanks for the help swfrench-wmf! [19:08:19] (03PS4) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1249348 (https://phabricator.wikimedia.org/T419265) [19:08:47] herron: no problem! 
thanks for letting me jump in for a bit there [19:09:17] (03CR) 10Dzahn: "the reason you got the CI downvote is just because you have a new line in the commit message after the "Bug:" line" [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:09:48] (03Abandoned) 10Dzahn: cumin: add alias variant for tcp-proxy vs tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1238798 (owner: 10Dzahn) [19:11:20] (03PS4) 10Lerickson: Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) [19:12:05] (03PS5) 10Lerickson: Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) [19:12:06] (03CR) 10CI reject: [V:04-1] Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:12:20] (03CR) 10Lerickson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:12:52] (03CR) 10CI reject: [V:04-1] Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:14:19] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [19:14:36] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [19:15:43] !log cwhite@deploy2002 Started deploy [performance/arc-lamp@aa8da8b]: Ie7e0355f89294a2927f9dbc28afec3a62d1752de [19:15:51] !log cwhite@deploy2002 Finished deploy [performance/arc-lamp@aa8da8b]: Ie7e0355f89294a2927f9dbc28afec3a62d1752de (duration: 00m 08s) [19:16:37] (03PS6) 10Lerickson: Add lerickson and trueg to analytics-wikidata-users [puppet] - 10https://gerrit.wikimedia.org/r/1249380 
(https://phabricator.wikimedia.org/T418723) [19:16:40] (03CR) 10Lerickson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:18:18] 10SRE-Access-Requests, 07OKR-Work, 13Patch-For-Review, 06Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11689563 (10Dzahn) [19:18:31] (03CR) 10Lerickson: "Thanks! I think I got it into a compliant state now." [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:18:55] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:19:25] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7016.magru.wmnet with reason: host reimage [19:21:19] (03PS1) 10Mforns: Revert^2 "dse-k8s-services Airflow: Bump image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249384 [19:22:14] (03CR) 10Dzahn: "yea, you did. looks good now. it's just that an access request normally should be linked to an access request ticket." 
[puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [19:23:32] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7001.magru.wmnet with reason: host reimage [19:24:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7016.magru.wmnet with reason: host reimage [19:24:39] (03CR) 10Btullis: [C:03+2] Revert^2 "dse-k8s-services Airflow: Bump image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249384 (owner: 10Mforns) [19:27:06] (03Merged) 10jenkins-bot: Revert^2 "dse-k8s-services Airflow: Bump image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249384 (owner: 10Mforns) [19:27:16] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11689626 (10RobH) Ok, they swapped the optic in cr2-magru but still shows down: et-0/0/1 up down Core: asw1-b3-magru:et-0/0/50 {#70130} The o... 
[19:28:35] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7001.magru.wmnet with reason: host reimage [19:28:44] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [19:29:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [19:35:41] FIRING: [2x] SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:51] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11689696 (10RobH) > Support, > > Thank you, we can see the old module QSFP-100GBASE-SR4 SN GT3AAG00321 was removed and replaced with QSFP-100GBASE-SR4 mo... [19:36:00] (03PS1) 10CDobbins: prometheus: fix pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1249385 (https://phabricator.wikimedia.org/T406641) [19:36:44] (03CR) 10CI reject: [V:04-1] prometheus: fix pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1249385 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [19:36:48] (03CR) 10BCornwall: [C:03+1] prometheus: fix pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1249385 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [19:37:16] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8242/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249385 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [19:37:38] (03PS2) 10CDobbins: prometheus: fix pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1249385 (https://phabricator.wikimedia.org/T406641) [19:38:39] (03CR) 
10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8243/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249385 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [19:40:02] (03CR) 10CDobbins: [V:03+1 C:03+2] prometheus: fix pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1249385 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [19:40:15] jouncebot: nowandnext [19:40:16] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [19:40:16] In 0 hour(s) and 19 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T2000) [19:41:40] (03CR) 10Zabe: [C:03+2] Stop writing to il_to on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248911 (https://phabricator.wikimedia.org/T415787) (owner: 10Zabe) [19:42:33] (03Merged) 10jenkins-bot: Stop writing to il_to on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248911 (https://phabricator.wikimedia.org/T415787) (owner: 10Zabe) [19:43:02] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248911|Stop writing to il_to on commonswiki (T415787)]] [19:43:06] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787 [19:44:50] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248911|Stop writing to il_to on commonswiki (T415787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:45:08] !log zabe@deploy2002 zabe: Continuing with sync [19:46:36] PROBLEM - SSH on an-druid1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:46:38] FIRING: GnmiTargetDown: cr2-eqdfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [19:46:50] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:49:05] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248911|Stop writing to il_to on commonswiki (T415787)]] (duration: 06m 04s) [19:49:58] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [19:50:26] RECOVERY - SSH on an-druid1006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:50:41] FIRING: [2x] SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:51:12] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7016.magru.wmnet with OS trixie [19:51:50] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:54:50] !log 
cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7001.magru.wmnet with OS trixie [19:57:58] (03PS2) 10AKhatun: stream: mediawiki.page_edit_type_simple.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T2000). [20:00:05] tgr, AaronSchulz, anzx, and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] o/ [20:00:16] I can self-deploy if needed [20:00:27] * AaronSchulz is here [20:01:11] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7016.* [20:01:36] (03CR) 10AKhatun: stream: mediawiki.page_edit_type_simple.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249367 (https://phabricator.wikimedia.org/T351225) (owner: 10AKhatun) [20:03:02] o/ [20:03:50] I'll do mine [20:05:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249363 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [20:05:58] (03Merged) 10jenkins-bot: Remove redundant math spec file from wwwportal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249363 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [20:06:19] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1249363|Remove redundant math spec file from wwwportal (T418188)]] [20:06:22] T418188: Simplify static Restbase json spec file configuration - https://phabricator.wikimedia.org/T418188 [20:07:15] danisztls: do you need to test those changes? 
if not, I'll bundle them with some others [20:08:11] !log aaron@deploy2002 aaron: Backport for [[gerrit:1249363|Remove redundant math spec file from wwwportal (T418188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:08:58] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [20:09:22] !log aaron@deploy2002 aaron: Continuing with sync [20:11:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) (owner: 10C. Scott Ananian) [20:11:59] tgr_: I need to visually test but it's quick to do. I can deploy them after if you prefer. [20:12:01] o/ [20:12:21] i'm late to the party, but if anyone making a config patch could throw https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1247119 in, i'd appreciate it [20:13:15] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249363|Remove redundant math spec file from wwwportal (T418188)]] (duration: 06m 56s) [20:13:19] T418188: Simplify static Restbase json spec file configuration - https://phabricator.wikimedia.org/T418188 [20:13:58] (it looks like everyone has config changes today?) [20:14:03] can do [20:14:11] done [20:14:17] tgr_: thanks! 
i'll be here to test, although there's not a lot to test [20:14:26] 06SRE, 10SRE-Access-Requests, 07OKR-Work, 13Patch-For-Review, 06Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11689899 (10lerickson) [20:15:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:15:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249316 (https://phabricator.wikimedia.org/T419109) (owner: 10Anzx) [20:15:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) (owner: 10C. Scott Ananian) [20:15:12] (03CR) 10Lerickson: "Hey, thanks for updating the ticket to include the access request. I'm new here (clearly) and didn't realize. I appreciate the help!" [puppet] - 10https://gerrit.wikimedia.org/r/1249380 (https://phabricator.wikimedia.org/T418723) (owner: 10Lerickson) [20:15:58] (03CR) 10CI reject: [V:04-1] Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:16:24] (03Merged) 10jenkins-bot: lift IP cap for womens month editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249316 (https://phabricator.wikimedia.org/T419109) (owner: 10Anzx) [20:16:30] (03Merged) 10jenkins-bot: Enable parser survey for opted-out users on German/French/Polish wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) (owner: 10C. 
Scott Ananian) [20:17:52] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] [20:17:59] T414852: Run a survey to understand why existing logged in users might be opting out of Parsoid - https://phabricator.wikimedia.org/T414852 [20:17:59] T419109: Requesting temporary lift of IP cap for 14 March 2026 - https://phabricator.wikimedia.org/T419109 [20:19:46] !log tgr@deploy2002 cscott, tgr, anzx: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:19:58] ok, taking a look [20:27:24] looks good, ok to continue [20:27:32] !log tgr@deploy2002 cscott, tgr, anzx: Continuing with sync [20:29:29] (03CR) 10Gergő Tisza: [C:03+2] Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:30:32] (03CR) 10CI reject: [V:04-1] Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:30:47] bah [20:30:52] filed as T419476 [20:30:52] T419476: Bogus PhanPluginDuplicateArrayKey error in MediaWiki-config - https://phabricator.wikimedia.org/T419476 [20:31:28] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] (duration: 13m 36s) [20:31:34] T414852: Run a survey to understand why existing logged in users might be opting out of Parsoid - https://phabricator.wikimedia.org/T414852 [20:31:34] T419109: 
Requesting temporary lift of IP cap for 14 March 2026 - https://phabricator.wikimedia.org/T419109 [20:31:47] I'll deploy a private change [20:32:04] (03PS2) 10Pppery: Ncredir cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1249390 [20:32:10] (03PS3) 10Pppery: Ncredir cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1249390 [20:33:00] !log tgr@deploy2002 Locking from deployment [MediaWiki]: working on private change [20:34:14] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:34:59] (03PS3) 10Jforrester: Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:34:59] (03PS1) 10Jforrester: build: Upgrade mediawiki-phan-config from 0.18.0 to 0.19.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249393 (https://phabricator.wikimedia.org/T419476) [20:36:01] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS trixie [20:38:41] (03PS2) 10Jforrester: build: Upgrade mediawiki-phan-config from 0.18.0 to 0.20.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249393 (https://phabricator.wikimedia.org/T419476) [20:38:41] (03PS4) 10Jforrester: Migrate EmailAuth, step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:38:41] (03PS1) 10Jforrester: build: Upgrade mediawiki-codesniffer from 49.0.0 to 50.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249394 [20:38:43] (03CR) 10Andrew Bogott: [C:03+2] site: Use nftables insetup role for cloudgw2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/1248004 (https://phabricator.wikimedia.org/T418765) (owner: 10Majavah) [20:40:38] (03PS1) 10Jforrester: build: Upgrade 
symfony/yaml from 7.4.0 to 7.4.6 and alpha-sort [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249395 [20:40:44] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:40:44] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:45] tgr_: are you still deploying a private change? [20:43:10] !log tgr@deploy2002 Unlocked for deployment [MediaWiki]: working on private change (duration: 10m 10s) [20:43:22] yeah [20:43:38] sorry, should have read the manual first [20:43:58] tgr_: no problem, can you ping me after you finish? [20:44:03] haven't done this since we moved to k8s [20:44:05] sure [20:44:08] thanks! [20:44:25] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2004-dev.codfw.wmnet with OS trixie [20:45:08] tgr_: I believe the phan glitch you encountered isn't a real report. My patches should change things anyway, but as you can see they pass CI. [20:46:30] I imagine I triggered it because the patch touched the line immediately next to it? [20:46:38] the other patches merged fine [20:47:12] tgr_: Given that it passes CI, I imagine it's a cosmic ray / rebase issue. 
[20:47:17] (03CR) 10Daimona Eaytoy: [C:03+1] build: Upgrade mediawiki-phan-config from 0.18.0 to 0.20.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249393 (https://phabricator.wikimedia.org/T419476) (owner: 10Jforrester) [20:47:38] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:48:38] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:49:14] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:49:17] rebase issue sounds about right [20:49:51] maybe phan runs on a merge conflicted version, somehow? [20:51:49] There's no magic here. It runs what the state of the CI repo is, which is made by zuul inside CI if your patch isn't already on top of main. 
[20:52:40] ugh, can't test the patch due to the account creation limit and resetAuthenticationThrottle.php seems broken [20:53:07] time to go on an adventure with resi-proxies I suppose [20:54:30] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:54:40] !log [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'A:wdqs-main AND P{wdqs2*}' 'systemctl restart wdqs-blazegraph'` [20:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:44] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:56:12] !log [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'A:wdqs-main AND P{wdqs1*}' 'systemctl restart wdqs-blazegraph'` [20:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:30] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:57:00] !log [WDQS] Auto-remediation would have eventually restarted these, but some of them were staying below our current threshold of `threads > 1200`.
May want to lower threshold, or examine an additional metric-type to look at in the future [20:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:44] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:59:30] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T2100). [21:00:30] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:00:34] we are running over with the backport window [21:00:38] !log cdobbins@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: Trixie reimaging] [21:01:01] tgr_: no worries just let us know when you are done [21:01:01] !log [WDQS] Alright, these are re-entering a failed state soon enough that we will need to identify the offender if we want to restore proper service. 
We could put some temporary hack to restart every few minutes so we at least maintain some uptime, but root cause is the usual 'we need a requestctl rule to block whoever's killing us' scenario [21:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:03] (I should have probably used the security window for the private deploy, sorry) [21:01:17] we do have a few security patches to get out today [21:01:49] !log removed private code for T397244 [21:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:52] T397244: Private mitigation blocks registration from certain email domains but gives misleading error about rate limits - https://phabricator.wikimedia.org/T397244 [21:01:56] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [21:02:06] maryum: is there time for one more config change? [21:02:18] tgr_: go ahead [21:02:22] danisztls: ^ [21:02:26] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [21:03:10] I'll mark the other two as not done, one is in CI hell and one needs testing and anzx doesn't seem to be around [21:03:44] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:04:30] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:05:42] !log 
cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [21:05:45] is there still time? [21:06:17] (03PS1) 10Dzahn: microsites: add monitoring for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240417 (https://phabricator.wikimedia.org/T414098) [21:07:01] (03CR) 10Scott French: [C:03+1] "+1 from the standpoint of this being valid VCL." [puppet] - 10https://gerrit.wikimedia.org/r/1249348 (https://phabricator.wikimedia.org/T419265) (owner: 10SBassett) [21:07:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249370 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [21:08:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [21:09:13] (03Merged) 10jenkins-bot: Pre-deploy participant recruitment survey on ptwiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1249370 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [21:09:14] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:09:17] we would like to run scap now [21:09:24] is anyone deploying anything? 
[21:09:32] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] [21:09:36] T419275: Deploy QuickSurvey for research participant registration drive on trwiki & ptwiki - https://phabricator.wikimedia.org/T419275 [21:09:37] yeah, sorry, this is the last one [21:09:49] tgr_: okay [21:10:02] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:11:02] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:11:28] !log dani@deploy2002 dani: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:12:48] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7002 is CRITICAL: connect to address 10.140.1.4 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:12:48] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [21:12:48] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:12:48] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:13:21] read that as "PayPal" backends and was very confused about since when did we have that as a service [21:13:46] !log dani@deploy2002 dani: Continuing with sync [21:15:27] (03PS1) 10Ryan Kemper: wdqs: Tighten deadlock remediation thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1249405 (https://phabricator.wikimedia.org/T242453) [21:16:02] (03CR) 10RLazarus: [C:03+2] "Merging to unblock a vopsbot upgrade, thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/1249323 (owner: 10Jelto) [21:16:46] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:16:47] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:17:48] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] (duration: 08m 15s) [21:17:51] T419275: Deploy QuickSurvey for research participant registration drive on trwiki & ptwiki - https://phabricator.wikimedia.org/T419275 [21:18:21] tgr_: thanks! [21:18:31] maryum: sorry, just finished [21:18:57] danisztls: thank you! 
[21:19:03] (03CR) 10Ryan Kemper: "Active incident, low blast radius, shotgun merging." [puppet] - 10https://gerrit.wikimedia.org/r/1249405 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [21:19:08] (03CR) 10Ryan Kemper: [C:03+2] wdqs: Tighten deadlock remediation thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1249405 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [21:19:23] andrewbogott: your puppet-merge isn't stuck by any chance, is it? :) [21:19:26] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7002 is OK: HTTP OK: HTTP/1.1 200 OK - 47771 bytes in 0.447 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:19:48] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:19:48] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:20:06] ryankemper: I'll want to eyeball the diff on my puppet-merge to make sure the key is correct, happy to merge yours along with it [21:20:22] rzl: yes, please do. 
[21:20:28] PROBLEM - haproxy process on cp7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [21:20:28] PROBLEM - statsv Varnishkafka log producer on cp7002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [21:20:28] just waiting for the lock to be released [21:20:54] dang, yes it is sorry [21:21:48] which renders useless the reimage I have in progress :( [21:21:59] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudgw2004-dev.codfw.wmnet with OS trixie [21:22:24] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2004-dev.codfw.wmnet with OS trixie [21:22:41] puppetserver is all yours [21:22:47] rzl ^ [21:22:51] thanks :) [21:23:31] ryankemper: merge complete [21:23:40] wonderful :) [21:23:45] (03CR) 10Umherirrender: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235552 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [21:23:55] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:24:22] LOL. service is in such bad shape it's not writing a metric it can use for slo. 
(that's my theory, could be wrong) [21:24:22] (03PS1) 10CDobbins: prometheus: fix pooled host check (again) [puppet] - 10https://gerrit.wikimedia.org/r/1249407 [21:24:48] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:24:48] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:25:50] PROBLEM - Host cp7002 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:59] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8244/console" [puppet] - 10https://gerrit.wikimedia.org/r/1249407 (owner: 10CDobbins) [21:25:59] Hey swfrench-wmf - we’re currently deploying a security patch, but I think it’s probably ok to do the CSP updates in tandem? if they won’t conflict with scap? I’m on cp1112 btw. [21:27:57] sbassett: yes, that should be fine in terms of scap. I'm just pinging traffic real quick for an ack about touching the cp hosts, since there's potentially some work ongoig. 
[21:27:59] *ongoing [21:28:07] also thanks for the host :) [21:28:55] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:29:00] RECOVERY - Host cp7002 is UP: PING OK - Packet loss = 0%, RTA = 110.60 ms [21:29:02] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [21:29:12] PROBLEM - haproxy process on cp7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [21:29:12] PROBLEM - statsv Varnishkafka log producer on cp7002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [21:29:44] !log Deployed security fix for T419186 [21:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:03] swfrench-wmf: sounds good! 
[21:31:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:31:58] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [21:31:58] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [21:33:08] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp7002 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 64 days) https://wikitech.wikimedia.org/wiki/HTTPS [21:33:12] RECOVERY - haproxy process on cp7002 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [21:33:58] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp7002 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 28 days) https://wikitech.wikimedia.org/wiki/HTTPS [21:33:58] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp7002 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-04-18 10:56:56 +0000 (expires in 39 days) https://wikitech.wikimedia.org/wiki/HTTPS [21:34:12] RECOVERY - statsv Varnishkafka log producer on cp7002 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [21:34:53] 
!log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7002.magru.wmnet with OS trixie [21:37:22] !log cdobbins@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7002.magru.wmnet [21:38:14] sbassett: I think we're good to go [21:40:11] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [21:43:40] (03CR) 10Scott French: [C:03+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1249348 (https://phabricator.wikimedia.org/T419265) (owner: 10SBassett) [21:44:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [21:46:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:47:01] (03PS1) 10JHathaway: mailman: send web posting through Spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1249415 (https://phabricator.wikimedia.org/T386559) [21:47:21] PROBLEM - Host dse-k8s-worker1028 is DOWN: PING CRITICAL - Packet loss = 100% [21:47:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:49:33] FIRING: KubernetesCalicoDown: dse-k8s-worker1028.eqiad.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1028.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:49:44] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11690217 (10jhathaway) Ok we have two options now: # disable: [mailman: disable... [21:49:46] need to run scap again for the security deploy [21:49:57] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249415 (https://phabricator.wikimedia.org/T386559) (owner: 10JHathaway) [21:52:42] sbassett: alright, changes are live on cp1112 if you'd like to verify they look like what you expect [21:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:53:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:57:30] sbassett: if you could please confirm you'd like to proceed, that would be appreciated [21:58:55] sure, one second [21:58:57] (03PS2) 10JHathaway: mailman: send web posting through Spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/1249415 
(https://phabricator.wikimedia.org/T386559) [21:59:10] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1249415 (https://phabricator.wikimedia.org/T386559) (owner: 10JHathaway) [21:59:31] swfrench-wmf: lgtm, thanks [21:59:54] sbassett: great, thanks! this should be everywhere in ~ 15-20m [22:01:19] (03PS1) 10Majavah: extdist: Set up a logrotate rule [puppet] - 10https://gerrit.wikimedia.org/r/1249417 (https://phabricator.wikimedia.org/T253588) [22:02:19] (03CR) 10Reedy: [C:03+1] extdist: Set up a logrotate rule [puppet] - 10https://gerrit.wikimedia.org/r/1249417 (https://phabricator.wikimedia.org/T253588) (owner: 10Majavah) [22:02:21] swfrench-wmf: thanks! [22:02:33] !log Redeployed security fix for T419186 [22:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2004-dev.codfw.wmnet with OS trixie [22:03:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:03:15] (03PS2) 10Majavah: extdist: Set up a logrotate rule [puppet] - 10https://gerrit.wikimedia.org/r/1249417 (https://phabricator.wikimedia.org/T253588) [22:03:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:03:53] (03CR) 10Majavah: [V:03+1] "PCC 
SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8246/co" [puppet] - 10https://gerrit.wikimedia.org/r/1249417 (https://phabricator.wikimedia.org/T253588) (owner: 10Majavah) [22:04:18] (03CR) 10Majavah: [V:03+1 C:03+2] extdist: Set up a logrotate rule [puppet] - 10https://gerrit.wikimedia.org/r/1249417 (https://phabricator.wikimedia.org/T253588) (owner: 10Majavah) [22:08:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:08:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:14:14] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:18:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:22:51] FIRING: SwitchCoreInterfaceDown: Switch core 
interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:27:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:28:08] !log bking@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1002.eqiad.wmnet [22:28:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM dse-k8s-ctrl1002.eqiad.wmnet [22:28:36] !log bking@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1002.eqiad.wmnet [22:28:36] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11690332 (10RobH) They've now replaced the patch cable but we're still seeing down: > Comment generated in Smart Hands: Dear, evening. > > As requested... 
[22:28:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM dse-k8s-ctrl1002.eqiad.wmnet [22:29:02] !log bking@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1002.eqiad.wmnet [22:30:54] !log bking@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dse-k8s-ctrl1002.eqiad.wmnet [22:31:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:32:07] !log bking@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1001.eqiad.wmnet [22:34:24] !log bking@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dse-k8s-ctrl1001.eqiad.wmnet [22:38:57] (03CR) 10BCornwall: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1249390 (owner: 10Pppery) [22:41:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:45:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:51:17] !log root@apt1002:~# reprepro --noskipold --restrict vopsbot update bookworm-wikimedia [22:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:59:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:00:05] Deploy window Web Team deployment window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260309T2300) [23:04:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:09:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:10:34] (03PS1) 10Jasmine: wmnet: add wikikube-ctrl2006 to etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1249423 (https://phabricator.wikimedia.org/T406596) [23:10:40] (03CR) 10Brennen Bearnes: [C:03+1] phab_deploy_finalize: Remove support for Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248466 (owner: 10Muehlenhoff) [23:14:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:19:14] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:20:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:28:55] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:33:55] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Lumen 100g transport) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:40:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:42:13] (03PS1) 10Scott French: envoy: Decouple graceful drain from drain strategy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1249428 (https://phabricator.wikimedia.org/T364245) [23:43:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:46:38] FIRING: GnmiTargetDown: cr2-eqdfw is unreachable through gNMI - 
https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [23:47:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:50:41] FIRING: SystemdUnitFailed: bitu-permission-request.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed