[00:01:06] (CR) Scott French: "Thanks, Raine!" [deployment-charts] - https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) (owner: Kamila Součková)
[00:09:10] (CR) Scott French: "Thanks, Raine!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[00:56:14] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:02:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 284378408 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7050408 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:35] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:08:23] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:09:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:11:46] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500
[01:11:46] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500 (owner: TrainBranchBot)
[01:18:44] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:19:48] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:24:09] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500 (owner: TrainBranchBot)
[01:30:13] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:30:53] !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:41:15] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:51:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:54:29] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[02:00:56] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:01:29] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[02:06:11] FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[02:07:20] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 23s)
[02:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 786199704 and 51 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:47:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:09:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[04:41:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[04:55:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:42] (CR) Giuseppe Lavagetto: [C:+1] wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[05:09:37] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:16:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[05:33:30] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[05:51:32] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:56:17] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600)
[06:00:05] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600).
[06:06:11] FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[06:10:25] FIRING: [2x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:12] (CR) Muehlenhoff: [C:+1] "The patch looks good, but I left a comment on the comment :-)" [puppet] - https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: Bking)
[06:19:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 1F616EMO)
[06:29:22] (PS2) 1F616EMO: arbcom_zhwiki: Enable SecurePoll without PII rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309)
[06:30:25] FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:52:10] FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:56:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:57:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0700).
[07:00:05] georgekyz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:24] Good morning folks!
[07:00:59] I am planning to deploy my patch now, is anybody around ?
[07:03:22] I running it.
[07:03:34] (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[07:04:26] (Merged) jenkins-bot: EventStreamConfig: Add rr-multilingual prediction_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[07:05:19] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]]
[07:05:22] T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892
[07:07:35] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:08:03] !log gkyziridis@deploy1003 gkyziridis: Continuing with sync
[07:08:16] syncing
[07:08:42] SRE, Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11780898 (MoritzMuehlenhoff) p:Triage→Medium
[07:08:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:12:19] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]] (duration: 07m 00s)
[07:12:23] T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892
[07:12:53] the deployment finished successfully!
[07:13:09] SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11780904 (MoritzMuehlenhoff) Was this linked in some onboarding doc that you followed? If so, it can be removed for now. We're currently reworking 2FA support in CAS and the originally...
[07:13:58] (CR) Gkyziridis: [C:+2] ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[07:16:01] (Merged) jenkins-bot: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[07:20:49] SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11780907 (MoritzMuehlenhoff) Since Andrea is working as a contractor the tracking entry in data.yaml should use the The t...
[07:22:25] SRE, SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11780912 (MoritzMuehlenhoff) In progress→Resolved a:hnowlan @MPostoronca-WMF Your access is enabled, so I'm rmarking this as resolved. If you run into any issues,...
[07:24:57] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:25:06] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:27:56] (PS1) Jaime Nuche: ApiAuthManagerHelper: Accept fields with undefined label [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027)
[07:29:00] (CR) TrainBranchBot: [C:+2] "Approved by jnuche@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027) (owner: Jaime Nuche)
[07:30:33] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 64049
[07:32:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 64049
[07:38:00] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fea7794]: deploy to freshly reimaged wdqs host
[07:38:05] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fea7794]: deploy to freshly reimaged wdqs host (duration: 00m 05s)
[07:38:07] !log purge prometheus-nginx-exporter from url downloaders, remnants of early hcapcha rollout
[07:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:36] (PS1) Mszwarc: Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837)
[07:40:42] (Merged) jenkins-bot: ApiAuthManagerHelper: Accept fields with undefined label [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027) (owner: Jaime Nuche)
[07:41:06] !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]]
[07:41:09] T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given - https://phabricator.wikimedia.org/T422027
[07:41:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:42:21] I'll deploy a config change if there's nothing going on
[07:42:42] (I see it is, I'll wit)
[07:42:45] wait*
[07:43:08] !log jnuche@deploy1003 jnuche: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:43:33] !log jnuche@deploy1003 jnuche: Continuing with sync
[07:46:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:46:54] (CR) Kosta Harlan: [C:+1] Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
[07:47:40] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (, T421714) xfer wdqs-all from wdqs2016.codfw.wmnet -> wdqs1027.eqiad.wmnet, repooling both afterwards
[07:47:44] T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714
[07:47:55] !log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]] (duration: 06m 39s)
[07:47:58] T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given - https://phabricator.wikimedia.org/T422027
[07:48:54] SRE, DC-Ops, Infrastructure-Foundations, netops, Sustainability (Incident Followup): ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11780961 (ayounsi)
[07:49:22] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
[07:50:16] (Merged) jenkins-bot: Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
[07:50:17] (PS1) Kevin Bazira: ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350)
[07:50:40] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]]
[07:50:43] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
[07:50:56] RESOLVED: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[07:51:23] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:52:23] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:52:40] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:53:58] (CR) Muehlenhoff: [C:+2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/1266242 (owner: Muehlenhoff)
[07:54:14] !log jmm@dns1004 START - running authdns-update
[07:55:55] !log jmm@dns1004 END - running authdns-update
[07:56:39] !log mszwarc@deploy1003 mszwarc: Continuing with sync
[07:58:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:00:05] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0800)
[08:00:53] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]] (duration: 10m 13s)
[08:00:57] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
[08:01:15] morning, I will begin the train shortly
[08:01:58] (PS1) Arnaudb: apt-staging: error handling for restricted projects [puppet] - https://gerrit.wikimedia.org/r/1266920 (https://phabricator.wikimedia.org/T422070)
[08:02:03] (CR) Arnaudb: [C:+2] apt-staging: error handling for restricted projects [puppet] - https://gerrit.wikimedia.org/r/1266920 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
[08:03:25] (PS1) TrainBranchBot: group2 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480)
[08:03:28] (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[08:04:19] (Merged) jenkins-bot: group2 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[08:07:49] (CR) Ozge: [C:+1] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:08:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:10:28] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.22 refs T420480
[08:10:31] T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480
[08:11:03] (CR) Kevin Bazira: [C:+2] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:11:59] (PS1) Muehlenhoff: Update email record for andreawest [puppet] - https://gerrit.wikimedia.org/r/1266931 (https://phabricator.wikimedia.org/T420053)
[08:12:45] SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11781036 (MoritzMuehlenhoff) >>! In T420053#11778139, @AWesterinen wrote: > I still have the error,...
[08:13:10] (Merged) jenkins-bot: ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:14:38] (PS4) Volans: webproxies: allow cloudcumin to openstack [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360)
[08:14:38] (CR) Volans: "PCC available at:" [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
[08:16:16] ops-eqiad, DBA, DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111 (FCeratto-WMF) NEW
[08:16:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:16:24] (PS1) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:16:49] (PS2) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:17:10] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:17:56] (CR) Filippo Giunchedi: [C:+1] "LGTM, nice!" [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
[08:18:38] (PS1) Arnaudb: aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070)
[08:18:41] (CR) Arnaudb: [C:+2] aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
[08:19:02] (CR) CI reject: [V:-1] deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:19:21] (PS3) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:19:38] (PS4) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:20:00] (Merged) jenkins-bot: aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
[08:20:57] (CR) Ayounsi: [C:+1] "lgtm, pcc looks good too, to be carefully rolled out/tested." [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
[08:21:07] (PS5) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:23:15] (CR) CI reject: [V:-1] deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:24:10] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8368/co" [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:24:15] (PS6) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:30:22] !log briefly disabling puppet on P:installserver::proxy to deploy g/1266885
[08:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:21] (CR) Volans: [C:+2] webproxies: allow cloudcumin to openstack [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
[08:33:26] (CR) Btullis: [C:+1] "Nice, thanks." [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:40:18] (CR) Brouberol: [C:+2] deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:40:45] slyngs, effie, I'm going to reboot mr1-esams for a software upgrade, it will go down for up to 20min, device itself is downtimed, but there might be some alerting noise from esams mgmt being unreachable
[08:41:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:41:17] FIRING: [3x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:41:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[08:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:00] !log reboot mr1-esams - T416450
[08:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:04] T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[08:42:36] XioNoX: thank you, break a leg
[08:43:59] PROBLEM - ps1-by27-esams-infeed-load-tower-B-single-phase on ps1-by27-esams is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:44:20] SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781126 (atsuko) Thanks, I'll update the onboarding.
[08:44:32] !log dpogorzelski@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: sync
[08:44:42] SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781127 (atsuko) a:atsuko
[08:44:45] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:44:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90206 and previous config saved to /var/cache/conftool/dbconfig/20260402-084452-fceratto.json
[08:44:56] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:45:07] !log dpogorzelski@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[08:45:23] PROBLEM - Host ps1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100%
[08:45:23] PROBLEM - Host ps1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100%
[08:45:32] !log dpogorzelski@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[08:45:39] FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and mr1-esams (10.80.127.5) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDow
[08:46:09] !log dpogorzelski@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[08:46:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:46:17] SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781130 (atsuko)
[08:47:08] (PS1) Gkyziridis: ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - https://gerrit.wikimedia.org/r/1266941
[08:47:29] SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781133 (atsuko)
[08:47:50] SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781135 (atsuko) Open→Declined
[08:49:13] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:49:49] (CR) Ilias Sarantopoulos: [C:+1] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - https://gerrit.wikimedia.org/r/1266941 (owner: Gkyziridis)
[08:49:54] !log added Atsuko to the cn=ops LDAP group T421860
[08:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:58] T421860: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860
[08:50:23] FIRING: GnmiTargetDown: asw1-bw27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[08:50:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and mr1-esams (10.80.127.5) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPD
[08:50:47] RECOVERY - Host ps1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.26 ms
[08:50:47] RECOVERY - Host ps1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.25 ms
[08:51:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:51:27] (CR) Dpogorzelski: [C:+1] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - https://gerrit.wikimedia.org/r/1266941 (owner: Gkyziridis)
[08:51:32] router is back up - 10min downtime
[08:52:15] (CR) Gkyziridis: [C:+2] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - https://gerrit.wikimedia.org/r/1266941 (owner: Gkyziridis)
[08:53:34] SRE, SRE-Access-Requests: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11781141 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @atsuko Your SSH access should now be working. You can e.g. try to connect to cumin1003.e...
[08:54:13] (Merged) jenkins-bot: ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - https://gerrit.wikimedia.org/r/1266941 (owner: Gkyziridis)
[08:54:13] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:55:23] RESOLVED: GnmiTargetDown: asw1-bw27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[08:55:27] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:55:41] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:56:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:57:47] (03CR) 10Muehlenhoff: [C:03+2] Update Cumin alias for contint to also cover the spun-off Trixie role [puppet] - 10https://gerrit.wikimedia.org/r/1266215 (owner: 10Muehlenhoff) [08:58:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:08:30] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [09:12:31] (03PS1) 10Klausman: admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266947 [09:17:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90207 and previous config saved to /var/cache/conftool/dbconfig/20260402-091743-fceratto.json [09:17:47] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:19:48] !log upgrading Envoy on the config-master servers to 1.35.9 T419637 T410975 [09:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:58] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [09:19:59] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [09:21:37] (03PS1) 10Gkyziridis: ml-services: Revert the changes and the model version into the previous state. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1266948 [09:23:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1265258 (owner: 10Slyngshede) [09:23:51] (03CR) 10Gkyziridis: [C:03+2] ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266948 (owner: 10Gkyziridis) [09:25:57] (03PS1) 10Volans: Add missing includes from Netbox exported data [dns] - 10https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) [09:26:07] (03Merged) 10jenkins-bot: ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266948 (owner: 10Gkyziridis) [09:27:36] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:27:42] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:27:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P90208 and previous config saved to /var/cache/conftool/dbconfig/20260402-092751-fceratto.json [09:28:30] RESOLVED: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [09:29:31] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master-codfw [09:29:35] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors [09:29:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors [09:30:35] (03PS4) 10Arnaudb: gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - 10https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) [09:33:23] (03PS1) 10Arnaudb: gerrit: update sshd timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1266149 (https://phabricator.wikimedia.org/T417996) [09:33:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master-codfw [09:33:47] (03Abandoned) 10Arnaudb: gerrit: update timeouts for gitiles [puppet] - 10https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904) (owner: 10Arnaudb) [09:37:53] (03CR) 10Muehlenhoff: [C:03+2] Obsolete airflow-search-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1242407 (owner: 10Muehlenhoff) [09:38:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P90209 and previous config saved to /var/cache/conftool/dbconfig/20260402-093759-fceratto.json [09:39:25] (03PS5) 10Arnaudb: gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - 10https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) [09:39:29] (03CR) 10Effie Mouzeli: [C:03+1] image-suggestion: remove service configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [09:39:45] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [09:40:13] (03CR) 10Effie Mouzeli: [C:03+1] profile::service_proxy::envoy: remove mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) (owner: 10Elukey) [09:40:30] (03PS2) 10Elukey: profile::service_proxy::envoy: remove mw-parsoid [puppet] - 
10https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) [09:41:18] (03CR) 10Arnaudb: [C:03+2] gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - 10https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [09:41:50] (03Abandoned) 10Effie Mouzeli: profile::service_proxy::envoy: remove mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) (owner: 10Elukey) [09:43:42] (03CR) 10Ayounsi: [C:03+1] "thanks!" [dns] - 10https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) (owner: 10Volans) [09:45:33] !log javiermonton@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: sync [09:45:42] !log javiermonton@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [09:46:56] FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare [09:47:41] !log javiermonton@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [09:48:02] !log javiermonton@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [09:48:03] (03Abandoned) 10Majavah: hieradata: Add dumps.wikimedia.org CDN mapping [puppet] - 10https://gerrit.wikimedia.org/r/1237194 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [09:48:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90210 and previous config saved to /var/cache/conftool/dbconfig/20260402-094808-fceratto.json [09:48:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:48:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with 
reason: Maintenance [09:48:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90211 and previous config saved to /var/cache/conftool/dbconfig/20260402-094834-fceratto.json [09:48:37] !log javiermonton@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [09:48:58] !log javiermonton@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [09:53:07] (03PS1) 10Muehlenhoff: Obsolete airflow-wmde-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1266959 [09:58:30] (03CR) 10Muehlenhoff: [C:03+2] Update email record for andreawest [puppet] - 10https://gerrit.wikimedia.org/r/1266931 (https://phabricator.wikimedia.org/T420053) (owner: 10Muehlenhoff) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1000) [10:00:04] dues: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[10:00:25] (03CR) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [10:02:05] (03PS1) 10Volans: cumin: use webproxy to connect to openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) [10:02:05] (03CR) 10Volans: "PCC available for cloudcumin1001 here:" [puppet] - 10https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [10:03:22] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266229 (owner: 10Muehlenhoff) [10:03:35] (03PS1) 10Arnaudb: gerrit: update upstream_idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1266962 (https://phabricator.wikimedia.org/T421827) [10:03:38] (03CR) 10Arnaudb: [C:03+2] gerrit: update upstream_idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1266962 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [10:04:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:04:17] (03PS1) 10Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 [10:04:26] (03CR) 10CI reject: [V:04-1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (owner: 10Volans) [10:05:23] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [10:05:32] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply 
[10:05:34] (03PS2) 10Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) [10:05:49] (03CR) 10CI reject: [V:04-1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [10:08:55] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [10:09:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:09:42] (03PS1) 10Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) [10:10:18] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:10:38] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [10:10:57] (03CR) 10CI reject: [V:04-1] Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey) [10:11:19] (03PS3) 10Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) [10:11:41] (03CR) 10Volans: [C:03+2] cumin: use webproxy to connect to openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [10:12:36] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [10:12:49] (03Merged) 
10jenkins-bot: rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [10:13:17] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [10:14:30] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:15:11] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker-exp2001.codfw.wmnet [10:16:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:16:52] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:16:54] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [10:17:00] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:17:05] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [10:17:14] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:17:19] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:17:24] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:17:32] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [10:17:36] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:17:40] !incidents [10:17:40] 7803 (UNACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [10:17:46] !ack 7803 [10:17:46] 7803 (ACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [10:17:50] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:17:55] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:18:03] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:18:08] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:18:24] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:18:28] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:18:45] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:18:46] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:18:50] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:19:11] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:19:15] !log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [10:19:17] !log installing freetype security updates [10:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker-exp2001.codfw.wmnet [10:19:27] !log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [10:19:30] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [10:19:31] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:19:35] !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [10:19:36] !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[10:21:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90212 and previous config saved to /var/cache/conftool/dbconfig/20260402-102105-fceratto.json [10:21:09] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:21:41] (03PS2) 10Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) [10:22:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [10:22:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:23:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey) [10:24:44] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11781519 (10Volans) Given this has been moved to the backlog I'll leave here a comment for our future selves: i... 
[10:26:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 166195784 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:27:29] (03PS1) 10Hashar: wm-checks-api: add tag for PostgreSQL jobs [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1266965 [10:27:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:28:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3533304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:30:40] FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:30:41] 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781562 (10Peachey88) [10:31:02] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:31:12] 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781588 (10MBH) Many such servers: 26, 31. When just opening pages for read. 
[10:31:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P90213 and previous config saved to /var/cache/conftool/dbconfig/20260402-103113-fceratto.json [10:31:25] 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781591 (10Peachey88) [10:31:27] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:32:19] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [10:33:27] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [10:34:45] 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781642 (10Thryduulf) I've been experiencing these errors intermittently on English Wikipedia today, but only on trying to save edits. Each time trying again has resulted in the save being successful. 
[10:37:41] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:38:10] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [10:38:22] (03CR) 10CI reject: [V:04-1] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [10:39:14] (03PS5) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) [10:39:29] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1148 - https://phabricator.wikimedia.org/T421892#11781672 (10Jclark-ctr) 05Open→03Declined This ticket automated ticket was opened by mistake it was still being worked on in In T411919 [10:39:44] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [10:40:02] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:41:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P90214 and previous config saved to /var/cache/conftool/dbconfig/20260402-104121-fceratto.json [10:41:51] (03Merged) 10jenkins-bot: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [10:41:57] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T421970#11781681 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr rebalanced [10:43:18] 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781698 (10Aklapper) [10:43:53] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet [10:44:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 76721280 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:45:00] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:45:16] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11781704 (10Jclark-ctr) 05Open→03Resolved replaced failed psu Outbound ticket for psu 1-258638557493 [10:45:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3553128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:45:43] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:48:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:48:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:48:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:48:23] PROBLEM - 
Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:48:28] fwiw I just got 'cannot access the database: database servers in cluster31 are overloaded' when trying to save an edit on metawiki. worked fine on the second attempt. [10:48:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 298909248 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:49:26] oh i see it's already known, apologies :) [10:49:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4010680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:49:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:49:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:50:33] 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781731 (10Wellverywell) p:05Triage→03Unbreak! 
[10:50:41] 10SRE-Access-Requests, 06Data-Platform-SRE, 06Wikidata Platform Team: Request: wdqs shell access for user @AWesterinen-WMF - https://phabricator.wikimedia.org/T422141 (10gmodena) 03NEW [10:51:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90215 and previous config saved to /var/cache/conftool/dbconfig/20260402-105129-fceratto.json [10:51:33] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:51:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [10:51:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90216 and previous config saved to /var/cache/conftool/dbconfig/20260402-105142-fceratto.json [10:52:14] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781750 (10RhinosF1) [10:52:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:54:13] 10SRE-Access-Requests, 06Data-Platform-SRE, 06Wikidata Platform Team: Request: wdqs shell access for user AWesterinen-WMF - https://phabricator.wikimedia.org/T422141#11781774 (10gmodena) [10:56:57] 10SRE-Access-Requests, 06Data-Platform-SRE, 06Wikidata Platform Team: Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11781779 (10gmodena) [10:57:52] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781783 (101F616EMO) I experienced such errors when diffing and saving edits. 
[10:58:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781785 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty. Replaced Drive slot 16 with matching 8tb sata drive [10:58:45] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781790 (10Ladsgroup) We are on it. [10:59:47] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781793 (101F616EMO) Should I expect the coming backport window be cancelled or delayed due to this incident? [11:00:25] (03PS4) 10Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) [11:00:25] (03PS1) 10Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) [11:01:28] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781814 (10RhinosF1) >>! In T422130#11781793, @1F616EMO wrote: > Should I expect the coming backport window be cancelled or delayed due to this incident? Very likely yes. A dep... [11:02:00] (03CR) 10Btullis: [C:04-1] "Set to -1 pending the review by Infrastructure Foundations." 
[puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [11:04:16] (03PS1) 10Esanders: Fix suggestion mode availability check [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) [11:04:20] (03CR) 10Muehlenhoff: Add analytics-fr-tech system user and corresponding groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [11:05:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: 10Esanders) [11:07:26] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781846 (10Jclark-ctr) After replacement Server showed drive as foreign. continued to fail to clear foreign config. Replaced drive again with new seagate 8tb sata drive [11:07:48] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781847 (101F616EMO) >>! In T422130#11781814, @RhinosF1 wrote: >>>! In T422130#11781793, @1F616EMO wrote: >> Should I expect the coming backport window be cancelled or delayed d... 
[11:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 11.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:14:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:20:54] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781890 (10Lucas_Werkmeister_WMDE) [11:21:41] FIRING: [8x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:24:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90217 and previous config saved to /var/cache/conftool/dbconfig/20260402-112421-fceratto.json [11:24:25] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:26:41] FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:26:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:27:00] !incidents [11:27:00] 7803 (ACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [11:27:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:27:49] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11781903 (10BTullis) [11:27:50] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781904 (10Jclark-ctr) 05Open→03Resolved [11:28:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11781909 (10Jclark-ctr) a:03Jclark-ctr [11:29:02] (03CR) 10Jforrester: [C:03+1] REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [11:32:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 6.702e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [11:32:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:34:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to 
https://phabricator.wikimedia.org/P90218 and previous config saved to /var/cache/conftool/dbconfig/20260402-113429-fceratto.json [11:34:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 97599648 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:35:23] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11781922 (10Jclark-ctr) updating bios firmware, expander firmware due to comms error on backplane, and idrac firmware additionally [11:35:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3557000 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:36:54] (03PS1) 10Arnaudb: gerrit: bump upstream_idle_timeout to 900s [puppet] - 10https://gerrit.wikimedia.org/r/1266989 (https://phabricator.wikimedia.org/T421904) [11:37:12] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11781927 (10BTullis) I have validated all SSH keys via out-of... 
[11:37:15] (03CR) 10Arnaudb: [C:03+2] gerrit: bump upstream_idle_timeout to 900s [puppet] - 10https://gerrit.wikimedia.org/r/1266989 (https://phabricator.wikimedia.org/T421904) (owner: 10Arnaudb) [11:37:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:38:23] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:39:19] 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11781930 (10Gehel) p:05Triage→03High [11:42:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:44:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P90219 and previous config saved to /var/cache/conftool/dbconfig/20260402-114437-fceratto.json [11:47:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:48:23] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781968 (10Thryduulf) I've just encountered what I presume is the same error, this time when trying to use the reply tool [6a4d47bf-961e-4513-9b1f-c6970e11f156] Caught exception... [11:48:23] (03PS5) 10Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) [11:48:24] (03PS2) 10Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) [11:51:15] (03PS1) 10Daniel Kinzler: rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995 [11:51:51] RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:52:11] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:52:17] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254925 (owner: 10PipelineBot) [11:52:24] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254926 (owner: 10PipelineBot) [11:52:34] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241846 (owner: 10PipelineBot) [11:52:44] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258153 (owner: 10PipelineBot) [11:52:45] RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:52:55] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254927 (owner: 10PipelineBot) [11:53:04] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1246819 (owner: 10PipelineBot) [11:54:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:54:30] (03CR) 10Ayounsi: [C:03+1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [11:54:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90220 and previous config saved to 
/var/cache/conftool/dbconfig/20260402-115446-fceratto.json [11:54:50] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:55:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:55:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90221 and previous config saved to /var/cache/conftool/dbconfig/20260402-115511-fceratto.json [11:59:00] I have a high visibility UBN in for the deployment window - just waiting for it to merge [11:59:59] (03PS1) 10Brouberol: deployment_server: tweak the labels on opensearch_k8s_master_cert_expiry_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1266999 (https://phabricator.wikimedia.org/T418175) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1200) [12:02:02] ah - timezone change - the window starts in one hour [12:02:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:03:19] (03CR) 10Brouberol: [C:03+2] deployment_server: tweak the labels on opensearch_k8s_master_cert_expiry_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1266999 (https://phabricator.wikimedia.org/T418175) (owner: 10Brouberol) [12:05:25] FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:41] FIRING: [55x] 
ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:07:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:09:35] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1373.eqiad.wmnet with OS trixie [12:09:47] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1374.eqiad.wmnet with OS trixie [12:09:57] !log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1373 [12:09:57] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1373 [12:10:08] !log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1374 [12:10:08] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1374 [12:10:47] edsanders: fyi there is an incident at the moment (T422130) so the window might be affected [12:10:48] T422130: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130 [12:11:02] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [12:11:31] (03CR) 10Volans: [C:03+2] Add missing includes from Netbox exported data [dns] - 10https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) (owner: 10Volans) [12:11:41] FIRING: [55x] ConfdResourceFailed: confd resource 
_etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:11:41] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:11:57] !log volans@dns1004 START - running authdns-update [12:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:12:19] (03CR) 10JMeybohm: [C:03+1] admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266947 (owner: 10Klausman) [12:12:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782064 (10Jclark-ctr) 05Open→03Resolved ` A configuration related issue on the device Backplane is resolved. ` [12:13:46] !log volans@dns1004 END - running authdns-update [12:13:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:14:26] p858snake I'd like to start my deployment asap, is everything on hold at the moment? [12:16:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782090 (10FCeratto-WMF) Thanks! 
[12:16:41] FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:17:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11782094 (10Jclark-ctr) @herron can you assist with updating puppet on this install ticket ? [12:18:38] Rhoni [12:18:46] *typo [12:18:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:19:02] !incidents [12:19:02] 7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad) [12:19:03] 7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [12:19:12] RhinosF1: is there any chance of getting a UBN backported, despite T422130? 
[12:19:13] T422130: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130 [12:19:32] edsanders: no idea why you are asking me [12:19:32] (this: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1266984) [12:19:39] I saw you commented on the incident task [12:19:42] You need to ask the IC [12:19:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:19:50] I suggest in #wikimedia-sre [12:19:53] Thanks [12:19:53] Much quieter there [12:20:01] (03CR) 10JMeybohm: Upgrade aux-k8s-codfw to k8s 1.31 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1265426 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [12:20:08] (03CR) 10JMeybohm: [C:03+1] admin_ng: upgrade aux-k8s-codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265427 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [12:21:41] FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:22:32] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage [12:22:35] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage [12:22:40] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782127 (10taavi) [12:24:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - 
https://phabricator.wikimedia.org/T418928#11782144 (10Jclark-ctr) a:03Jclark-ctr [12:25:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782146 (10Jclark-ctr) [12:26:41] FIRING: [54x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:26:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90222 and previous config saved to /var/cache/conftool/dbconfig/20260402-122642-fceratto.json [12:26:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:27:05] (03CR) 10Btullis: Add analytics-fr-tech system user and corresponding groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [12:27:44] 06SRE, 10DNS, 06Infrastructure-Foundations, 10netbox, and 3 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11782158 (10Volans) p:05Triage→03Medium I've merged and release the fix, do you want to keep the task open to implement some form o... 
[12:28:49] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es1042.eqiad.wmnet [12:28:50] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es1042.eqiad.wmnet [12:29:17] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage [12:30:46] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042.eqiad.wmnet: Restoring section [12:30:57] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782163 (10FCeratto-WMF) The host booted, I triggered a puppet run manually, started MariaDB, enabled alarming and checked that icinga is green and started pooling in to help with T422130 [12:31:11] (03CR) 10JMeybohm: service::catalog: add sophroid service catalog entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [12:31:23] (03CR) 10JMeybohm: [C:03+1] conftool: add sophroid etcd data [puppet] - 10https://gerrit.wikimedia.org/r/1248611 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [12:31:41] RESOLVED: [44x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:31:46] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:31:57] (03CR) 10JMeybohm: [C:03+1] wmnet: add sophroid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [12:32:20] (03CR) 10Klausman: [V:03+2 C:03+2] admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266947 (owner: 10Klausman) [12:32:32] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86555328 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:39] (03PS1) 10Anne Tomasevich: Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) [12:32:46] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es1042.eqiad.wmnet: Restoring section [12:32:49] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage [12:32:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042.eqiad.wmnet: Restoring section [12:32:59] !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es1042.eqiad.wmnet: Restoring section [12:33:10] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042: Restoring section [12:33:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782172 (10ops-monitoring-bot) Starting pool of es1042 by fceratto@cumin1003: Restoring section [12:33:26] (03CR) 10JMeybohm: [C:04-1] "This is the wrong file. 
Since you're targeting the aux cluster you need to add the pool there (`hieradata/role/common/aux_k8s/worker.yaml`" [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [12:33:34] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 200752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:34:33] !incidents [12:34:33] 7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad) [12:34:33] 7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [12:34:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich) [12:35:52] (03CR) 10JMeybohm: [C:04-1] role::kubernetes::worker: add sophroid to the lvs pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [12:36:32] (03CR) 10Aude: [C:03+1] Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich) [12:36:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P90224 and previous config saved to /var/cache/conftool/dbconfig/20260402-123650-fceratto.json [12:38:22] RECOVERY - Check unit status of 
httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:38:22] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:38:22] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:39:23] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:39:39] (03Merged) 10jenkins-bot: admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266947 (owner: 10Klausman) [12:39:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:39:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:41:20] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782182 (10Jclark-ctr) a:03Jclark-ctr ` 2026-01-12 21:59:21 An unrecoverable disk media error occurred on Disk 20 in Backplane 2 of Integrated RAID Controller 1. Part Number =... 
[12:41:31] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:41:32] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:41:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:41:43] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782184 (10BTullis) I have run `cross-validate-accounts` for... [12:42:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782190 (10Jclark-ctr) 05Open→03Resolved [12:44:17] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:45:04] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:45:29] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1373.eqiad.wmnet with OS trixie [12:45:51] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[12:46:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P90225 and previous config saved to /var/cache/conftool/dbconfig/20260402-124659-fceratto.json [12:48:33] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1042: Restoring section [12:48:58] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782211 (10ops-monitoring-bot) Completed pooling of es1042 by fceratto@cumin1003: Restoring section [12:49:21] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1374.eqiad.wmnet with OS trixie [12:49:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:49:36] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782217 (10MatthewVernon) Thanks for the quick fixes @Jclark-ctr :-) [12:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:50:19] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:54:43] hi folks, just a reminder that we will be repooling codfw at 14:00 utc today [12:55:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:55:32] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 468938744 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:56:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782255 (10Jclark-ctr) @Jgreen replaced cable link came up. Sorry for delay [12:56:37] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:57:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90227 and previous config saved to /var/cache/conftool/dbconfig/20260402-125707-fceratto.json [12:57:11] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:57:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance [12:57:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:57:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90228 and previous config saved to /var/cache/conftool/dbconfig/20260402-125732-fceratto.json [12:58:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1300). [13:00:05] manfredi, HouseOfM, edsanders, and annet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] o/ [13:00:23] o/ [13:00:24] I can deploy but I need to catch up with the incident first [13:00:32] not sure if it’s okay to deploy at the moment [13:00:41] last I heard it isn't [13:01:01] I've also asked to deploy my UBN asap once the incident is resolved [13:01:14] https://www.wikimediastatus.net/incidents/kq46rrxd2yy4 is still up [13:01:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:04] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782282 (10Aklapper) [13:02:15] I agree that edsanders’ change seems top priority once we can deploy at all [13:02:19] (03PS1) 10Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) [13:03:07] (03CR) 10CI reject: [V:04-1] Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis) [13:03:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:08:53] (03PS2) 10Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) [13:09:39] (03CR) 10CI reject: [V:04-1] Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis) [13:13:31] (03PS3) 
10Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) [13:15:21] (03PS1) 10Ayounsi: Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042 [13:16:34] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 47811456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:17:24] (the codfw repool is being pulled ahead, if that solves the incident then we *may* be able to deploy one or two patches in the window after all) [13:17:33] !log jasmine@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool codfw [reason: no reason specified, T414486] [13:17:37] T414486: Upgrade AUX clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414486 [13:17:46] !log jasmine@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool codfw [reason: no reason specified, T414486] [13:18:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:18:33] !log jasmine@cumin1003 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: maintenance - T414486 [13:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:20:31] (03CR) 10Btullis: [C:04-1] "I'm just waiting for final approval from Haroon on the ticket, for his 6 reports." [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis) [13:20:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3981016 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:22:09] !incidents [13:22:09] 7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad) [13:22:09] 7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [13:23:50] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11782358 (10Jclark-ctr) [13:27:16] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) [13:28:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:28:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:29:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90229 and previous config saved to 
/var/cache/conftool/dbconfig/20260402-132914-fceratto.json [13:29:18] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:29:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782375 (10Jgreen) >>! In T417295#11782255, @Jclark-ctr wrote: > @Jgreen replaced cable link came up. Sorry for delay @Jclark-ctr looks good, it's imaging now. Thanks! [13:29:52] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782376 (10BTullis) This patch for the... [13:30:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:30:53] (03CR) 10Muehlenhoff: [C:03+1] "Patch looks good, can be merged once approval is done" [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis) [13:31:11] (03CR) 10Eevans: [C:03+2] charts/cassandra-http-gateway: configurable Cassandra keyspace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259188 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:31:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782380 (10Jclark-ctr) [13:32:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: 
10Btullis) [13:32:44] (03CR) 10Eevans: [C:03+2] services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:33:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:33:58] !ack [13:33:59] All incidents are already acked. [13:34:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:34:51] (03Merged) 10jenkins-bot: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:35:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 21.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:35:57] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11782401 (10Jclark-ctr) @VRiley-WMF Thanks for following up I had Sent the email with instructions to Papaul while I was out on Tuesday. This will require som... 
[13:36:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11782402 (10Jclark-ctr) 05Open→03Resolved [13:37:45] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [13:39:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P90230 and previous config saved to /var/cache/conftool/dbconfig/20260402-133923-fceratto.json [13:39:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:41:28] (03PS1) 10Kosta Harlan: hCaptcha: Emit Prometheus counter on health check failover [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204) [13:41:47] !log jasmine@cumin1003 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: maintenance - T414486 [13:41:51] T414486: Upgrade AUX clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414486 [13:42:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:42:58] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [13:43:51] RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:44:45] RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:49:23] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: 10Esanders) [13:49:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P90231 and previous config saved to /var/cache/conftool/dbconfig/20260402-134931-fceratto.json [13:49:44] ^ there’s some chance we’ll be able to deploy; otherwise I’ll undo that CR+2 (cc edsanders) [13:50:16] I'm here [13:50:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan) [13:50:41] are we ready to deploy? 
[13:50:54] I just got the go-ahead in the security channel, so i think yes [13:50:55] (03Merged) 10jenkins-bot: Fix suggestion mode availability check [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: 10Esanders) [13:50:57] ye [13:51:02] * Lucas_WMDE spiders the pig [13:51:15] oh, that gate-and-submit was a lot faster than I expected [13:51:25] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]] [13:51:28] T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143 [13:51:40] (03PS2) 10Arnaudb: gerrit: add Cache-Control for Gitiles with mod_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1266238 (https://phabricator.wikimedia.org/T409422) [13:51:40] Lucas_WMDE: thanks [13:52:53] (03CR) 10Btullis: [C:03+2] Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [13:53:09] !log lucaswerkmeister-wmde@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmne [13:53:10] t/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_ [13:53:10] 
dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 44s) [13:53:28] * Lucas_WMDE looks [13:54:06] I think the sudo docker-pusher failed with “blob upload unknown”? [13:54:09] let me try again… [13:54:47] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]] [13:55:45] !log lucaswerkmeister-wmde@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmne [13:55:45] t/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_ [13:55:45] dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 00m 58s) [13:56:06] :( [13:56:25] same error I think [13:56:29] “blob upload unknown” [13:57:11] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782509 (10cmooney) We are hopeful the situation should have improved after codfw was repooled, adding additional capacity. Root cause of the circuit breaking is still being in... [13:57:15] oh dear [13:58:03] jasmine_: as the codfw repooler (thanks again), any idea if this could be related? 
[13:58:17] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:58:19] (03PS1) 10Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [13:58:26] I’m imagining something like, scap now has to push the new mw image to codfw, but something on codfw might not be ready for it… [13:58:29] just guessing though [13:58:35] I'll try once more for luck [13:58:48] ok [13:58:53] !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]] [13:58:56] T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143 [13:58:58] I didn’t realize you can deploy, I should’ve asked ^^ [13:59:00] sorry [13:59:17] lucas_wmde: looking [13:59:20] thx [13:59:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90232 and previous config saved to /var/cache/conftool/dbconfig/20260402-135939-fceratto.json [13:59:43] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:59:56] jouncebot: nowandnext [13:59:56] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1300) [13:59:56] In 0 hour(s) and 0 minute(s): DC Switchover: Day 8 - Codfw Repool (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400) [13:59:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [14:00:04] jasmine_: May I have your attention please! DC Switchover: Day 8 - Codfw Repool. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400) [14:00:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90233 and previous config saved to /var/cache/conftool/dbconfig/20260402-140004-fceratto.json [14:00:08] !log esanders@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/ [14:00:08] mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi [14:00:08] awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 15s) [14:00:48] jasmine_: I need to reload the CI Jenkins [14:01:05] it does not take long, I don't think it affects the switchover [14:03:07] !log Jenkins CI: reloading configuration from disk to poll new nodes # T421114 [14:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:11] hashar: FYI, codfw was already repooled to respond to the incident (but I’m not sure how complete it is) [14:03:12] T421114: Rebuild all Jenkins agents VM to Bookworm to support Java 21 - https://phabricator.wikimedia.org/T421114 [14:03:17] done [14:03:27] Lucas_WMDE: ah cool, thank you! 
[14:03:48] (03PS2) 10Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [14:03:48] (we’re also still trying to deploy an UBN fix backport, but running into issues in scap) [14:04:16] (03CR) 10Elukey: [WIP] Move linting to Ruff and apply code fixes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [14:05:34] (03CR) 10Elukey: "First pass! I have intentionally removed a lot of problems allowing exceptions for tests etc.., I think it would be impossible (and probab" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [14:05:48] (03CR) 10Ottomata: stream: mw-page-html-content-change-enrich (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [14:06:07] hashar: yes we repooled a little bit earlier than scheduled, codfw is back up now [14:07:25] jasmine_: thank you and congratulations [14:08:22] could/should we make the config reload a part of a repool/depool? 
[14:09:00] (03PS3) 10Bking: opensearch: handle IP changes for software firewall [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) [14:09:05] (03PS2) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) [14:09:07] (03CR) 10Bking: [C:03+2] opensearch: handle IP changes for software firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking) [14:09:11] (03CR) 10Bking: [V:03+2 C:03+2] opensearch: handle IP changes for software firewall [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking) [14:09:16] !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]] [14:09:18] hnowlan: the Jenkins reload? Nope it is unrelated, I had to do it for some unrelated configuration changes I have made on Jenkins [14:09:19] T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143 [14:09:23] I confess I’m a bit torn between “revert the backport so the deployment is in a known state” and “leave it to be rolled out with the next deploy because it’s small and we really want it deployed” [14:09:26] hashar: ah okay [14:10:01] hnowlan: and whenever I act on Jenkins/Zuul I try to remember to check the deployment calendar to ensure that is not going to break some ongoing deployment :] [14:10:24] (03PS3) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) [14:10:26] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 11 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:10:32] !log esanders@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/ [14:10:32] mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi [14:10:32] awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 16s) [14:10:48] still the same error [14:11:17] (03CR) 10JavierMonton: stream: mw-page-html-content-change-enrich (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [14:11:45] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782589 (10HShaikh) I approve these re... 
[14:11:47] (03PS3) 10Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [14:12:31] (03CR) 10Elukey: [WIP] Move linting to Ruff and apply code fixes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [14:13:23] (03CR) 10Ottomata: "It is quite annoying that 'staging' AKA -next in dse-k8s is a different helmfile. It makes it hard to share common settings between 'stagi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [14:13:44] 06SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166 (10Lucas_Werkmeister_WMDE) 03NEW [14:13:47] I filed T422166 for the deploy blocker (cc edsanders), not sure how it should be tagged [14:13:48] T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166 [14:14:06] 06SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782617 (10Lucas_Werkmeister_WMDE) p:05Triage→03Unbreak! 
[14:14:11] cc jasmine_ ^ if you’re still looking into it [14:14:18] Lucas_WMDE: looking now if perhaps it's swift related see [14:14:18] [0] - https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook [14:14:55] (03PS1) 10Ladsgroup: Bump maxConnCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267062 [14:15:28] (03CR) 10CDanis: [C:03+1] Bump maxConnCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267062 (owner: 10Ladsgroup) [14:16:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267062 (owner: 10Ladsgroup) [14:16:50] 06SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782637 (10Lucas_Werkmeister_WMDE) Timeline note: this comes hot on the tail of T422130, for which @jasmine_ repooled codfw slightly earlier than [scheduled](https://wikitech.wikimedia.org/w/index.php?title=Deployments&old... [14:16:54] Amir1: good luck with that deploy [14:16:59] (03Merged) 10jenkins-bot: Bump maxConnCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267062 (owner: 10Ladsgroup) [14:17:13] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1267062|Bump maxConnCount]] [14:17:15] (I expect you’ll run into T422166) [14:17:23] Lucas_WMDE: that hopefully should prevent it from happening? 
[14:17:46] oh that's a different issue [14:17:48] yay [14:17:48] yeah [14:18:25] !log ladsgroup@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted [14:18:25] /mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/med [14:18:25] iawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 11s) [14:18:28] yup :( [14:19:23] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782654 (10BTullis) [14:19:39] (03CR) 10Btullis: [C:03+2] "Manager approval received." 
[puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis) [14:23:17] (03PS4) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) [14:23:24] (03CR) 10JavierMonton: stream: mw-page-html-content-change-enrich (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [14:23:36] (further investigation happening in -sre FTR) [14:24:35] (03CR) 10CDanis: [C:03+1] Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [14:27:10] 06SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782695 (10Scott_French) dockerd logs on deploy1003 for the above example: ` Apr 02 14:09:17 deploy1003 dockerd[1070]: time="2026-04-02T14:09:17.561327804Z" level=info msg="ignoring event" container=c8f32695fd426caa327d6d... [14:28:22] (03CR) 10Volans: [C:03+2] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [14:28:30] !log installing pyasn1 security updates [14:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:42] (03Merged) 10jenkins-bot: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [14:30:05] jasmine_: Time to snap out of that daydream and deploy DC Switchover: Day 8 - Codfw Repool. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400). 
[14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1430) [14:33:14] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782729 (10BTullis) I have now modified the `airflow-platfor... [14:34:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90236 and previous config saved to /var/cache/conftool/dbconfig/20260402-143452-fceratto.json [14:34:56] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:36:28] 10SRE-tools, 10Cumin, 06Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360#11782751 (10Volans) 05Open→03Resolved The cloudcumin hosts are now using the webproxies to connect to the openstack APIs and the firewall rule has been reverted... [14:37:31] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782760 (10MoritzMuehlenhoff) p:05Unbreak!→03Medium The immediate impact has been mitigated, reducing priority, the task might still be used to collect followups. 
[14:41:11] huge spike of PHP warnings from ExperimentManager all of a sudden [14:41:11] (03PS1) 10Eevans: cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) [14:41:19] (logspam-watch) [14:42:09] 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11782776 (10MoritzMuehlenhoff) What kind of access is needed? root access or simply shell access? We have exist... [14:42:17] !log installing libxml-parser-perl security updates [14:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:33] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782789 (10BTullis) You should also now be able to start con... [14:45:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P90237 and previous config saved to /var/cache/conftool/dbconfig/20260402-144500-fceratto.json [14:46:38] (03CR) 10Eevans: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:47:27] (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [14:48:34] (03CR) 10Elukey: [C:03+1] "Final review - this is currently a ok-ish use case since we already run the same config in prod. 
We agreed to open a task and follow up on" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [14:49:26] (03CR) 10JMeybohm: [C:03+1] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:50:08] (03CR) 10Eevans: [C:03+2] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:50:17] edsanders: are you still around and available to test your backport? (see -sre) [14:50:45] (03CR) 10Eevans: [V:03+2 C:03+2] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:51:13] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [14:51:41] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [14:52:01] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782824 (10BTullis) 4 Kerberos principals created and welcom... 
[14:52:25] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [14:52:40] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [14:53:40] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [14:53:54] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [14:54:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782828 (10Jgreen) 05Open→03Resolved hosts are up and running [14:55:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P90239 and previous config saved to /var/cache/conftool/dbconfig/20260402-145508-fceratto.json [14:55:12] (the ExperimentManager warning spike seems to have abated again fwiw) [14:56:38] !log swfrench@deploy1003 Started scap sync-world: Manual sync-world to pick up 1267062, 1266985 - T422143 [14:56:41] T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143 [14:56:44] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-eqiad,mr1-eqiad IPv6 with reason: switching from OSFP to BGP [14:56:46] \o/ [14:57:44] !log swfrench@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/ [14:57:44] mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest 
--label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi [14:57:44] awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 06s) [14:58:20] (03CR) 10Ssingh: "I am guessing this is based on probenet data? (not that everything else in the repo currently is but I am mostly curious)" [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [14:59:32] !log ongoing maintenance on mr1-eqiad [14:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:40] !log swfrench@deploy1003 Started scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 - T422143 [15:00:04] jnuche and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1500) [15:00:38] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [15:02:15] (03CR) 10Dzahn: [C:03+2] buildkitd: Bump buildkit image to wmf-v0.29.0 [puppet] - 10https://gerrit.wikimedia.org/r/1266395 (https://phabricator.wikimedia.org/T415284) (owner: 10Ahmon Dancy) [15:02:37] (03CR) 10Muehlenhoff: [C:03+1] "Preseed notes often use globbing where applicable, but with our ongoing migration of all servers to UEFI for hardware there will be a lot " [puppet] - 10https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: 10Herron) [15:03:03] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [15:03:45] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [15:04:20] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:04:20] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:05:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90241 and previous config saved to /var/cache/conftool/dbconfig/20260402-150517-fceratto.json [15:05:20] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:05:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Maintenance [15:05:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90242 and previous config saved to /var/cache/conftool/dbconfig/20260402-150542-fceratto.json [15:05:49] (03PS1) 10Papaul: Remove OSFP from mr1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1267081 (https://phabricator.wikimedia.org/T421238) [15:06:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:07:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:07:55] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11782910 (10Jhancock.wm) [15:08:45] (03CR) 10Papaul: [C:03+2] Remove OSFP from mr1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1267081 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul) [15:09:29] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [15:11:23] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [15:11:40] !log installing apache2 security updates [15:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:20] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:12:20] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:12:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:12:59] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:16:17] (03PS1) 10Papaul: Add back "replace osfp" to be able to remove it [homer/public] - 10https://gerrit.wikimedia.org/r/1267085 (https://phabricator.wikimedia.org/T421238) [15:20:29] (03CR) 10Papaul: [C:03+2] Add back "replace osfp" to be able to remove it [homer/public] - 10https://gerrit.wikimedia.org/r/1267085 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul) [15:22:31] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[15:23:08] !log maintenance complete on mr1-eqiad [15:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:22] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:26:12] !log restarted docker-registry-restricted.service on registry200[45] - T422166 [15:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:14] T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166 [15:26:28] !log swfrench@deploy1003 sync-world aborted: Manual full-rebuild sync-world to pick up 1267062, 1266985 - T422143 (duration: 26m 48s) [15:26:31] T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143 [15:27:38] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:27:46] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:31:16] !log restarted docker-registry-ml.service on registry200[45] - T422166 [15:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:19] T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166 [15:32:34] !log installing freetype security updates [15:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:59] (03CR) 10Dzahn: [C:03+1] gerrit: adjust idleTimeout on Jetty [puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [15:33:00] !log swfrench@deploy1003 Started scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 [15:33:02] T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143 [15:34:43] (03PS4) 10Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [15:35:06] (03CR) 10Elukey: [WIP] Move linting to Ruff and apply code fixes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [15:38:44] (03CR) 10Elukey: "Local, venvs created (so not the first run):" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [15:39:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90244 and previous config saved to /var/cache/conftool/dbconfig/20260402-153918-fceratto.json [15:39:22] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:41:49] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1256301/8370/" [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [15:41:50] (03PS5) 10Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) [15:44:23] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:45:37] (03PS14) 10Herron: site: opt-in insetup defaults by hostname prefix [puppet] - 10https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) [15:46:55] (03CR) 10A smart kitten: "FWIW that [phab1004 NOOP result](https://puppet-compiler.wmflabs.org/output/1256301/8370/phab1004.eqiad.wmnet/index.html) 
seems wrong - it" [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [15:46:59] (03CR) 10A smart kitten: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [15:48:31] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards Has improved - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [15:48:57] (03CR) 10A smart kitten: "(FWIW @dzahn@wikimedia.org, feel free to shoot me a message in IRC if you want to sync-up e.g. if/when deploying/testing this patch. I'm n" [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [15:49:08] (03CR) 10Herron: [C:03+2] "thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: 10Herron) [15:49:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:49:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P90245 and previous config saved to /var/cache/conftool/dbconfig/20260402-154925-fceratto.json [15:50:05] !log swfrench@deploy1003 swfrench: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:50:09] T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143 [15:50:10] (03CR) 10A smart kitten: "(if I'm around in IRC at the time you'll be deploying this, that is; otherwise feel free to just deploy it if/when is good for you :) )" [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [15:51:13] !log swfrench@deploy1003 swfrench: Continuing with sync [15:55:31] (03PS3) 10Btullis: Remove the site.pp definitions for decommissioned an-worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948) [15:55:47] (03PS1) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) [15:56:13] (03CR) 10Dzahn: [V:03+1 C:03+1] "it's because puppet DB queries were introduced somewhere (not by your patch) which often breaks compiler runs (Failed to execute '/pdb/que" [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [15:59:23] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:59:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P90246 and previous config saved to /var/cache/conftool/dbconfig/20260402-155934-fceratto.json [16:00:05] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:34] we’re so close to finishing the backport+config window lol [16:00:49] (with 1/4 patches deployed) [16:01:31] (03PS2) 10Herron: preseed: use efi for new kafka-logging hosts [puppet] - 10https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929) [16:01:33] (03CR) 10CI reject: [V:04-1] preseed: use efi for new kafka-logging hosts [puppet] - 10https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929) (owner: 10Herron) [16:01:38] (03PS2) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) [16:02:56] !log swfrench@deploy1003 Finished scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 (duration: 29m 56s) [16:02:59] T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143 [16:02:59] \i/ [16:03:04] \o/ \o/ \o/ [16:03:40] !log UTC afternoon backport+config window (very belatedly) done ^^ [16:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:50] thanks for figuring it out and deploying! [16:04:08] Amir1: your maxConnCount bump got deployed now btw ^ [16:04:15] thanks! [16:04:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:05:09] 06SRE, 07Datacenter-Switchover: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11783170 (10Scott_French) p:05Unbreak!→03Medium This was a curious one. Many thanks to @elukey and @CDanis for the assistance. tl;dr - Cached connections in the (restricted) docker registry's... 
[16:05:26] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11783179 (10Ahoelzl) I approve the addition of the listed WME... [16:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:23] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:09:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90247 and previous config saved to /var/cache/conftool/dbconfig/20260402-160942-fceratto.json [16:09:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:09:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance [16:10:44] (03Abandoned) 10Federico Ceratto: wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1255655 
(https://phabricator.wikimedia.org/T416705) (owner: 10Federico Ceratto) [16:11:45] (03PS3) 10Herron: preseed: use efi for new kafka-logging hosts [puppet] - 10https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929) [16:12:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:12:43] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:12:55] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:13:01] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:14:07] (03CR) 10Herron: [C:03+2] "ok! lets give this a try" [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [16:14:23] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:15:28] (03Merged) 10jenkins-bot: burrow: update expressions to handle multiple instances [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [16:15:28] (03CR) 10Dzahn: [V:03+1 C:03+2] phabricator: Set a custom default-mail-address for the test instance [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [16:15:53] jouncebot: nowandnext [16:15:53] For the next 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1600) 
[16:15:53] In 0 hour(s) and 44 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700) [16:15:53] In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700) [16:16:55] (03CR) 10Herron: [C:03+2] "thanks all!" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [16:18:02] (03CR) 10Dzahn: [V:03+1 C:03+2] "deployed. confirmed it is a NOOP / no error on production host." [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [16:18:31] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [16:19:10] (03CR) 10Scott French: [C:03+2] deployment_server: absent image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [16:23:33] (03Restored) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [16:24:14] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:35] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11783241 (10BTullis) 05Open→03Resolved p:05Triage→... 
[16:25:39] (03PS6) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) [16:25:48] (03CR) 10CI reject: [V:04-1] config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [16:26:51] (03Abandoned) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [16:29:13] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:31:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich) [16:32:25] (03PS1) 10Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267116 (https://phabricator.wikimedia.org/T421366) [16:33:19] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11783346 (10Ladsgroup) I was looking into this a bit yesterday (more general... 
[16:34:13] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:48] (03CR) 10Btullis: data-platform: Add alerts for opensearch on k8s certificate expiry (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking) [16:37:32] 06SRE, 06Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11783388 (10Alberto) Thank you very much for your help! I have correctly implemented the User-Agent in my LocalSettings.php for both MediaWiki core and the QuickInstantCommons... [16:39:14] RESOLVED: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:39:22] (03CR) 10Scott French: [C:03+2] deployment_server: remove absented image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [16:39:46] (03PS4) 10Scott French: deployment_server: remove absented image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) [16:40:30] (03PS1) 10Daniel Kinzler: Revert "rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267119 [16:41:02] (03CR) 10Daniel Kinzler: [C:03+2] "revert undeployed change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267119 (owner: 10Daniel Kinzler) [16:43:22] (03CR) 10Scott French: [C:03+2] deployment_server: remove absented 
image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [16:44:00] (03Merged) 10jenkins-bot: Revert "rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267119 (owner: 10Daniel Kinzler) [16:45:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11783408 (10Jclark-ctr) a:05herron→03Jclark-ctr [16:45:58] (03PS1) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) [16:47:02] 10SRE-Access-Requests, 06Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189 (10prabhat) 03NEW [16:47:35] (03PS1) 10Herron: kafkamon: update burrow ports [puppet] - 10https://gerrit.wikimedia.org/r/1267121 (https://phabricator.wikimedia.org/T418858) [16:47:47] 10SRE-Access-Requests, 06Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783451 (10prabhat) [16:49:51] (03CR) 10Scott French: "Thank you both for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [16:50:07] (03CR) 10Scott French: [C:03+2] image-suggestion: remove service configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [16:52:26] 10SRE-Access-Requests, 06Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783519 (10ssingh) request and key confirmed out of band. 
[16:53:23] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3009.esams.wmnet} and A:liberica [16:54:23] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:57:02] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3009.esams.wmnet} and A:liberica [16:58:15] (03Merged) 10jenkins-bot: image-suggestion: remove service configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [16:59:30] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3008.esams.wmnet} and A:liberica [17:00:05] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700) [17:00:07] o/ [17:00:25] I'll be deploying some admin_ng changes shortly [17:02:25] !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[17:03:03] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3008.esams.wmnet} and A:liberica [17:03:30] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) [17:04:46] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [17:05:13] !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:05:34] !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:07:04] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [17:07:04] !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:08:15] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[17:08:37] * bd808 checks for things that need releasing [17:09:06] (03PS1) 10DCausse: search: add space-discount for wikidata custom prefix search profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427) [17:09:09] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton) [17:09:17] nothing for my window this week [17:09:39] (03PS4) 10Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - 10https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) [17:10:12] (03CR) 10CI reject: [V:04-1] ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - 10https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [17:10:34] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:10:37] (03CR) 10Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [17:10:48] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:11:08] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:11:20] (03PS5) 10Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - 10https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) [17:11:31] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[17:11:50] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:12:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:12:12] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:12:40] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:13:56] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:14:49] (03CR) 10Scott French: "Thanks for the review!" [dns] - 10https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [17:15:32] (03CR) 10Scott French: [C:03+2] wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [17:15:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [17:16:11] !log swfrench@dns1004 START - running authdns-update [17:18:08] !log swfrench@dns1004 END - running authdns-update [17:20:27] (03PS4) 10Scott French: service: remove image-suggestion [puppet] - 10https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) [17:26:28] 10SRE-Access-Requests, 06Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783746 (10prabhat) [17:27:48] alright, I believe I'm done with my side of this window [17:28:10] (03PS1) 10Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - 10https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) [17:28:39] (03CR) 10CI reject: [V:04-1] cassandra-dev: add ferm srange for k8s staging [puppet] - 10https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: 10Eevans) 
[17:29:04] (03PS1) 10Snwachukwu: Add rest gateway routes for video_plays path. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) [17:31:23] (03CR) 10Dzahn: [V:03+1 C:03+2] phabricator: Set a custom default-mail-address for the test instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: 10A smart kitten) [17:31:54] (03CR) 10Mforns: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:32:10] (03PS2) 10Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - 10https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) [17:35:42] (03PS1) 10Scott French: fixtures: clean up reference to image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) [17:36:02] (03CR) 10Snwachukwu: [C:03+2] Add rest gateway routes for video_plays path. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:36:07] (03PS3) 10Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - 10https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) [17:36:12] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: 10Eevans) [17:36:51] (03PS1) 10Ssingh: admin: update SSH key for ptiwary [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) [17:36:54] (03CR) 10Snwachukwu: [C:03+2] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:37:00] (03CR) 10Snwachukwu: [V:03+2 C:03+2] Add rest gateway routes for video_plays path. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:39:23] (03PS3) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) [17:39:32] (03CR) 10Eevans: [C:03+2] cassandra-dev: add ferm srange for k8s staging [puppet] - 10https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: 10Eevans) [17:39:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783799 (10Jgreen) 05Open→03Resolved boxes are imaged, in replication, and ready for traffic once pfw policy is done [17:40:49] (03CR) 10CI reject: [V:04-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking) [17:42:20] (03CR) 10Ottomata: [C:03+1] Add rest gateway routes for video_plays path. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:42:35] (03CR) 10Ssingh: "Request verified out of band, please feel free to do an additional check." [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh) [17:44:20] (03CR) 10Ayounsi: "That's a follow up from an email that was sent to noc@ from a local ISP." [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [17:44:27] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783815 (10HShaikh) As prabhat's manager I approve this request. [17:45:50] (03CR) 10Ssingh: [C:03+1] "Ah I see it now -- my bad. Thanks." 
[dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [17:46:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:47:50] (03PS1) 10Snwachukwu: Add rest gateway routes for video_plays path production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) [17:49:08] (03CR) 10Dzahn: [V:03+1 C:03+2] "I can see in compiler how this changes things on new instance "integration-agent-docker-1070" just created on https://phabricator.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: 10Hashar) [17:50:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783859 (10Jgreen) [17:54:07] 06SRE, 10DNS, 06Infrastructure-Foundations, 10netbox, and 2 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11783873 (10ssingh) Thanks for fixing it but I agree that we need an alert for this otherwise we will miss this again. [17:55:40] (03CR) 10Snwachukwu: [C:03+2] Add rest gateway routes for video_plays path production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:56:20] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop confirmed on contint prod hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: 10Hashar) [17:57:43] (03Merged) 10jenkins-bot: Add rest gateway routes for video_plays path production. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:58:30] (03PS4) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) [17:59:52] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [18:00:10] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [18:00:29] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [18:00:35] (03CR) 10CI reject: [V:04-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking) [18:00:48] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [18:01:24] (03CR) 10Jasmine: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [18:01:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:04:40] (03PS5) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) [18:05:15] (03CR) 10Brouberol: [C:03+1] fixtures: clean up reference to image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [18:06:00] (03CR) 10CI reject: [V:04-1] data-platform: Add alerts for opensearch on k8s certificate expiry 
[alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking) [18:07:21] (03CR) 10Muehlenhoff: "One validation is fine, you can either go ahead and merge it or I'll take care of it via Clinic duty, either is fine." [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh) [18:07:35] (03PS6) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) [18:14:19] (03CR) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking) [18:16:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783930 (10Jclark-ctr) a:05Jgreen→03Jclark-ctr [18:24:15] (03CR) 10SBassett: [C:03+2] "Oh, whoops, I see the commit msg says "miscweb(research-landing-page): bump image version". 
Just to be clear, this change set is for" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174750 (https://phabricator.wikimedia.org/T399132) (owner: 10Jly) [18:24:47] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5006.eqsin.wmnet} and A:liberica [18:25:57] FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:28:03] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5006.eqsin.wmnet} and A:liberica [18:28:50] (03PS3) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer) [18:29:53] port 80!? [18:30:57] yeah I'm not sure why it's firing... sort of seems ok? [18:30:57] RESOLVED: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:31:19] https://phabricator.wikimedia.org/P90248 [18:31:30] (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [18:31:31] (03CR) 10Scott French: [C:03+2] fixtures: clean up reference to image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [18:31:38] topranks: yeah it resolved. 
haven't looked very deeply on what happened but can't see anything obvious [18:31:56] same here [18:31:56] I gotta say the probe dashboard is absolutely incomprehensible to me, any time I have to visit it [18:32:09] I don't see any signs of general connectivity issues [18:32:25] and ipv6 only? [18:32:30] seems so yeah [18:33:06] yeah, tbh that is further evidence it is just an outlier failed connection, for whatever reason [18:33:08] topranks: yep. we should improve that. it defaults to "All" [18:33:11] (03Merged) 10jenkins-bot: fixtures: clean up reference to image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [18:33:16] rather than a systemic problem like everyone is failing to connect [18:33:26] 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784038 (10gmodena) >>! In T422141#11782776, @MoritzMuehlenhoff wrote: > What kind of access is needed? root ac... [18:33:51] don't see any specific signs of user-visible impact from graphs [18:34:21] (03CR) 10SBassett: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer) [18:34:21] (03CR) 10Ssingh: "Thanks, I will merge if I can find a reviewer otherwise feel free to take it later." [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh) [18:35:37] 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784042 (10MoritzMuehlenhoff) >>! In T422141#11784038, @gmodena wrote: >>>! In T422141#11782776, @MoritzMuehlen... 
[18:35:58] (03CR) 10Reedy: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer) [18:37:05] (03CR) 10Ssingh: [C:03+1] "Two reviews by the sec team, merging." [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer) [18:37:06] (03CR) 10Ssingh: [C:03+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer) [18:37:12] haha [18:37:13] consensus! [18:37:39] Reedy: who am I to say no to two +1s?! [18:38:57] (03CR) 10Muehlenhoff: [C:03+1] "LGMT syntax-wise" [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh) [18:39:25] https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=probe_success%7Baddress%3D%222620%3A0%3A861%3Aed1a%3A%3A1%22%2C%20instance%3D%22text%3A80%22%7D%5B20m%5D&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h [18:39:33] I really don't understand why that fired, but anyway [18:40:27] topranks: doesn't add up yep [18:40:32] anyway nothing to do here I feel [18:40:49] yep enough other stuff to worry about [18:40:58] yeah, this feels like a one time blip, and if it happens again, we can still correlate further [18:41:21] (03CR) 10Ssingh: [C:03+2] admin: update SSH key for ptiwary [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh) [18:41:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11784073 (10Jclark-ctr) a:05Jclark-ctr→03BTullis [18:41:52] (03CR) 10Alex.sanford: [C:03+1] Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer) [18:42:19] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11784074 (10Jclark-ctr) a:05Jclark-ctr→03BTullis [18:44:03] 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784075 (10gmodena) >>! In T422141#11784042, @MoritzMuehlenhoff wrote: > We don't have a specific access group... [18:44:32] (03PS1) 10Ottomata: dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) [18:45:52] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:46:50] (03CR) 10Ottomata: [C:03+2] dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [18:49:09] (03Merged) 10jenkins-bot: dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [18:51:31] cmooney@cumin1003 netbox (PID 2341745) is awaiting input [18:51:57] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for new lumen 100g transport - cmooney@cumin1003" [18:51:58] (03PS1) 10Reedy: Undeploy Extension:StopForumSpam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) [18:52:24] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for new lumen 
100g transport - cmooney@cumin1003" [18:52:24] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:52:28] (03PS1) 10Cathal Mooney: Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - 10https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) [18:53:17] (03CR) 10Ssingh: [C:03+1] Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - 10https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) (owner: 10Cathal Mooney) [18:54:38] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - 10https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) (owner: 10Cathal Mooney) [18:54:48] (03PS1) 10Ottomata: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) [18:54:56] (03CR) 10CI reject: [V:04-1] dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [18:55:10] !log cmooney@dns2005 START - running authdns-update [18:55:19] (03PS2) 10Ottomata: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) [18:56:09] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784108 (10AWesterinen-WMF) Retried ... 
no change [18:56:34] !log cmooney@dns2005 END - running authdns-update [18:56:53] (03CR) 10Jforrester: [C:03+1] Undeploy Extension:StopForumSpam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: 10Reedy) [18:57:10] (03CR) 10Ottomata: [C:03+2] dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [18:59:14] (03Merged) 10jenkins-bot: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [19:00:25] (03CR) 10Dzahn: [C:03+2] ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - 10https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [19:01:19] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:01:23] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:02:03] (03PS3) 10Elukey: opensearch-semantic-search-test: Add to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [19:04:43] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [19:04:47] (03CR) 10Scott French: [C:03+2] service: remove image-suggestion [puppet] - 10https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [19:06:31] (03PS1) 10Cathal Mooney: Management routers: set autonomous system number [homer/public] - 10https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238) [19:09:11] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on releases2003.codfw.wmnet with reason: T418109 [19:09:14] T418109: Update Jenkins hosts from Java 17 to Java 21 - https://phabricator.wikimedia.org/T418109 [19:09:30] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784127 (10MoritzMuehlenhoff) You still need to request "wmf" at https://idm.wikimedia.org/permissions/, so far you only r... 
[19:12:13] (03PS1) 10Dzahn: jenkins: add profile::ci::docker to role [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109) [19:16:13] 06SRE, 07Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11784146 (10Scott_French) [19:16:44] (03PS1) 10Ottomata: mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) [19:19:50] (03CR) 10Ottomata: [C:03+2] mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [19:21:50] (03Merged) 10jenkins-bot: mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [19:23:41] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784167 (10AWesterinen-WMF) I tried to do that, but see no option for wmf. Only "logstash", "airflow" and "spiderpig". [19:24:12] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:24:16] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [19:33:06] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11784179 (10ssingh) 05Open→03Resolved a:03ssingh Should now be rolled out everywhere, let us know if you have any issues with access. 
[19:35:49] (03PS1) 10Dduvall: zuul: Move cross-profile references to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) [19:35:51] (03PS1) 10Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207) [19:45:21] (03PS2) 10Dduvall: zuul: Move cross-profile references to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) [19:45:21] (03PS2) 10Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207) [19:46:02] (03CR) 10Dduvall: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall) [19:48:46] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards Has improved - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:56:29] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:56:32] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:56:48] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:56:50] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:57:36] Is anyone here waiting for the UTC late backport window? And are there any blockers to the window? 
[19:57:46] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:57:48] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T2000) [20:00:05] nya_1F616EMO and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] o/ [20:00:26] Im here~! [20:00:49] * nya_1F616EMO prays for a deployer to show up [20:02:56] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:03:03] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:04:18] (03PS4) 10Bking: opensearch-semantic-search-test: Add to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) [20:05:17] (03CR) 10Bking: opensearch-semantic-search-test: Add to services proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [20:05:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:12] (03PS1) 10Ottomata: mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) 
[20:07:58] (03CR) 10Ottomata: [C:03+2] mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [20:08:04] (03CR) 10Ottomata: [V:03+2 C:03+2] mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [20:08:51] It seems like we're out of luck? [20:09:35] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:09:44] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [20:12:35] 10ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11784255 (10phaultfinder) [20:13:27] (03PS5) 10Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) [20:13:27] (03CR) 10Bking: "Thanks for the course correction! I think we have a path forward here; we've added envoy TLS termination in 1248865 and monitoring for the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [20:13:43] I'd offer to do it, but there was a big breakage of the ability to scap deploy things this morning, so it might be a good idea to have a real deployer present who could recover from an error if it happened. 
[20:13:57] (03Abandoned) 10Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [20:15:00] One of my patches is a time-specific logo update for zhwikinews, and one is a non-time-specific SecurePoll deployment to a private wiki. I may propose to the local community to use CSS for the logo change; do you recommend doing so? [20:17:01] Feels inconvenient to deal with, given all the various logo sizes involved. [20:17:17] You mean to deploy? [20:17:38] Currently working on the CSS solution [20:17:45] (cuz there are no deployments on Fridays we all know) [20:17:52] If you and bwang don't mind, I could certainly kick off a spiderpig build with all your patches. If it breaks in the same way as it did before, it'd just fail to deploy even to testservers rather than ruining production. [20:18:08] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784278 (10VRiley-WMF) This ticket seems like it relates to another ticket https://phabricator.wikimedia.org/T413559 @Eevans this server is out of warranty, would you like us to replace this disk or leave it... [20:18:34] There's just a *chance* that it'll wedge us into a state where a releng person needs to look at things before any deploys can happen. 😅 [20:19:12] I won't let go my SecurePoll patch anyways under this state, it'd be up to you on whether to accept that zhwikinews logo change. [20:20:05] I'm fine giving it a shot. [20:20:10] bwang: Want yours in as well? [20:21:26] Wait, I found something that might be off [20:21:44] Let me check my patch for resolutions [20:22:04] Just let me know when you're happy with it, and if bwang hasn't shown up by then I can do just-yours.
[20:22:13] Ah nvm, the script did the job for me [20:22:25] It successfully reduced the resolution to 135x135, nice [20:22:33] so good to go [20:22:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [20:24:24] (03CR) 10Bking: [C:03+2] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking) [20:24:56] (03CR) 10Bking: [C:03+2] "Ben is out for the next 10 days, so I'm going to be bold and merge after addressing his concerns." [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking) [20:25:02] (03CR) 10Bking: [V:03+2 C:03+2] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking) [20:25:19] (03Merged) 10jenkins-bot: zhwikinews: 20th anniversary logo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [20:25:37] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] [20:25:40] T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165 [20:28:46] Sorry I was in a call [20:28:52] Im still here and able to help test the backport [20:29:16] (03PS2) 10Clare Ming: Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209) [20:29:22] !log kemayo@deploy1003 1f616emo, kemayo: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] synced to the testservers (see
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:29:40] nya_1F616EMO: Can you verify your change? [20:29:44] testing [20:30:44] it works, tested on vector-2022, vector, monobook, timeless. [20:31:03] I will continue the deploy, then. [20:31:06] Thanks [20:31:11] !log kemayo@deploy1003 1f616emo, kemayo: Continuing with sync [20:33:09] (03CR) 101F616EMO: "Will re-schedule in the Monday, May 04 UTC morning backport window, together with Ia1a463ba01452b76b73ff6b59b821eae9154ddf8." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO) [20:33:21] (03PS1) 101F616EMO: Revert "zhwikinews: 20th anniversary logo change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) [20:33:35] (03CR) 101F616EMO: "Will re-schedule in the Monday, May 04 UTC morning backport window, together with Iea2390c01600b5f93c7b01f5605d887541c74b02." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO) [20:33:52] 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784305 (10MoritzMuehlenhoff) >>! In T422141#11784075, @gmodena wrote: >>>! In T422141#11784042, @MoritzMuehlen... [20:35:37] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784306 (10MoritzMuehlenhoff) >>! In T420053#11784167, @AWesterinen-WMF wrote: > I tried to do that, but see no option for... 
[20:37:23] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] (duration: 11m 46s) [20:37:26] T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165 [20:37:34] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 182040496 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:38:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3815080 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:39:36] nya_1F616EMO: Okay, should be live now. [20:40:01] Nice and verified the changes through prod. [20:40:04] Thank you for your help [20:40:33] (03CR) 10Cathal Mooney: "Do we have stats for RE? Is it that much better to eqsin on average than drmrs? From the geography it's not clear to me." [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [20:43:58] nya_1F616EMO: np! [20:47:18] Hi are we still able to back port my patch? [20:47:55] bwang: sure, I can get it if you're willing to stick around until it's done. [20:48:11] Yes of course [20:48:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich) [20:51:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11784343 (10VRiley-WMF) Hey @elukey Thanks for working on this! Is there anything I can do from my end to assist with this? Let us know... 
[20:51:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11784345 (10VRiley-WMF) [20:51:52] (03Merged) 10jenkins-bot: Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich) [20:52:10] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] [20:52:13] T420490: [Logged in reader retention baseline] Launch A/A experiment - https://phabricator.wikimedia.org/T420490 [20:52:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11784348 (10VRiley-WMF) [20:53:51] !log kemayo@deploy1003 annet, kemayo: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:54:13] bwang: let me know when it's tested [20:56:36] checking now [20:57:02] (03PS1) 10DLynch: SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267204 [20:57:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267204 (owner: 10DLynch) [20:58:54] Looks good [20:59:09] Continuing, then. 
[20:59:12] !log kemayo@deploy1003 annet, kemayo: Continuing with sync [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T2100) [21:01:02] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784373 (10AWesterinen-WMF) Updated my email and requested wmf access. But, I have a further problem. I tried to ssh in... [21:01:16] Kemayo: let me know when you are done. I have a deploy but I need 15m to prep [21:01:46] Jdlrobson: Sure, I just have one more patch to get out after this, so that should fit into your timing pretty okay. [21:03:50] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] (duration: 11m 40s) [21:03:54] T420490: [Logged in reader retention baseline] Launch A/A experiment - https://phabricator.wikimedia.org/T420490 [21:04:06] bwang: Live now. 
[21:04:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267204 (owner: 10DLynch) [21:08:09] (03PS2) 10Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) [21:15:33] (03Merged) 10jenkins-bot: SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267204 (owner: 10DLynch) [21:15:47] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] [21:17:26] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:18:36] !log kemayo@deploy1003 kemayo: Continuing with sync [21:23:09] (03PS3) 10Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) [21:23:42] (03CR) 10CI reject: [V:04-1] role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [21:23:51] (03PS4) 10Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) [21:26:03] 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11784439 (10Od1n) FWIW, I'm still currently encountering this error on frwiki, and it prevents my local custom JS/CSS files from loading. Unexpectedly not loaded: * `Special:Myp... 
[21:26:25] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] [21:26:41] Jdlrobson: Sorry, the k8s deploy failed, which is making everything *fun*. [21:27:13] no worries [21:27:19] im appreciating the extra testing time :) [21:28:05] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:28:34] !log kemayo@deploy1003 kemayo: Continuing with sync [21:32:44] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] (duration: 06m 18s) [21:32:57] Jdlrobson: okay, all yours! [21:35:22] thanks! [21:35:45] (03PS1) 10Jdlrobson: Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) [21:36:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: 10Jdlrobson) [21:48:25] (03CR) 10CI reject: [V:04-1] Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: 10Jdlrobson) [21:48:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [21:49:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: 10Jdlrobson) [21:49:14] Flakey Wikibase test :( 
[21:50:31] (03Merged) 10jenkins-bot: Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: 10Jdlrobson) [21:51:01] (03CR) 10SBassett: [C:03+1] Undeploy Extension:StopForumSpam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: 10Reedy) [21:58:21] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] [21:58:24] T414882: Additional top margin for Parsoid outputs for sections with no lead - https://phabricator.wikimedia.org/T414882 [22:00:02] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:01:42] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [22:03:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:05:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:05:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:05:54] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] (duration: 07m 33s) [22:05:57] T414882: Additional top margin for Parsoid outputs for sections with no lead - https://phabricator.wikimedia.org/T414882 [22:06:51] All done. [22:08:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:10:38] 06SRE, 06ServiceOps new, 07Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11784520 (10Scott_French) Moving this into #serviceops_new, since we're probably the right team to figure out how this should b... 
[22:11:34] (03PS1) 10Eevans: Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) [22:17:35] (03CR) 10Eevans: [C:03+2] Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [22:19:31] (03Merged) 10jenkins-bot: Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [22:20:22] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [22:20:36] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [22:40:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:40:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:43:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:45:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:59:29] (03PS1) 10Eevans: 
Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) [23:02:03] (03CR) 10Eevans: [C:03+2] Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [23:03:31] FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [23:04:01] (03Merged) 10jenkins-bot: Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [23:06:01] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [23:06:07] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [23:28:38] jouncebot: nowandnext [23:28:38] No deployments scheduled for the next 6 hour(s) and 31 minute(s) [23:28:38] In 6 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260403T0600) [23:34:22] (03CR) 10Zabe: [C:03+2] Start reading from new file table in dewiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [23:35:16] (03Merged) 10jenkins-bot: Start reading from new file table in dewiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [23:35:42] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]] [23:35:45] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [23:37:19] !log zabe@deploy1003 zabe: Backport for [[gerrit:1264110|Start reading from 
new file table in dewiki and fawiki (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:37:40] !log zabe@deploy1003 zabe: Continuing with sync [23:38:23] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784707 (10Eevans) >>! In T421439#11784276, @VRiley-WMF wrote: > This ticket seems like it relates to another ticket https://phabricator.wikimedia.org/T413559 > > @Eevans this server is out of warrenty, would... [23:38:31] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [23:39:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267280 [23:39:52] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267280 (owner: 10TrainBranchBot) [23:41:52] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]] (duration: 06m 10s) [23:41:55] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [23:51:27] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267280 (owner: 10TrainBranchBot) [23:51:34] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [23:52:58] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply