[00:01:06] <wikibugs>	 (CR) Scott French: "Thanks, Raine!" [deployment-charts] - https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) (owner: Kamila Součková)
[00:09:10] <wikibugs>	 (CR) Scott French: "Thanks, Raine!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[00:56:14] <logmsgbot>	 !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:02:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 284378408 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7050408 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:35] <logmsgbot>	 !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:08:23] <logmsgbot>	 !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:09:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:11:46] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500
[01:11:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500 (owner: TrainBranchBot)
[01:18:44] <logmsgbot>	 !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:19:48] <logmsgbot>	 !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:24:09] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500 (owner: TrainBranchBot)
[01:30:13] <logmsgbot>	 !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:30:53] <logmsgbot>	 !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:41:15] <logmsgbot>	 !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[01:51:17] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:54:29] <icinga-wm>	 PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[02:00:56] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:01:29] <icinga-wm>	 RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[02:06:11] <jinxer-wm>	 FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[02:07:20] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 23s)
[02:09:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:13] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 786199704 and 51 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:47:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:09:23] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[04:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:23] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[04:55:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:42] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[05:09:37] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:16:23] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[05:33:30] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[05:51:32] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:56:17] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600).
[06:06:11] <jinxer-wm>	 FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[06:10:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:12] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "The patch looks good, but I left a comment on the comment :-)" [puppet] - https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: Bking)
[06:19:56] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 1F616EMO)
[06:29:22] <wikibugs>	 (PS2) 1F616EMO: arbcom_zhwiki: Enable SecurePoll without PII rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309)
[06:30:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:52:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:56:17] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:57:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0700).
[07:00:05] <jouncebot>	 georgekyz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:24] <georgekyz>	 Good morning folks!
[07:00:59] <georgekyz>	 I am planning to deploy my patch now, is anybody around?
[07:03:22] <georgekyz>	 I'm running it.
[07:03:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[07:04:26] <wikibugs>	 (Merged) jenkins-bot: EventStreamConfig: Add rr-multilingual prediction_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[07:05:19] <logmsgbot>	 !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]]
[07:05:22] <stashbot>	 T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892
[07:07:35] <logmsgbot>	 !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:08:03] <logmsgbot>	 !log gkyziridis@deploy1003 gkyziridis: Continuing with sync
[07:08:16] <georgekyz>	 syncing
[07:08:42] <wikibugs>	 SRE, Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11780898 (MoritzMuehlenhoff) p:Triage→Medium
[07:08:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:12:19] <logmsgbot>	 !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]] (duration: 07m 00s)
[07:12:23] <stashbot>	 T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892
[07:12:53] <georgekyz>	 the deployment finished successfully!
[07:13:09] <wikibugs>	 SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11780904 (MoritzMuehlenhoff) Was this linked in some onboarding doc that you followed? If so, it can be removed for now. We're currently reworking 2FA support in CAS and the originally...
[07:13:58] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[07:16:01] <wikibugs>	 (Merged) jenkins-bot: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[07:20:49] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11780907 (MoritzMuehlenhoff) Since Andrea is working as a contractor the tracking entry in data.yaml should use the The t...
[07:22:25] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11780912 (MoritzMuehlenhoff) In progress→Resolved a:hnowlan @MPostoronca-WMF Your access is enabled, so I'm marking this as resolved. If you run into any issues,...
[07:24:57] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:25:06] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[07:27:56] <wikibugs>	 (PS1) Jaime Nuche: ApiAuthManagerHelper: Accept fields with undefined label [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027)
[07:29:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jnuche@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027) (owner: Jaime Nuche)
[07:30:33] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 64049
[07:32:13] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 64049
[07:38:00] <logmsgbot>	 !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fea7794]: deploy to freshly reimaged wdqs host
[07:38:05] <logmsgbot>	 !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fea7794]: deploy to freshly reimaged wdqs host (duration: 00m 05s)
[07:38:07] <moritzm>	 !log purge prometheus-nginx-exporter from url downloaders, remnants of early hCaptcha rollout
[07:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:36] <wikibugs>	 (PS1) Mszwarc: Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837)
[07:40:42] <wikibugs>	 (Merged) jenkins-bot: ApiAuthManagerHelper: Accept fields with undefined label [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027) (owner: Jaime Nuche)
[07:41:06] <logmsgbot>	 !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]]
[07:41:09] <stashbot>	 T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given - https://phabricator.wikimedia.org/T422027
[07:41:17] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:42:21] <Msz2001>	 I'll deploy a config change if there's nothing going on
[07:42:42] <Msz2001>	 (I see it is, I'll wit)
[07:42:45] <Msz2001>	 wait*
[07:43:08] <logmsgbot>	 !log jnuche@deploy1003 jnuche: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:43:33] <logmsgbot>	 !log jnuche@deploy1003 jnuche: Continuing with sync
[07:46:17] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:46:54] <wikibugs>	 (CR) Kosta Harlan: [C:+1] Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
[07:47:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (, T421714) xfer wdqs-all from wdqs2016.codfw.wmnet -> wdqs1027.eqiad.wmnet, repooling both afterwards
[07:47:44] <stashbot>	 T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714
[07:47:55] <logmsgbot>	 !log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]] (duration: 06m 39s)
[07:47:58] <stashbot>	 T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given - https://phabricator.wikimedia.org/T422027
[07:48:54] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, netops, Sustainability (Incident Followup): ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11780961 (ayounsi)
[07:49:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
[07:50:16] <wikibugs>	 (Merged) jenkins-bot: Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
[07:50:17] <wikibugs>	 (PS1) Kevin Bazira: ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350)
[07:50:40] <logmsgbot>	 !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]]
[07:50:43] <stashbot>	 T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
[07:50:56] <jinxer-wm>	 RESOLVED: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[07:51:23] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:52:23] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:52:40] <logmsgbot>	 !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:53:58] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/1266242 (owner: Muehlenhoff)
[07:54:14] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[07:55:55] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[07:56:39] <logmsgbot>	 !log mszwarc@deploy1003 mszwarc: Continuing with sync
[07:58:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:00:05] <jouncebot>	 jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0800)
[08:00:53] <logmsgbot>	 !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]] (duration: 10m 13s)
[08:00:57] <stashbot>	 T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
[08:01:15] <jnuche>	 morning, I will begin the train shortly
[08:01:58] <wikibugs>	 (PS1) Arnaudb: apt-staging: error handling for restricted projects [puppet] - https://gerrit.wikimedia.org/r/1266920 (https://phabricator.wikimedia.org/T422070)
[08:02:03] <wikibugs>	 (CR) Arnaudb: [C:+2] apt-staging: error handling for restricted projects [puppet] - https://gerrit.wikimedia.org/r/1266920 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
[08:03:25] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480)
[08:03:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[08:04:19] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[08:07:49] <wikibugs>	 (CR) Ozge: [C:+1] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:08:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:10:28] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.22  refs T420480
[08:10:31] <stashbot>	 T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480
[08:11:03] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:11:59] <wikibugs>	 (PS1) Muehlenhoff: Update email record for andreawest [puppet] - https://gerrit.wikimedia.org/r/1266931 (https://phabricator.wikimedia.org/T420053)
[08:12:45] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11781036 (MoritzMuehlenhoff) >>! In T420053#11778139, @AWesterinen wrote: > I still have the error,...
[08:13:10] <wikibugs>	 (Merged) jenkins-bot: ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:14:38] <wikibugs>	 (PS4) Volans: webproxies: allow cloudcumin to openstack [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360)
[08:14:38] <wikibugs>	 (CR) Volans: "PCC available at:" [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
[08:16:16] <wikibugs>	 ops-eqiad, DBA, DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111 (FCeratto-WMF) NEW
[08:16:17] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:16:24] <wikibugs>	 (PS1) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:16:49] <wikibugs>	 (PS2) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:17:10] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:17:56] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM, nice!" [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
[08:18:38] <wikibugs>	 (PS1) Arnaudb: aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070)
[08:18:41] <wikibugs>	 (CR) Arnaudb: [C:+2] aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
[08:19:02] <wikibugs>	 (CR) CI reject: [V:-1] deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:19:21] <wikibugs>	 (PS3) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:19:38] <wikibugs>	 (PS4) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:20:00] <wikibugs>	 (Merged) jenkins-bot: aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
[08:20:57] <wikibugs>	 (CR) Ayounsi: [C:+1] "lgtm, pcc looks good too, to be carefully rolled out/tested." [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
[08:21:07] <wikibugs>	 (PS5) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:23:15] <wikibugs>	 (CR) CI reject: [V:-1] deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:24:10] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8368/co" [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:24:15] <wikibugs>	 (PS6) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
[08:30:22] <volans>	 !log briefly disabling puppet on P:installserver::proxy to deploy g/1266885
[08:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:21] <wikibugs>	 (CR) Volans: [C:+2] webproxies: allow cloudcumin to openstack [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
[08:33:26] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:40:18] <wikibugs>	 (CR) Brouberol: [C:+2] deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
[08:40:45] <XioNoX>	 slyngs, effie, I'm going to reboot mr1-esams for a software upgrade, it will go down for up to 20min, device itself is downtimed, but there might be some alerting noise from esams mgmt being unreachable
[08:41:15] <jinxer-wm>	 FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:41:17] <jinxer-wm>	 FIRING: [3x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:41:23] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[08:41:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:00] <XioNoX>	 !log reboot mr1-esams - T416450
[08:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:04] <stashbot>	 T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
[08:42:36] <effie>	 XioNoX: thank you, break a leg
[08:43:59] <icinga-wm>	 PROBLEM - ps1-by27-esams-infeed-load-tower-B-single-phase on ps1-by27-esams is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:44:20] <wikibugs>	 SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781126 (atsuko) Thanks, I'll update the onboarding.
[08:44:32] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: sync
[08:44:42] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781127 (10atsuko) a:03atsuko
[08:44:45] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:44:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90206 and previous config saved to /var/cache/conftool/dbconfig/20260402-084452-fceratto.json
[08:44:56] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:45:07] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[08:45:23] <icinga-wm>	 PROBLEM - Host ps1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100%
[08:45:23] <icinga-wm>	 PROBLEM - Host ps1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100%
[08:45:32] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[08:45:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and mr1-esams (10.80.127.5) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDow
[08:46:09] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[08:46:15] <jinxer-wm>	 FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:46:17] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781130 (10atsuko)
[08:47:08] <wikibugs>	 (03PS1) 10Gkyziridis: ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266941
[08:47:29] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781133 (10atsuko)
[08:47:50] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781135 (10atsuko) 05Open→03Declined
[08:49:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:49:49] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266941 (owner: 10Gkyziridis)
[08:49:54] <moritzm>	 !log added Atsuko to the cn=ops LDAP group T421860
[08:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:58] <stashbot>	 T421860: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860
[08:50:23] <jinxer-wm>	 FIRING: GnmiTargetDown: asw1-bw27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[08:50:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and mr1-esams (10.80.127.5) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPD
[08:50:47] <icinga-wm>	 RECOVERY - Host ps1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.26 ms
[08:50:47] <icinga-wm>	 RECOVERY - Host ps1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.25 ms
[08:51:15] <jinxer-wm>	 FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:51:27] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266941 (owner: 10Gkyziridis)
[08:51:32] <XioNoX>	 router is back up - 10min downtime
[08:52:15] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266941 (owner: 10Gkyziridis)
[08:53:34] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11781141 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @atsuko Your SSH access should now be working. You can e.g. try to connect to cumin1003.e...
[08:54:13] <wikibugs>	 (03Merged) 10jenkins-bot: ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266941 (owner: 10Gkyziridis)
[08:54:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:55:23] <jinxer-wm>	 RESOLVED: GnmiTargetDown: asw1-bw27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[08:55:27] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:55:41] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:56:15] <jinxer-wm>	 RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:57:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Update Cumin alias for contint to also cover the spun-off Trixie role [puppet] - 10https://gerrit.wikimedia.org/r/1266215 (owner: 10Muehlenhoff)
[08:58:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:08:30] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards Has been acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[09:12:31] <wikibugs>	 (03PS1) 10Klausman: admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266947
[09:17:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90207 and previous config saved to /var/cache/conftool/dbconfig/20260402-091743-fceratto.json
[09:17:47] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:19:48] <moritzm>	 !log upgrading Envoy on the config-master servers to 1.35.9 T419637 T410975
[09:19:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:58] <stashbot>	 T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637
[09:19:59] <stashbot>	 T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[09:21:37] <wikibugs>	 (03PS1) 10Gkyziridis: ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266948
[09:23:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1265258 (owner: 10Slyngshede)
[09:23:51] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266948 (owner: 10Gkyziridis)
[09:25:57] <wikibugs>	 (03PS1) 10Volans: Add missing includes from Netbox exported data [dns] - 10https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115)
[09:26:07] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266948 (owner: 10Gkyziridis)
[09:27:36] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:27:42] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:27:52] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P90208 and previous config saved to /var/cache/conftool/dbconfig/20260402-092751-fceratto.json
[09:28:30] <jinxer-wm>	 RESOLVED: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards Has been acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[09:29:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master-codfw
[09:29:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
[09:29:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
[09:30:35] <wikibugs>	 (03PS4) 10Arnaudb: gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - 10https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909)
[09:33:23] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: update sshd timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1266149 (https://phabricator.wikimedia.org/T417996)
[09:33:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master-codfw
[09:33:47] <wikibugs>	 (03Abandoned) 10Arnaudb: gerrit: update timeouts for gitiles [puppet] - 10https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904) (owner: 10Arnaudb)
[09:37:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Obsolete airflow-search-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1242407 (owner: 10Muehlenhoff)
[09:38:00] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P90209 and previous config saved to /var/cache/conftool/dbconfig/20260402-093759-fceratto.json
[09:39:25] <wikibugs>	 (03PS5) 10Arnaudb: gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - 10https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909)
[09:39:29] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] image-suggestion: remove service configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[09:39:45] <wikibugs>	 (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb)
[09:40:13] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] profile::service_proxy::envoy: remove mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) (owner: 10Elukey)
[09:40:30] <wikibugs>	 (03PS2) 10Elukey: profile::service_proxy::envoy: remove mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468)
[09:41:18] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - 10https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb)
[09:41:50] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: profile::service_proxy::envoy: remove mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) (owner: 10Elukey)
[09:43:42] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "thanks!" [dns] - 10https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) (owner: 10Volans)
[09:45:33] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[09:45:42] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[09:46:56] <jinxer-wm>	 FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare
[09:47:41] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[09:48:02] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[09:48:03] <wikibugs>	 (03Abandoned) 10Majavah: hieradata: Add dumps.wikimedia.org CDN mapping [puppet] - 10https://gerrit.wikimedia.org/r/1237194 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah)
[09:48:09] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90210 and previous config saved to /var/cache/conftool/dbconfig/20260402-094808-fceratto.json
[09:48:11] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:48:26] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[09:48:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90211 and previous config saved to /var/cache/conftool/dbconfig/20260402-094834-fceratto.json
[09:48:37] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[09:48:58] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[09:53:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Obsolete airflow-wmde-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1266959
[09:58:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Update email record for andreawest [puppet] - 10https://gerrit.wikimedia.org/r/1266931 (https://phabricator.wikimedia.org/T420053) (owner: 10Muehlenhoff)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1000)
[10:00:04] <jouncebot>	 dues: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:00:25] <wikibugs>	 (03CR) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler)
[10:02:05] <wikibugs>	 (03PS1) 10Volans: cumin: use webproxy to connect to openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360)
[10:02:05] <wikibugs>	 (03CR) 10Volans: "PCC available for cloudcumin1001 here:" [puppet] - 10https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[10:03:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266229 (owner: 10Muehlenhoff)
[10:03:35] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: update upstream_idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1266962 (https://phabricator.wikimedia.org/T421827)
[10:03:38] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: update upstream_idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1266962 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[10:04:15] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:04:17] <wikibugs>	 (03PS1) 10Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963
[10:04:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (owner: 10Volans)
[10:05:23] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[10:05:32] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[10:05:34] <wikibugs>	 (03PS2) 10Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360)
[10:05:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[10:08:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[10:09:15] <jinxer-wm>	 FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:09:42] <wikibugs>	 (03PS1) 10Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749)
[10:10:18] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[10:10:38] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler)
[10:10:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey)
[10:11:19] <wikibugs>	 (03PS3) 10Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360)
[10:11:41] <wikibugs>	 (03CR) 10Volans: [C:03+2] cumin: use webproxy to connect to openstack APIs [puppet] - 10https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[10:12:36] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[10:12:49] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler)
[10:13:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[10:14:30] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[10:15:11] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker-exp2001.codfw.wmnet
[10:16:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[10:16:52] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:16:54] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[10:17:00] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:17:05] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:17:14] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:17:19] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:17:24] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[10:17:32] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[10:17:36] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[10:17:40] <effie>	 !incidents
[10:17:40] <sirenbot>	 7803 (UNACKED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[10:17:46] <effie>	 !ack 7803
[10:17:46] <sirenbot>	 7803 (ACKED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[10:17:50] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[10:17:55] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[10:18:03] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[10:18:08] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:18:24] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:18:28] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:18:45] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:18:46] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:18:50] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:19:11] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:19:15] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[10:19:17] <moritzm>	 !log installing freetype security updates
[10:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:25] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker-exp2001.codfw.wmnet
[10:19:27] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[10:19:30] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:19:31] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:19:35] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[10:19:36] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[10:21:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90212 and previous config saved to /var/cache/conftool/dbconfig/20260402-102105-fceratto.json
[10:21:09] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[10:21:41] <wikibugs>	 (03PS2) 10Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749)
[10:22:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ...
[10:22:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[10:23:16] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: 10Mhorsey)
[10:24:44] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11781519 (10Volans) Given this has been moved to the backlog I'll leave here a comment for our future selves: i...
[10:26:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 166195784 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:27:29] <wikibugs>	 (03PS1) 10Hashar: wm-checks-api: add tag for PostgreSQL jobs [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1266965
[10:27:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[10:28:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3533304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:30:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:30:41] <wikibugs>	 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781562 (10Peachey88)
[10:31:02] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:31:12] <wikibugs>	 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781588 (10MBH) Many such servers: 26, 31. When just opening pages for read.
[10:31:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P90213 and previous config saved to /var/cache/conftool/dbconfig/20260402-103113-fceratto.json
[10:31:25] <wikibugs>	 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781591 (10Peachey88)
[10:31:27] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:32:19] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[10:33:27] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[10:34:45] <wikibugs>	 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781642 (10Thryduulf) I've been experiencing these errors intermittently on English Wikipedia today, but only on trying to save edits. Each time trying again has resulted in the save being successful.
[10:37:41] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:38:10] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler)
[10:38:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler)
[10:39:14] <wikibugs>	 (03PS5) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581)
[10:39:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1148 - https://phabricator.wikimedia.org/T421892#11781672 (10Jclark-ctr) 05Open→03Declined This automated ticket was opened by mistake; it was still being worked on in T411919
[10:39:44] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler)
[10:40:02] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:41:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P90214 and previous config saved to /var/cache/conftool/dbconfig/20260402-104121-fceratto.json
[10:41:51] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler)
[10:41:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421970#11781681 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr rebalanced
[10:43:18] <wikibugs>	 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781698 (10Aklapper)
[10:43:53] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[10:44:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 76721280 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:45:00] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:45:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11781704 (10Jclark-ctr) 05Open→03Resolved replaced failed psu Outbound ticket for psu 1-258638557493
[10:45:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3553128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:45:43] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:48:21] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:48:23] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:48:23] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:48:23] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:48:28] <A_smart_kitten>	 fwiw I just got 'cannot access the database: database servers in cluster31 are overloaded' when trying to save an edit on metawiki. worked fine on the second attempt.
[10:48:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 298909248 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:49:26] <A_smart_kitten>	 oh i see it's already known, apologies :)
[10:49:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4010680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:49:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:49:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:50:33] <wikibugs_>	 06SRE, 06DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781731 (10Wellverywell) p:05Triage→03Unbreak!
[10:50:41] <wikibugs>	 10SRE-Access-Requests, 06Data-Platform-SRE, 06Wikidata Platform Team: Request: wdqs shell access for user @AWesterinen-WMF - https://phabricator.wikimedia.org/T422141 (10gmodena) 03NEW
[10:51:30] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90215 and previous config saved to /var/cache/conftool/dbconfig/20260402-105129-fceratto.json
[10:51:33] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[10:51:35] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[10:51:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90216 and previous config saved to /var/cache/conftool/dbconfig/20260402-105142-fceratto.json
[10:52:14] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781750 (10RhinosF1)
[10:52:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:54:13] <wikibugs>	 10SRE-Access-Requests, 06Data-Platform-SRE, 06Wikidata Platform Team: Request: wdqs shell access for user AWesterinen-WMF - https://phabricator.wikimedia.org/T422141#11781774 (10gmodena)
[10:56:57] <wikibugs>	 10SRE-Access-Requests, 06Data-Platform-SRE, 06Wikidata Platform Team: Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11781779 (10gmodena)
[10:57:52] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781783 (101F616EMO) I experienced such errors when diffing and saving edits.
[10:58:15] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781785 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty.   Replaced Drive slot 16  with matching 8tb sata drive
[10:58:45] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781790 (10Ladsgroup) We are on it.
[10:59:47] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781793 (101F616EMO) Should I expect the coming backport window be cancelled or delayed due to this incident?
[11:00:25] <wikibugs>	 (03PS4) 10Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213)
[11:00:25] <wikibugs>	 (03PS1) 10Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213)
[11:01:28] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781814 (10RhinosF1) >>! In T422130#11781793, @1F616EMO wrote: > Should I expect the coming backport window be cancelled or delayed due to this incident?  Very likely yes. A dep...
[11:02:00] <wikibugs>	 (03CR) 10Btullis: [C:04-1] "Set to -1 pending the review by Infrastructure Foundations." [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis)
[11:04:16] <wikibugs>	 (03PS1) 10Esanders: Fix suggestion mode availability check [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143)
[11:04:20] <wikibugs>	 (03CR) 10Muehlenhoff: Add analytics-fr-tech system user and corresponding groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis)
[11:05:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: 10Esanders)
[11:07:26] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781846 (10Jclark-ctr) After replacement Server showed drive as foreign. continued to fail to clear foreign config.    Replaced drive again with new seagate 8tb sata drive
[11:07:48] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781847 (101F616EMO) >>! In T422130#11781814, @RhinosF1 wrote: >>>! In T422130#11781793, @1F616EMO wrote: >> Should I expect the coming backport window be cancelled or delayed d...
[11:13:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 11.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:14:15] <jinxer-wm>	 FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:20:54] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781890 (10Lucas_Werkmeister_WMDE)
[11:21:41] <jinxer-wm>	 FIRING: [8x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:24:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90217 and previous config saved to /var/cache/conftool/dbconfig/20260402-112421-fceratto.json
[11:24:25] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:26:41] <jinxer-wm>	 FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[11:26:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[11:27:00] <effie>	 !incidents
[11:27:00] <sirenbot>	 7803 (ACKED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[11:27:23] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:27:49] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11781903 (10BTullis)
[11:27:50] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781904 (10Jclark-ctr) 05Open→03Resolved
[11:28:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11781909 (10Jclark-ctr) a:03Jclark-ctr
[11:29:02] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic)
[11:32:25] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 6.702e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[11:32:45] <jinxer-wm>	 FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[11:34:30] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P90218 and previous config saved to /var/cache/conftool/dbconfig/20260402-113429-fceratto.json
[11:34:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 97599648 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:35:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11781922 (10Jclark-ctr) updating BIOS firmware and expander firmware due to comms error on backplane, and iDRAC firmware additionally
[11:35:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3557000 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:36:54] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: bump upstream_idle_timeout to 900s [puppet] - 10https://gerrit.wikimedia.org/r/1266989 (https://phabricator.wikimedia.org/T421904)
[11:37:12] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11781927 (10BTullis) I have validated all SSH keys via out-of...
[11:37:15] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: bump upstream_idle_timeout to 900s [puppet] - 10https://gerrit.wikimedia.org/r/1266989 (https://phabricator.wikimedia.org/T421904) (owner: 10Arnaudb)
[11:37:23] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[11:38:23] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:39:19] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11781930 (10Gehel) p:05Triage→03High
[11:42:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:44:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P90219 and previous config saved to /var/cache/conftool/dbconfig/20260402-114437-fceratto.json
[11:47:45] <jinxer-wm>	 FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[11:48:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:48:23] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781968 (10Thryduulf) I've just encountered what I presume is the same error, this time when trying to use the reply tool [6a4d47bf-961e-4513-9b1f-c6970e11f156] Caught exception...
[11:48:23] <wikibugs>	 (03PS5) 10Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213)
[11:48:24] <wikibugs>	 (03PS2) 10Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - 10https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213)
[11:51:15] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest gateway: refactor ratelimit integration test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266995
[11:51:51] <jinxer-wm>	 RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[11:52:11] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:52:17] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254925 (owner: 10PipelineBot)
[11:52:24] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254926 (owner: 10PipelineBot)
[11:52:34] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241846 (owner: 10PipelineBot)
[11:52:44] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258153 (owner: 10PipelineBot)
[11:52:45] <jinxer-wm>	 RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[11:52:55] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254927 (owner: 10PipelineBot)
[11:53:04] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1246819 (owner: 10PipelineBot)
[11:54:15] <jinxer-wm>	 RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:54:30] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[11:54:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90220 and previous config saved to /var/cache/conftool/dbconfig/20260402-115446-fceratto.json
[11:54:50] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:55:03] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance
[11:55:12] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90221 and previous config saved to /var/cache/conftool/dbconfig/20260402-115511-fceratto.json
[11:59:00] <edsanders>	 I have a high visibility UBN in for the deployment window - just waiting for it to merge
[11:59:59] <wikibugs>	 (03PS1) 10Brouberol: deployment_server: tweak the labels on opensearch_k8s_master_cert_expiry_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1266999 (https://phabricator.wikimedia.org/T418175)
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1200)
[12:02:02] <edsanders>	 ah - timezone change - the window starts in one hour
[12:02:15] <jinxer-wm>	 FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:03:19] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] deployment_server: tweak the labels on opensearch_k8s_master_cert_expiry_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1266999 (https://phabricator.wikimedia.org/T418175) (owner: 10Brouberol)
[12:05:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:06:41] <jinxer-wm>	 FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:07:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:09:35] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1373.eqiad.wmnet with OS trixie
[12:09:47] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1374.eqiad.wmnet with OS trixie
[12:09:57] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1373
[12:09:57] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1373
[12:10:08] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1374
[12:10:08] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1374
[12:10:47] <p858snake|cloud>	 edsanders: fyi there is an incident at the moment (T422130) so the window might be affected
[12:10:48] <stashbot>	 T422130: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130
[12:11:02] <wikibugs>	 (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková)
[12:11:31] <wikibugs>	 (03CR) 10Volans: [C:03+2] Add missing includes from Netbox exported data [dns] - 10https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) (owner: 10Volans)
[12:11:41] <jinxer-wm>	 FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:11:41] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:11:57] <logmsgbot>	 !log volans@dns1004 START - running authdns-update
[12:12:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:12:19] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266947 (owner: 10Klausman)
[12:12:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782064 (10Jclark-ctr) 05Open→03Resolved ` A configuration related issue on the device Backplane is resolved. `
[12:13:46] <logmsgbot>	 !log volans@dns1004 END - running authdns-update
[12:13:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:14:26] <edsanders>	 p858snake I'd like to start my deployment asap, is everything on hold at the moment?
[12:16:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782090 (10FCeratto-WMF) Thanks!
[12:16:41] <jinxer-wm>	 FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:17:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11782094 (10Jclark-ctr) @herron  can you assist with updating puppet on this install ticket ?
[12:18:38] <edsanders>	 Rhoni
[12:18:46] <edsanders>	 *typo
[12:18:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:19:02] <effie>	 !incidents
[12:19:02] <sirenbot>	 7804 (ACKED)  ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
[12:19:03] <sirenbot>	 7803 (RESOLVED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[12:19:12] <edsanders>	 RhinosF1: is there any chance of getting a UBN backported, despite T422130?
[12:19:13] <stashbot>	 T422130: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130
[12:19:32] <RhinosF1>	 edsanders: no idea why you are asking me
[12:19:32] <edsanders>	 (this: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1266984)
[12:19:39] <edsanders>	 I saw you commented on the incident task
[12:19:42] <RhinosF1>	 You need to ask the IC
[12:19:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[12:19:50] <RhinosF1>	 I suggest in #wikimedia-sre
[12:19:53] <edsanders>	 Thanks
[12:19:53] <RhinosF1>	 Much quieter there
[12:20:01] <wikibugs>	 (03CR) 10JMeybohm: Upgrade aux-k8s-codfw to k8s 1.31 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1265426 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey)
[12:20:08] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] admin_ng: upgrade aux-k8s-codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265427 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey)
[12:21:41] <jinxer-wm>	 FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:22:32] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage
[12:22:35] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage
[12:22:40] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782127 (10taavi)
[12:24:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782144 (10Jclark-ctr) a:03Jclark-ctr
[12:25:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782146 (10Jclark-ctr)
[12:26:41] <jinxer-wm>	 FIRING: [54x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:26:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90222 and previous config saved to /var/cache/conftool/dbconfig/20260402-122642-fceratto.json
[12:26:46] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:27:05] <wikibugs>	 (03CR) 10Btullis: Add analytics-fr-tech system user and corresponding groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis)
[12:27:44] <wikibugs>	 06SRE, 10DNS, 06Infrastructure-Foundations, 10netbox, and 3 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11782158 (10Volans) p:05Triage→03Medium I've merged and released the fix, do you want to keep the task open to implement some form o...
[12:28:49] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es1042.eqiad.wmnet
[12:28:50] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es1042.eqiad.wmnet
[12:29:17] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage
[12:30:46] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042.eqiad.wmnet: Restoring section
[12:30:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782163 (10FCeratto-WMF) The host booted, I triggered a puppet run manually, started MariaDB, enabled alarming and checked that icinga is green and started pooling in to help with T422130
[12:31:11] <wikibugs>	 (03CR) 10JMeybohm: service::catalog: add sophroid service catalog entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[12:31:23] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] conftool: add sophroid etcd data [puppet] - 10https://gerrit.wikimedia.org/r/1248611 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[12:31:41] <jinxer-wm>	 RESOLVED: [44x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:31:46] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:31:57] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] wmnet: add sophroid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[12:32:20] <wikibugs>	 (03CR) 10Klausman: [V:03+2 C:03+2] admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266947 (owner: 10Klausman)
[12:32:32] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86555328 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:32:39] <wikibugs>	 (03PS1) 10Anne Tomasevich: Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490)
[12:32:46] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es1042.eqiad.wmnet: Restoring section
[12:32:49] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage
[12:32:58] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042.eqiad.wmnet: Restoring section
[12:32:59] <logmsgbot>	 !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es1042.eqiad.wmnet: Restoring section
[12:33:10] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042: Restoring section
[12:33:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782172 (10ops-monitoring-bot) Starting pool of es1042 by fceratto@cumin1003: Restoring section
[12:33:26] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] "This is the wrong file. Since you're targeting the aux cluster you need to add the pool there (`hieradata/role/common/aux_k8s/worker.yaml`" [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[12:33:34] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 200752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:33:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:34:33] <effie>	 !incidents
[12:34:33] <sirenbot>	 7804 (ACKED)  ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
[12:34:33] <sirenbot>	 7803 (RESOLVED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[12:34:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich)
[12:35:52] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] role::kubernetes::worker: add sophroid to the lvs pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[12:36:32] <wikibugs>	 (03CR) 10Aude: [C:03+1] Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich)
[12:36:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P90224 and previous config saved to /var/cache/conftool/dbconfig/20260402-123650-fceratto.json
[12:38:22] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:38:22] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:38:22] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:39:23] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:39:39] <wikibugs>	 (03Merged) 10jenkins-bot: admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266947 (owner: 10Klausman)
[12:39:48] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:39:48] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:41:20] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782182 (10Jclark-ctr) a:03Jclark-ctr   ` 2026-01-12 21:59:21 An unrecoverable disk media error occurred on Disk 20 in Backplane 2 of Integrated RAID Controller 1. Part Number =...
[12:41:31] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[12:41:32] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:41:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:41:43] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782184 (10BTullis) I have run `cross-validate-accounts` for...
[12:42:33] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782190 (10Jclark-ctr) 05Open→03Resolved
[12:44:17] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[12:45:04] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[12:45:29] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1373.eqiad.wmnet with OS trixie
[12:45:51] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[12:46:59] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P90225 and previous config saved to /var/cache/conftool/dbconfig/20260402-124659-fceratto.json
[12:48:33] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1042: Restoring section
[12:48:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782211 (10ops-monitoring-bot) Completed pooling of es1042 by fceratto@cumin1003: Restoring section
[12:49:21] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1374.eqiad.wmnet with OS trixie
[12:49:22] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:49:36] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782217 (10MatthewVernon) Thanks for the quick fixes @Jclark-ctr :-)
[12:50:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:50:19] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:54:43] <jasmine_>	 hi folks, just a reminder that we will be repooling codfw at 14:00 utc today
[12:55:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:55:32] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 468938744 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:56:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782255 (10Jclark-ctr) @Jgreen  replaced cable link came up.  Sorry for delay
[12:56:37] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:57:07] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90227 and previous config saved to /var/cache/conftool/dbconfig/20260402-125707-fceratto.json
[12:57:11] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:57:25] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance
[12:57:32] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:57:33] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90228 and previous config saved to /var/cache/conftool/dbconfig/20260402-125732-fceratto.json
[12:58:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1300).
[13:00:05] <jouncebot>	 manfredi, HouseOfM, edsanders, and annet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <Lucas_WMDE>	 o/
[13:00:23] <annet>	 o/
[13:00:24] <Lucas_WMDE>	 I can deploy but I need to catch up with the incident first
[13:00:32] <Lucas_WMDE>	 not sure if it’s okay to deploy at the moment
[13:00:41] <edsanders>	 last I heard it isn't
[13:01:01] <edsanders>	 I've also asked to deploy my UBN asap once the incident is resolved
[13:01:14] <Lucas_WMDE>	 https://www.wikimediastatus.net/incidents/kq46rrxd2yy4 is still up
[13:01:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:04] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782282 (10Aklapper)
[13:02:15] <Lucas_WMDE>	 I agree that edsanders’ change seems top priority once we can deploy at all
[13:02:19] <wikibugs>	 (03PS1) 10Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)
[13:03:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis)
[13:03:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:08:53] <wikibugs>	 (03PS2) 10Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)
[13:09:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis)
[13:13:31] <wikibugs>	 (03PS3) 10Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)
[13:15:21] <wikibugs>	 (03PS1) 10Ayounsi: Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042
[13:16:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 47811456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:17:24] <Lucas_WMDE>	 (the codfw repool is being pulled ahead, if that solves the incident then we *may* be able to deploy one or two patches in the window after all)
[13:17:33] <logmsgbot>	 !log jasmine@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool codfw [reason: no reason specified, T414486]
[13:17:37] <stashbot>	 T414486: Upgrade AUX clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414486
[13:17:46] <logmsgbot>	 !log jasmine@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool codfw [reason: no reason specified, T414486]
[13:18:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:18:33] <logmsgbot>	 !log jasmine@cumin1003 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: maintenance - T414486
[13:19:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:20:31] <wikibugs>	 (03CR) 10Btullis: [C:04-1] "I'm just waiting for final approval from Haroon on the ticket, for his 6 reports." [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis)
[13:20:32] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3981016 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:22:09] <sukhe>	 !incidents
[13:22:09] <sirenbot>	 7804 (ACKED)  ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
[13:22:09] <sirenbot>	 7803 (RESOLVED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[13:23:50] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11782358 (10Jclark-ctr)
[13:27:16] <wikibugs>	 (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)
[13:28:30] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:28:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:29:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90229 and previous config saved to /var/cache/conftool/dbconfig/20260402-132914-fceratto.json
[13:29:18] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:29:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782375 (10Jgreen) >>! In T417295#11782255, @Jclark-ctr wrote: > @Jgreen  replaced cable link came up.  Sorry for delay  @Jclark-ctr looks good, it's imaging now. Thanks!
[13:29:52] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782376 (10BTullis) This patch for the...
[13:30:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:30:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Patch looks good, can be merged once approval is done" [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis)
[13:31:11] <wikibugs>	 (03CR) 10Eevans: [C:03+2] charts/cassandra-http-gateway: configurable Cassandra keyspace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259188 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:31:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782380 (10Jclark-ctr)
[13:32:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis)
[13:32:44] <wikibugs>	 (03CR) 10Eevans: [C:03+2] services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:33:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:33:58] <sukhe>	 !ack
[13:33:59] <sirenbot>	 All incidents are already acked.
[13:34:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[13:34:51] <wikibugs>	 (03Merged) 10jenkins-bot: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:35:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 21.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:35:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11782401 (10Jclark-ctr) @VRiley-WMF  Thanks for following up I had Sent the email with instructions to Papaul while I was out on Tuesday. This will require som...
[13:36:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11782402 (10Jclark-ctr) 05Open→03Resolved
[13:37:45] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
[13:39:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P90230 and previous config saved to /var/cache/conftool/dbconfig/20260402-133923-fceratto.json
[13:39:45] <jinxer-wm>	 FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[13:41:28] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Emit Prometheus counter on health check failover [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204)
[13:41:47] <logmsgbot>	 !log jasmine@cumin1003 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: maintenance - T414486
[13:41:51] <stashbot>	 T414486: Upgrade AUX clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414486
[13:42:15] <jinxer-wm>	 RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:42:58] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
[13:43:51] <jinxer-wm>	 RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:44:45] <jinxer-wm>	 RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[13:49:23] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: 10Esanders)
[13:49:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P90231 and previous config saved to /var/cache/conftool/dbconfig/20260402-134931-fceratto.json
[13:49:44] <Lucas_WMDE>	 ^ there’s some chance we’ll be able to deploy; otherwise I’ll undo that CR+2 (cc edsanders)
[13:50:16] <edsanders>	 I'm here
[13:50:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204) (owner: 10Kosta Harlan)
[13:50:41] <edsanders>	 are we ready to deploy?
[13:50:54] <Lucas_WMDE>	 I just got the go-ahead in the security channel, so i think yes
[13:50:55] <wikibugs>	 (03Merged) 10jenkins-bot: Fix suggestion mode availability check [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: 10Esanders)
[13:50:57] <cdanis>	 ye
[13:51:02] * Lucas_WMDE spiders the pig
[13:51:15] <Lucas_WMDE>	 oh, that gate-and-submit was a lot faster than I expected
[13:51:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
[13:51:28] <stashbot>	 T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
[13:51:40] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: add Cache-Control for Gitiles with mod_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1266238 (https://phabricator.wikimedia.org/T409422)
[13:51:40] <edsanders>	 Lucas_WMDE: thanks
[13:52:53] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis)
[13:53:09] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmne
[13:53:10] <logmsgbot>	 t/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_
[13:53:10] <logmsgbot>	 dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 44s)
[13:53:28] * Lucas_WMDE looks
[13:54:06] <Lucas_WMDE>	 I think the sudo docker-pusher failed with “blob upload unknown”?
[13:54:09] <Lucas_WMDE>	 let me try again…
[13:54:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
[13:55:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmne
[13:55:45] <logmsgbot>	 t/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_
[13:55:45] <logmsgbot>	 dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 00m 58s)
[13:56:06] <Lucas_WMDE>	 :(
[13:56:25] <Lucas_WMDE>	 same error I think
[13:56:29] <Lucas_WMDE>	 “blob upload unknown”
[13:57:11] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782509 (10cmooney) We are hopeful the situation should have improved after codfw was repooled, adding additional capacity.  Root cause of the circuit breaking is still being in...
[13:57:15] <edsanders>	 oh dear
[13:58:03] <Lucas_WMDE>	 jasmine_: as the codfw repooler (thanks again), any idea if this could be related?
[13:58:17] <wikibugs>	 (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski)
[13:58:19] <wikibugs>	 (03PS1) 10Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)
[13:58:26] <Lucas_WMDE>	 I’m imagining something like, scap now has to push the new mw image to codfw, but something on codfw might not be ready for it…
[13:58:29] <Lucas_WMDE>	 just guessing though
[13:58:35] <edsanders>	 I'll try once more for luck
[13:58:48] <Lucas_WMDE>	 ok
[13:58:53] <logmsgbot>	 !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
[13:58:56] <stashbot>	 T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
[13:58:58] <Lucas_WMDE>	 I didn’t realize you can deploy, I should’ve asked ^^
[13:59:00] <Lucas_WMDE>	 sorry
[13:59:17] <jasmine_>	 lucas_wmde: looking 
[13:59:20] <Lucas_WMDE>	 thx
[13:59:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90232 and previous config saved to /var/cache/conftool/dbconfig/20260402-135939-fceratto.json
[13:59:43] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:59:56] <hashar>	 jouncebot: nowandnext
[13:59:56] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1300)
[13:59:56] <jouncebot>	 In 0 hour(s) and 0 minute(s): DC Switchover: Day 8 - Codfw Repool (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400)
[13:59:57] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance
[14:00:04] <jouncebot>	 jasmine_: May I have your attention please! DC Switchover: Day 8 - Codfw Repool. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400)
[14:00:05] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90233 and previous config saved to /var/cache/conftool/dbconfig/20260402-140004-fceratto.json
[14:00:08] <logmsgbot>	 !log esanders@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
[14:00:08] <logmsgbot>	 mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
[14:00:08] <logmsgbot>	 awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 15s)
[14:00:48] <hashar>	 jasmine_: I need to reload the CI Jenkins
[14:01:05] <hashar>	 it does not take long, I don't think it affects the switchover
[14:03:07] <hashar>	 !log Jenkins CI: reloading configuration from disk to poll new nodes # T421114
[14:03:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:11] <Lucas_WMDE>	 hashar: FYI, codfw was already repooled to respond to the incident (but I’m not sure how complete it is)
[14:03:12] <stashbot>	 T421114: Rebuild all Jenkins agents VM to Bookworm to support Java 21 - https://phabricator.wikimedia.org/T421114
[14:03:17] <hashar>	 done
[14:03:27] <hashar>	 Lucas_WMDE: ah cool, thank you!
[14:03:48] <wikibugs>	 (03PS2) 10Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)
[14:03:48] <Lucas_WMDE>	 (we’re also still trying to deploy an UBN fix backport, but running into issues in scap)
[14:04:16] <wikibugs>	 (CR) Elukey: [WIP] Move linting to Ruff and apply code fixes (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[14:05:34] <wikibugs>	 (03CR) 10Elukey: "First pass! I have intentionally removed a lot of problems allowing exceptions for tests etc.., I think it would be impossible (and probab" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[14:05:48] <wikibugs>	 (CR) Ottomata: stream: mw-page-html-content-change-enrich (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: JavierMonton)
[14:06:07] <jasmine_>	 hashar: yes we repooled a little bit earlier than scheduled, codfw is back up now 
[14:07:25] <hashar>	 jasmine_: thank you and congratulations
[14:08:22] <hnowlan>	 could/should we make the config reload a part of a repool/depool? 
[14:09:00] <wikibugs>	 (03PS3) 10Bking: opensearch: handle IP changes for software firewall [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714)
[14:09:05] <wikibugs>	 (03PS2) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)
[14:09:07] <wikibugs>	 (CR) Bking: [C:+2] opensearch: handle IP changes for software firewall (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: Bking)
[14:09:11] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] opensearch: handle IP changes for software firewall [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking)
[14:09:16] <logmsgbot>	 !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
[14:09:18] <hashar>	 hnowlan: the Jenkins reload? Nope it is unrelated, I had to do it for some unrelated configuration changes I have made on Jenkins
[14:09:19] <stashbot>	 T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
[14:09:23] <Lucas_WMDE>	 I confess I’m a bit torn between “revert the backport so the deployment is in a known state” and “leave it to be rolled out with the next deploy because it’s small and we really want it deployed”
[14:09:26] <hnowlan>	 hashar: ah okay
[14:10:01] <hashar>	 hnowlan: and whenever I act on Jenkins/Zuul I try to remember to check the deployment calendar to ensure that is not going to break some ongoing deployment :]
[14:10:24] <wikibugs>	 (03PS3) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)
[14:10:26] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 11 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[14:10:32] <logmsgbot>	 !log esanders@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
[14:10:32] <logmsgbot>	 mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
[14:10:32] <logmsgbot>	 awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 16s)
[14:10:48] <Lucas_WMDE>	 still the same error
[14:11:17] <wikibugs>	 (CR) JavierMonton: stream: mw-page-html-content-change-enrich (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: JavierMonton)
[14:11:45] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782589 (10HShaikh) I approve these re...
[14:11:47] <wikibugs>	 (03PS3) 10Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)
[14:12:31] <wikibugs>	 (CR) Elukey: [WIP] Move linting to Ruff and apply code fixes (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[14:13:23] <wikibugs>	 (03CR) 10Ottomata: "It is quite annoying that 'staging' AKA -next in dse-k8s is a different helmfile. It makes it hard to share common settings between 'stagi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton)
[14:13:44] <wikibugs>	 06SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166 (10Lucas_Werkmeister_WMDE) 03NEW
[14:13:47] <Lucas_WMDE>	 I filed T422166 for the deploy blocker (cc edsanders), not sure how it should be tagged
[14:13:48] <stashbot>	 T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
[14:14:06] <wikibugs>	 SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782617 (Lucas_Werkmeister_WMDE) p:Triage→Unbreak!
[14:14:11] <Lucas_WMDE>	 cc jasmine_ ^ if you’re still looking into it
[14:14:18] <jasmine_>	 Lucas_WMDE:  looking now if perhaps it's swift related see
[14:14:18] <jasmine_>	 [0] - https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook
[14:14:55] <wikibugs>	 (03PS1) 10Ladsgroup: Bump maxConnCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267062
[14:15:28] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Bump maxConnCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267062 (owner: 10Ladsgroup)
[14:16:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267062 (owner: 10Ladsgroup)
[14:16:50] <wikibugs>	 06SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782637 (10Lucas_Werkmeister_WMDE) Timeline note: this comes hot on the tail of T422130, for which @jasmine_ repooled codfw slightly earlier than [scheduled](https://wikitech.wikimedia.org/w/index.php?title=Deployments&old...
[14:16:54] <Lucas_WMDE>	 Amir1: good luck with that deploy
[14:16:59] <wikibugs>	 (03Merged) 10jenkins-bot: Bump maxConnCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267062 (owner: 10Ladsgroup)
[14:17:13] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1267062|Bump maxConnCount]]
[14:17:15] <Lucas_WMDE>	 (I expect you’ll run into T422166)
[14:17:23] <Amir1>	 Lucas_WMDE: that hopefully should prevent it from happening?
[14:17:46] <Amir1>	 oh that's a different issue
[14:17:48] <Amir1>	 yay
[14:17:48] <Lucas_WMDE>	 yeah
[14:18:25] <logmsgbot>	 !log ladsgroup@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted
[14:18:25] <logmsgbot>	 /mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/med
[14:18:25] <logmsgbot>	 iawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 11s)
[14:18:28] <Lucas_WMDE>	 yup :(
[14:19:23] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782654 (10BTullis)
[14:19:39] <wikibugs>	 (03CR) 10Btullis: [C:03+2] "Manager approval received." [puppet] - 10https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis)
[14:23:17] <wikibugs>	 (03PS4) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)
[14:23:24] <wikibugs>	 (CR) JavierMonton: stream: mw-page-html-content-change-enrich (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: JavierMonton)
[14:23:36] <Lucas_WMDE>	 (further investigation happening in -sre FTR)
[14:24:35] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Add Mayotte to geo-maps - prefer drmrs [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi)
[14:27:10] <wikibugs>	 06SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782695 (10Scott_French) dockerd logs on deploy1003 for the above example:  ` Apr 02 14:09:17 deploy1003 dockerd[1070]: time="2026-04-02T14:09:17.561327804Z" level=info msg="ignoring event" container=c8f32695fd426caa327d6d...
[14:28:22] <wikibugs>	 (03CR) 10Volans: [C:03+2] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[14:28:30] <moritzm>	 !log installing pyasn1 security updates
[14:28:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:42] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - 10https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans)
[14:30:05] <jouncebot>	 jasmine_: Time to snap out of that daydream and deploy DC Switchover: Day 8 - Codfw Repool. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400).
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1430)
[14:33:14] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782729 (10BTullis) I have now modified the `airflow-platfor...
[14:34:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90236 and previous config saved to /var/cache/conftool/dbconfig/20260402-143452-fceratto.json
[14:34:56] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:36:28] <wikibugs>	 SRE-tools, Cumin, Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360#11782751 (Volans) Open→Resolved The cloudcumin hosts are now using the webproxies to connect to the openstack APIs and the firewall rule has been reverted...
[14:37:31] <wikibugs>	 SRE, DBA, Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782760 (MoritzMuehlenhoff) p:Unbreak!→Medium The immediate impact has been mitigated, reducing priority, the task might still be used to collect followups.
[14:41:11] <Lucas_WMDE>	 huge spike of PHP warnings from ExperimentManager all of a sudden
[14:41:11] <wikibugs>	 (03PS1) 10Eevans: cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112)
[14:41:19] <Lucas_WMDE>	 (logspam-watch)
[14:42:09] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11782776 (10MoritzMuehlenhoff) What kind of access is needed? root access or simply shell access?  We have exist...
[14:42:17] <moritzm>	 !log installing libxml-parser-perl security updates
[14:42:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:33] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782789 (10BTullis) You should also now be able to start con...
[14:45:01] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P90237 and previous config saved to /var/cache/conftool/dbconfig/20260402-144500-fceratto.json
[14:46:38] <wikibugs>	 (03CR) 10Eevans: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[14:47:27] <wikibugs>	 (CR) Elukey: ml-serve: add modified kserve 0.17 chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski)
[14:48:34] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Final review - this is currently a ok-ish use case since we already run the same config in prod. We agreed to open a task and follow up on" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski)
[14:49:26] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[14:50:08] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[14:50:17] <Lucas_WMDE>	 edsanders: are you still around and available to test your backport? (see -sre)
[14:50:45] <wikibugs>	 (03CR) 10Eevans: [V:03+2 C:03+2] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[14:51:13] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
[14:51:41] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
[14:52:01] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782824 (10BTullis) 4 Kerberos principals created and welcom...
[14:52:25] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
[14:52:40] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
[14:53:40] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
[14:53:54] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
[14:54:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782828 (Jgreen) Open→Resolved hosts are up and running
[14:55:09] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P90239 and previous config saved to /var/cache/conftool/dbconfig/20260402-145508-fceratto.json
[14:55:12] <Lucas_WMDE>	 (the ExperimentManager warning spike seems to have abated again fwiw)
[14:56:38] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Manual sync-world to pick up 1267062, 1266985 - T422143
[14:56:41] <stashbot>	 T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
[14:56:44] <logmsgbot>	 !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-eqiad,mr1-eqiad IPv6 with reason: switching from OSFP to BGP
[14:56:46] <Lucas_WMDE>	 \o/
[14:57:44] <logmsgbot>	 !log swfrench@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
[14:57:44] <logmsgbot>	 mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
[14:57:44] <logmsgbot>	 awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 06s)
[14:58:20] <wikibugs>	 (03CR) 10Ssingh: "I am guessing this is based on probenet data? (not that everything else in the repo currently is but I am mostly curious)" [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi)
[14:59:32] <papaul>	 !log ongoing maintenance on mr1-eqiad
[14:59:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:40] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 - T422143
[15:00:04] <jouncebot>	 jnuche and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1500)
[15:00:38] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton)
[15:02:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] buildkitd: Bump buildkit image to wmf-v0.29.0 [puppet] - 10https://gerrit.wikimedia.org/r/1266395 (https://phabricator.wikimedia.org/T415284) (owner: 10Ahmon Dancy)
[15:02:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Preseed notes often use globbing where applicable, but with our ongoing migration of all servers to UEFI for hardware there will be a lot " [puppet] - 10https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: 10Herron)
[15:03:03] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
[15:03:45] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
[15:04:20] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:04:20] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:05:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90241 and previous config saved to /var/cache/conftool/dbconfig/20260402-150517-fceratto.json
[15:05:20] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:05:34] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Maintenance
[15:05:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90242 and previous config saved to /var/cache/conftool/dbconfig/20260402-150542-fceratto.json
[15:05:49] <wikibugs>	 (03PS1) 10Papaul: Remove OSFP from mr1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1267081 (https://phabricator.wikimedia.org/T421238)
[15:06:35] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:07:05] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:07:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11782910 (10Jhancock.wm)
[15:08:45] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Remove OSFP from mr1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1267081 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul)
[15:09:29] <wikibugs>	 (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton)
[15:11:23] <wikibugs>	 (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: 10JavierMonton)
[15:11:40] <moritzm>	 !log installing apache2 security updates
[15:11:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:20] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:12:20] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:12:45] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:12:59] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:16:17] <wikibugs>	 (03PS1) 10Papaul: Add back "replace osfp" to be able to remove it [homer/public] - 10https://gerrit.wikimedia.org/r/1267085 (https://phabricator.wikimedia.org/T421238)
[15:20:29] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Add back "replace osfp" to be able to remove it [homer/public] - 10https://gerrit.wikimedia.org/r/1267085 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul)
[15:22:31] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[15:23:08] <papaul>	 !log maintenance complete on mr1-eqiad
[15:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:22] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:26:12] <swfrench-wmf>	 !log restarted docker-registry-restricted.service on registry200[45] - T422166
[15:26:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:14] <stashbot>	 T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
[15:26:28] <logmsgbot>	 !log swfrench@deploy1003 sync-world aborted: Manual full-rebuild sync-world to pick up 1267062, 1266985 - T422143 (duration: 26m 48s)
[15:26:31] <stashbot>	 T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
[15:27:38] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:27:46] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:31:16] <swfrench-wmf>	 !log restarted docker-registry-ml.service on registry200[45] - T422166
[15:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:19] <stashbot>	 T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
[15:32:34] <moritzm>	 !log installing freetype security updates
[15:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:59] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: adjust idleTimeout on Jetty [puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[15:33:00] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143
[15:33:02] <stashbot>	 T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
[15:34:43] <wikibugs>	 (03PS4) 10Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)
[15:35:06] <wikibugs>	 (CR) Elukey: [WIP] Move linting to Ruff and apply code fixes (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[15:38:44] <wikibugs>	 (03CR) 10Elukey: "Local, venvs created (so not the first run):" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[15:39:18] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90244 and previous config saved to /var/cache/conftool/dbconfig/20260402-153918-fceratto.json
[15:39:22] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[15:41:49] <wikibugs>	 (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1256301/8370/" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[15:41:50] <wikibugs>	 (PS5) Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)
[15:44:23] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:45:37] <wikibugs>	 (PS14) Herron: site: opt-in insetup defaults by hostname prefix [puppet] - https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929)
[15:46:55] <wikibugs>	 (CR) A smart kitten: "FWIW that [phab1004 NOOP result](https://puppet-compiler.wmflabs.org/output/1256301/8370/phab1004.eqiad.wmnet/index.html) seems wrong - it" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[15:46:59] <wikibugs>	 (CR) A smart kitten: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[15:48:31] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards Has improved   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[15:48:57] <wikibugs>	 (CR) A smart kitten: "(FWIW @dzahn@wikimedia.org, feel free to shoot me a message in IRC if you want to sync-up e.g. if/when deploying/testing this patch. I'm n" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[15:49:08] <wikibugs>	 (CR) Herron: [C:+2] "thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: Herron)
[15:49:22] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:49:26] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P90245 and previous config saved to /var/cache/conftool/dbconfig/20260402-154925-fceratto.json
[15:50:05] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:50:09] <stashbot>	 T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
[15:50:10] <wikibugs>	 (CR) A smart kitten: "(if I'm around in IRC at the time you'll be deploying this, that is; otherwise feel free to just deploy it if/when is good for you :) )" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[15:51:13] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Continuing with sync
[15:55:31] <wikibugs>	 (PS3) Btullis: Remove the site.pp definitions for decommissioned an-worker hosts [puppet] - https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948)
[15:55:47] <wikibugs>	 (PS1) Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)
[15:56:13] <wikibugs>	 (CR) Dzahn: [V:+1 C:+1] "it's because puppet DB queries were introduced somewhere (not by your patch) which often breaks compiler runs (Failed to execute '/pdb/que" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[15:59:23] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:59:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P90246 and previous config saved to /var/cache/conftool/dbconfig/20260402-155934-fceratto.json
[16:00:05] <jouncebot>	 jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1600). Please do the needful.
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:34] <Lucas_WMDE>	 we’re so close to finishing the backport+config window lol
[16:00:49] <Lucas_WMDE>	 (with 1/4 patches deployed)
[16:01:31] <wikibugs>	 (PS2) Herron: preseed: use efi for new kafka-logging hosts [puppet] - https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929)
[16:01:33] <wikibugs>	 (CR) CI reject: [V:-1] preseed: use efi for new kafka-logging hosts [puppet] - https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929) (owner: Herron)
[16:01:38] <wikibugs>	 (PS2) Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)
[16:02:56] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 (duration: 29m 56s)
[16:02:59] <stashbot>	 T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
[16:02:59] <swfrench-wmf>	 \o/
[16:03:04] <Lucas_WMDE>	 \o/ \o/ \o/
[16:03:40] <Lucas_WMDE>	 !log UTC afternoon backport+config window (very belatedly) done ^^
[16:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:50] <Lucas_WMDE>	 thanks for figuring it out and deploying!
[16:04:08] <Lucas_WMDE>	 Amir1: your maxConnCount bump got deployed now btw ^
[16:04:15] <Amir1>	 thanks!
[16:04:22] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:05:09] <wikibugs>	 SRE, Datacenter-Switchover: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11783170 (Scott_French) p:Unbreak!→Medium This was a curious one. Many thanks to @elukey and @CDanis for the assistance.  tl;dr - Cached connections in the (restricted) docker registry's...
[16:05:26] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11783179 (Ahoelzl) I approve the addition of the listed WME...
[16:05:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:09:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:23] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:09:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90247 and previous config saved to /var/cache/conftool/dbconfig/20260402-160942-fceratto.json
[16:09:46] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[16:09:59] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance
[16:10:44] <wikibugs>	 (Abandoned) Federico Ceratto: wmnet: update CNAME records for DB masters to eqiad [dns] - https://gerrit.wikimedia.org/r/1255655 (https://phabricator.wikimedia.org/T416705) (owner: Federico Ceratto)
[16:11:45] <wikibugs>	 (PS3) Herron: preseed: use efi for new kafka-logging hosts [puppet] - https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929)
[16:12:31] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:12:43] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:12:55] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:13:01] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:14:07] <wikibugs>	 (CR) Herron: [C:+2] "ok! lets give this a try" [alerts] - https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: Herron)
[16:14:23] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:15:28] <wikibugs>	 (Merged) jenkins-bot: burrow: update expressions to handle multiple instances [alerts] - https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: Herron)
[16:15:28] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] phabricator: Set a custom default-mail-address for the test instance [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[16:15:53] <swfrench-wmf>	 jouncebot: nowandnext
[16:15:53] <jouncebot>	 For the next 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1600)
[16:15:53] <jouncebot>	 In 0 hour(s) and 44 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
[16:15:53] <jouncebot>	 In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
[16:16:55] <wikibugs>	 (CR) Herron: [C:+2] "thanks all!" [puppet] - https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: Herron)
[16:18:02] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "deployed. confirmed it is a NOOP / no error on production host." [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[16:18:31] <wikibugs>	 (CR) Scott French: "Thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[16:19:10] <wikibugs>	 (CR) Scott French: [C:+2] deployment_server: absent image-suggestion k8s creds config [puppet] - https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[16:23:33] <wikibugs>	 (Restored) Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: Mmartorana)
[16:24:14] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:24:35] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11783241 (BTullis) Open→Resolved p:Triage→...
[16:25:39] <wikibugs>	 (PS6) Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366)
[16:25:48] <wikibugs>	 (CR) CI reject: [V:-1] config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: Mmartorana)
[16:26:51] <wikibugs>	 (Abandoned) Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: Mmartorana)
[16:29:13] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:31:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: Anne Tomasevich)
[16:32:25] <wikibugs>	 (PS1) Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1267116 (https://phabricator.wikimedia.org/T421366)
[16:33:19] <wikibugs>	 SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11783346 (Ladsgroup) I was looking into this a bit yesterday (more general...
[16:34:13] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:48] <wikibugs>	 (CR) Btullis: data-platform: Add alerts for opensearch on k8s certificate expiry (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: Bking)
[16:37:32] <wikibugs>	 SRE, Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11783388 (Alberto) Thank you very much for your help!  I have correctly implemented the User-Agent in my LocalSettings.php for both MediaWiki core and the QuickInstantCommons...
[16:39:14] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:39:22] <wikibugs>	 (CR) Scott French: [C:+2] deployment_server: remove absented image-suggestion k8s creds config [puppet] - https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[16:39:46] <wikibugs>	 (PS4) Scott French: deployment_server: remove absented image-suggestion k8s creds config [puppet] - https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096)
[16:40:30] <wikibugs>	 (PS1) Daniel Kinzler: Revert "rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - https://gerrit.wikimedia.org/r/1267119
[16:41:02] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] "revert undeployed change" [deployment-charts] - https://gerrit.wikimedia.org/r/1267119 (owner: Daniel Kinzler)
[16:43:22] <wikibugs>	 (CR) Scott French: [C:+2] deployment_server: remove absented image-suggestion k8s creds config [puppet] - https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[16:44:00] <wikibugs>	 (Merged) jenkins-bot: Revert "rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - https://gerrit.wikimedia.org/r/1267119 (owner: Daniel Kinzler)
[16:45:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops, observability, Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11783408 (Jclark-ctr) a:herron→Jclark-ctr
[16:45:58] <wikibugs>	 (PS1) Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581)
[16:47:02] <wikibugs>	 SRE-Access-Requests, Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189 (prabhat) NEW
[16:47:35] <wikibugs>	 (PS1) Herron: kafkamon: update burrow ports [puppet] - https://gerrit.wikimedia.org/r/1267121 (https://phabricator.wikimedia.org/T418858)
[16:47:47] <wikibugs>	 SRE-Access-Requests, Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783451 (prabhat)
[16:49:51] <wikibugs>	 (CR) Scott French: "Thank you both for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[16:50:07] <wikibugs>	 (CR) Scott French: [C:+2] image-suggestion: remove service configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[16:52:26] <wikibugs>	 SRE-Access-Requests, Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783519 (ssingh) request and key confirmed out of band.
[16:53:23] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3009.esams.wmnet} and A:liberica
[16:54:23] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:57:02] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3009.esams.wmnet} and A:liberica
[16:58:15] <wikibugs>	 (Merged) jenkins-bot: image-suggestion: remove service configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[16:59:30] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3008.esams.wmnet} and A:liberica
[17:00:05] <jouncebot>	 bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
[17:00:07] <swfrench-wmf>	 o/
[17:00:25] <swfrench-wmf>	 I'll be deploying some admin_ng changes shortly
[17:02:25] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:03:03] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3008.esams.wmnet} and A:liberica
[17:03:30] <wikibugs>	 (PS1) JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216)
[17:04:46] <wikibugs>	 (CR) Ottomata: [C:+1] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: JavierMonton)
[17:05:13] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:05:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:07:04] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: JavierMonton)
[17:07:04] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[17:08:15] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[17:08:37] * bd808 checks for things that need releasing
[17:09:06] <wikibugs>	 (PS1) DCausse: search: add space-discount for wikidata custom prefix search profiles [mediawiki-config] - https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427)
[17:09:09] <wikibugs>	 (Merged) jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: JavierMonton)
[17:09:17] <bd808>	 nothing for my window this week</window>
[17:09:39] <wikibugs>	 (PS4) Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)
[17:10:12] <wikibugs>	 (CR) CI reject: [V:-1] ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[17:10:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[17:10:37] <wikibugs>	 (CR) Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[17:10:48] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[17:11:08] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[17:11:20] <wikibugs>	 (PS5) Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)
[17:11:31] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[17:11:50] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[17:12:02] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[17:12:12] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[17:12:40] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[17:13:56] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[17:14:49] <wikibugs>	 (CR) Scott French: "Thanks for the review!" [dns] - https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[17:15:32] <wikibugs>	 (CR) Scott French: [C:+2] wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
[17:15:41] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[17:16:11] <logmsgbot>	 !log swfrench@dns1004 START - running authdns-update
[17:18:08] <logmsgbot>	 !log swfrench@dns1004 END - running authdns-update
[17:20:27] <wikibugs>	 (PS4) Scott French: service: remove image-suggestion [puppet] - https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096)
[17:26:28] <wikibugs>	 SRE-Access-Requests, Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783746 (prabhat)
[17:27:48] <swfrench-wmf>	 alright, I believe I'm done with my side of this window
[17:28:10] <wikibugs>	 (PS1) Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)
[17:28:39] <wikibugs>	 (CR) CI reject: [V:-1] cassandra-dev: add ferm srange for k8s staging [puppet] - https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: Eevans)
[17:29:04] <wikibugs>	 (PS1) Snwachukwu: Add rest gateway routes for video_plays path. [deployment-charts] - https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202)
[17:31:23] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] phabricator: Set a custom default-mail-address for the test instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten)
[17:31:54] <wikibugs>	 (CR) Mforns: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[17:32:10] <wikibugs>	 (PS2) Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)
[17:35:42] <wikibugs>	 (PS1) Scott French: fixtures: clean up reference to image-suggestion [deployment-charts] - https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096)
[17:36:02] <wikibugs>	 (CR) Snwachukwu: [C:+2] Add rest gateway routes for video_plays path. [deployment-charts] - https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[17:36:07] <wikibugs>	 (PS3) Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)
[17:36:12] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: Eevans)
[17:36:51] <wikibugs>	 (PS1) Ssingh: admin: update SSH key for ptiwary [puppet] - https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189)
[17:36:54] <wikibugs>	 (CR) Snwachukwu: [C:+2] "Thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[17:37:00] <wikibugs>	 (CR) Snwachukwu: [V:+2 C:+2] Add rest gateway routes for video_plays path. [deployment-charts] - https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[17:39:23] <wikibugs>	 (PS3) Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)
[17:39:32] <wikibugs>	 (CR) Eevans: [C:+2] cassandra-dev: add ferm srange for k8s staging [puppet] - https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: Eevans)
[17:39:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783799 (Jgreen) Open→Resolved boxes are imaged, in replication, and ready for traffic once pfw policy is done
[17:40:49] <wikibugs>	 (CR) CI reject: [V:-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: Bking)
[17:42:20] <wikibugs>	 (CR) Ottomata: [C:+1] Add rest gateway routes for video_plays path. [deployment-charts] - https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[17:42:35] <wikibugs>	 (CR) Ssingh: "Request verified out of band, please feel free to do an additional check." [puppet] - https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: Ssingh)
[17:44:20] <wikibugs>	 (CR) Ayounsi: "That's a follow up from an email that was sent to noc@ from a local ISP." [dns] - https://gerrit.wikimedia.org/r/1267042 (owner: Ayounsi)
[17:44:27] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783815 (HShaikh) As prabhat's manager I approve this request.
[17:45:50] <wikibugs>	 (CR) Ssingh: [C:+1] "Ah I see it now -- my bad. Thanks." [dns] - https://gerrit.wikimedia.org/r/1267042 (owner: Ayounsi)
[17:46:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:47:50] <wikibugs>	 (PS1) Snwachukwu: Add rest gateway routes for video_plays path production. [deployment-charts] - https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202)
[17:49:08] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "I can see in compiler how this changes things on new instance "integration-agent-docker-1070" just created on https://phabricator.wikimedi" [puppet] - https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: Hashar)
[17:50:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783859 (Jgreen)
[17:54:07] <wikibugs>	 SRE, DNS, Infrastructure-Foundations, netbox, and 2 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11783873 (ssingh) Thanks for fixing it but I agree that we need an alert for this otherwise we will miss this again.
[17:55:40] <wikibugs>	 (CR) Snwachukwu: [C:+2] Add rest gateway routes for video_plays path production. [deployment-charts] - https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[17:56:20] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "noop confirmed on contint prod hosts" [puppet] - https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: Hashar)
[17:57:43] <wikibugs>	 (Merged) jenkins-bot: Add rest gateway routes for video_plays path production. [deployment-charts] - https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[17:58:30] <wikibugs>	 (PS4) Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)
[17:59:52] <logmsgbot>	 !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[18:00:10] <logmsgbot>	 !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[18:00:29] <logmsgbot>	 !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[18:00:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking)
[18:00:48] <logmsgbot>	 !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[18:01:24] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[18:01:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:04:40] <wikibugs>	 (03PS5) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)
[18:05:15] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] fixtures: clean up reference to image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[18:06:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking)
[18:07:21] <wikibugs>	 (03CR) 10Muehlenhoff: "One validation is fine, you can either go ahead and merge it or I'll take care of it via Clinic duty, either is fine." [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh)
[18:07:35] <wikibugs>	 (03PS6) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)
[18:14:19] <wikibugs>	 (03CR) 10Bking: data-platform: Add alerts for opensearch on k8s certificate expiry (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking)
[18:16:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783930 (10Jclark-ctr) a:05Jgreen→03Jclark-ctr
[18:24:15] <wikibugs>	 (03CR) 10SBassett: [C:03+2] "Oh, whoops, I see the commit msg says "miscweb(research-landing-page): bump image version".  Just to be clear, this change set is for" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174750 (https://phabricator.wikimedia.org/T399132) (owner: 10Jly)
[18:24:47] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5006.eqsin.wmnet} and A:liberica
[18:25:57] <jinxer-wm>	 FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:28:03] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5006.eqsin.wmnet} and A:liberica
[18:28:50] <wikibugs>	 (03PS3) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer)
[18:29:53] <sukhe>	 port 80!?
[18:30:57] <topranks>	 yeah I'm not sure why it's firing... sort of seems ok?
[18:30:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:31:19] <topranks>	 https://phabricator.wikimedia.org/P90248
[18:31:30] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[18:31:31] <wikibugs>	 (03CR) 10Scott French: [C:03+2] fixtures: clean up reference to image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[18:31:38] <sukhe>	 topranks: yeah it resolved. haven't looked very deeply into what happened but can't see anything obvious
[18:31:56] <moritzm>	 same here
[18:31:56] <topranks>	 I gotta say the probe dashboard is absolutely incomprehensible to me, any time I have to visit it 
[18:32:09] <topranks>	 I don't see any signs of general connectivity issues 
[18:32:25] <moritzm>	 and ipv6 only?
[18:32:30] <sukhe>	 seems so yeah
[18:33:06] <topranks>	 yeah, tbh that is further evidence it is just an outlier failed connection, for whatever reason 
[18:33:08] <sukhe>	 topranks: yep. we should improve that. it defaults to "All"
[18:33:11] <wikibugs>	 (03Merged) 10jenkins-bot: fixtures: clean up reference to image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[18:33:16] <topranks>	 rather than a systemic problem like everyone is failing to connect 
[18:33:26] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784038 (10gmodena) >>! In T422141#11782776, @MoritzMuehlenhoff wrote: > What kind of access is needed? root ac...
[18:33:51] <moritzm>	 don't see any specific signs of user-visible impact from graphs
[18:34:21] <wikibugs>	 (03CR) 10SBassett: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer)
[18:34:21] <wikibugs>	 (03CR) 10Ssingh: "Thanks, I will merge if I can find a reviewer otherwise feel free to take it later." [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh)
[18:35:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784042 (10MoritzMuehlenhoff) >>! In T422141#11784038, @gmodena wrote: >>>! In T422141#11782776, @MoritzMuehlen...
[18:35:58] <wikibugs>	 (03CR) 10Reedy: [C:03+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer)
[18:37:05] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Two reviews by the sec team, merging." [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer)
[18:37:06] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer)
[18:37:12] <Reedy>	 haha
[18:37:13] <Reedy>	 consensus!
[18:37:39] <sukhe>	 Reedy: who am I to say no to two +1s?!
[18:38:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM syntax-wise" [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh)
[18:39:25] <topranks>	 https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=probe_success%7Baddress%3D%222620%3A0%3A861%3Aed1a%3A%3A1%22%2C%20instance%3D%22text%3A80%22%7D%5B20m%5D&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
[18:39:33] <topranks>	 I really don't understand why that fired, but anyway 
[18:40:27] <sukhe>	 topranks: doesn't add up yep
[18:40:32] <sukhe>	 anyway nothing to do here I feel 
[18:40:49] <topranks>	 yep enough other stuff to worry about 
[18:40:58] <moritzm>	 yeah, this feels like a one-time blip, and if it happens again, we can still correlate further
[18:41:21] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: update SSH key for ptiwary [puppet] - 10https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: 10Ssingh)
[18:41:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11784073 (10Jclark-ctr) a:05Jclark-ctr→03BTullis
[18:41:52] <wikibugs>	 (03CR) 10Alex.sanford: [C:03+1] Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: 10WikiBayer)
[18:42:19] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11784074 (10Jclark-ctr) a:05Jclark-ctr→03BTullis
[18:44:03] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784075 (10gmodena) >>! In T422141#11784042, @MoritzMuehlenhoff wrote: > We don't have a specific access group...
[18:44:32] <wikibugs>	 (03PS1) 10Ottomata: dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794)
[18:45:52] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:46:50] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata)
[18:49:09] <wikibugs>	 (03Merged) 10jenkins-bot: dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata)
[18:51:31] <logmsgbot>	 cmooney@cumin1003 netbox (PID 2341745) is awaiting input
[18:51:57] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for new lumen 100g transport - cmooney@cumin1003"
[18:51:58] <wikibugs>	 (03PS1) 10Reedy: Undeploy Extension:StopForumSpam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185)
[18:52:24] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for new lumen 100g transport - cmooney@cumin1003"
[18:52:24] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:52:28] <wikibugs>	 (03PS1) 10Cathal Mooney: Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - 10https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878)
[18:53:17] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - 10https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) (owner: 10Cathal Mooney)
[18:54:38] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - 10https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) (owner: 10Cathal Mooney)
[18:54:48] <wikibugs>	 (03PS1) 10Ottomata: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794)
[18:54:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata)
[18:55:10] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[18:55:19] <wikibugs>	 (03PS2) 10Ottomata: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794)
[18:56:09] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784108 (10AWesterinen-WMF) Retried ... no change
[18:56:34] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[18:56:53] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] Undeploy Extension:StopForumSpam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: 10Reedy)
[18:57:10] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata)
[18:59:14] <wikibugs>	 (03Merged) 10jenkins-bot: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata)
[19:00:25] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - 10https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn)
[19:01:19] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[19:01:23] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[19:02:03] <wikibugs>	 (03PS3) 10Elukey: opensearch-semantic-search-test: Add to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking)
[19:04:43] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[19:04:47] <wikibugs>	 (03CR) 10Scott French: [C:03+2] service: remove image-suggestion [puppet] - 10https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[19:06:31] <wikibugs>	 (03PS1) 10Cathal Mooney: Management routers: set autonomous system number [homer/public] - 10https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238)
[19:09:11] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on releases2003.codfw.wmnet with reason: T418109
[19:09:14] <stashbot>	 T418109: Update Jenkins hosts from Java 17 to Java 21 - https://phabricator.wikimedia.org/T418109
[19:09:30] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784127 (10MoritzMuehlenhoff) You still need to request "wmf" at https://idm.wikimedia.org/permissions/, so far you only r...
[19:12:13] <wikibugs>	 (03PS1) 10Dzahn: jenkins: add profile::ci::docker to role [puppet] - 10https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109)
[19:16:13] <wikibugs>	 06SRE, 07Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11784146 (10Scott_French)
[19:16:44] <wikibugs>	 (03PS1) 10Ottomata: mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216)
[19:19:50] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[19:21:50] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[19:23:41] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784167 (10AWesterinen-WMF) I tried to do that, but see no option for wmf.  Only "logstash", "airflow" and "spiderpig".
[19:24:12] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[19:24:16] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[19:33:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11784179 (10ssingh) 05Open→03Resolved a:03ssingh Should now be rolled out everywhere, let us know if you have any issues with access.
[19:35:49] <wikibugs>	 (03PS1) 10Dduvall: zuul: Move cross-profile references to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207)
[19:35:51] <wikibugs>	 (03PS1) 10Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207)
[19:45:21] <wikibugs>	 (03PS2) 10Dduvall: zuul: Move cross-profile references to hiera [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207)
[19:45:21] <wikibugs>	 (03PS2) 10Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - 10https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207)
[19:46:02] <wikibugs>	 (03CR) 10Dduvall: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: 10Dduvall)
[19:48:46] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards Has improved   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[19:56:29] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[19:56:32] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[19:56:48] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[19:56:50] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[19:57:36] <nya_1F616EMO>	 Is anyone here waiting for the UTC late backport window? And are there any blockers to the window?
[19:57:46] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[19:57:48] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T2000)
[20:00:05] <jouncebot>	 nya_1F616EMO and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <nya_1F616EMO>	 o/
[20:00:26] <bwang>	 I'm here~!
[20:00:49] * nya_1F616EMO prays for a deployer to show up
[20:02:56] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[20:03:03] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[20:04:18] <wikibugs>	 (03PS4) 10Bking: opensearch-semantic-search-test: Add to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293)
[20:05:17] <wikibugs>	 (03CR) 10Bking: opensearch-semantic-search-test: Add to services proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking)
[20:05:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:07:12] <wikibugs>	 (03PS1) 10Ottomata: mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216)
[20:07:58] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[20:08:04] <wikibugs>	 (03CR) 10Ottomata: [V:03+2 C:03+2] mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[20:08:51] <nya_1F616EMO>	 It seems like we're out of luck?
[20:09:35] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[20:09:44] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[20:12:35] <wikibugs>	 10ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11784255 (10phaultfinder)
[20:13:27] <wikibugs>	 (03PS5) 10Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293)
[20:13:27] <wikibugs>	 (03CR) 10Bking: "Thanks for the course correction! I think we have a path forward here; we've added envoy TLS termination in 1248865 and monitoring for the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking)
[20:13:43] <Kemayo>	 I'd offer to do it, but there was a big breakage of the ability to scap deploy things this morning, so it might be a good idea to have a real deployer present who could recover from an error if it happened.
[20:13:57] <wikibugs>	 (03Abandoned) 10Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking)
[20:15:00] <nya_1F616EMO>	 One of my patches is a time-specific logo update for zhwikinews, and one is a non-time-specific SecurePoll deployment to a private wiki. I may propose to the local community to use CSS for the logo change; do you recommend doing so?
[20:17:01] <Kemayo>	 Feels inconvenient to deal with, given all the various logo sizes involved.
[20:17:17] <nya_1F616EMO>	 You mean to deploy? 
[20:17:38] <nya_1F616EMO>	 Currently working on the CSS solution
[20:17:45] <nya_1F616EMO>	 (cuz there are no deployments on Fridays, as we all know)
[20:17:52] <Kemayo>	 If you and bwang don't mind, I could certainly kick off a spiderpig build with all your patches. If it breaks in the same way as it did before, it'd just fail to deploy even to testservers rather than ruining production.
[20:18:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784278 (10VRiley-WMF) This ticket seems like it relates to another ticket https://phabricator.wikimedia.org/T413559  @Eevans this server is out of warranty, would you like us to replace this disk or leave it...
[20:18:34] <Kemayo>	 There's just a *chance* that it'll wedge us into a state where a releng person needs to look at things before any deploys can happen. 😅
[20:19:12] <nya_1F616EMO>	 I won't let my SecurePoll patch go out in this state anyway; it's up to you whether to accept the zhwikinews logo change.
[20:20:05] <Kemayo>	 I'm fine giving it a shot.
[20:20:10] <Kemayo>	 bwang: Want yours in as well?
[20:21:26] <nya_1F616EMO>	 Wait, I found something that might be off
[20:21:44] <nya_1F616EMO>	 Let me check my patch for resolutions
[20:22:04] <Kemayo>	 Just let me know when you're happy with it, and if bwang hasn't shown up by then I can do just-yours.
[20:22:13] <nya_1F616EMO>	 Ah nvm, the script did the job for me
[20:22:25] <nya_1F616EMO>	 It successfully reduced the resolution to 135x135, nice
[20:22:33] <nya_1F616EMO>	 so good to go
[20:22:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO)
[20:24:24] <wikibugs>	 (03CR) 10Bking: [C:03+2] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking)
[20:24:56] <wikibugs>	 (03CR) 10Bking: [C:03+2] "Ben is out for the next 10 days, so I'm going to be bold and merge after addressing his concerns." [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking)
[20:25:02] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - 10https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: 10Bking)
[20:25:19] <wikibugs>	 (03Merged) 10jenkins-bot: zhwikinews: 20th anniversary logo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO)
[20:25:37] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]]
[20:25:40] <stashbot>	 T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165
[20:28:46] <bwang>	 Sorry I was in a call
[20:28:52] <bwang>	 I'm still here and able to help test the backport
[20:29:16] <wikibugs>	 (03PS2) 10Clare Ming: Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209)
[20:29:22] <logmsgbot>	 !log kemayo@deploy1003 1f616emo, kemayo: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:29:40] <Kemayo>	 nya_1F616EMO: Can you verify your change?
[20:29:44] <nya_1F616EMO>	 testing
[20:30:44] <nya_1F616EMO>	 it works, tested on vector-2022, vector, monobook, timeless.
[20:31:03] <Kemayo>	 I will continue the deploy, then.
[20:31:06] <nya_1F616EMO>	 Thanks
[20:31:11] <logmsgbot>	 !log kemayo@deploy1003 1f616emo, kemayo: Continuing with sync
[20:33:09] <wikibugs>	 (03CR) 101F616EMO: "Will re-schedule in the Monday, May 04 UTC morning backport window, together with Ia1a463ba01452b76b73ff6b59b821eae9154ddf8." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO)
[20:33:21] <wikibugs>	 (03PS1) 101F616EMO: Revert "zhwikinews: 20th anniversary logo change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165)
[20:33:35] <wikibugs>	 (03CR) 101F616EMO: "Will re-schedule in the Monday, May 04 UTC morning backport window, together with Iea2390c01600b5f93c7b01f5605d887541c74b02." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: 101F616EMO)
[20:33:52] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784305 (10MoritzMuehlenhoff) >>! In T422141#11784075, @gmodena wrote: >>>! In T422141#11784042, @MoritzMuehlen...
[20:35:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784306 (10MoritzMuehlenhoff) >>! In T420053#11784167, @AWesterinen-WMF wrote: > I tried to do that, but see no option for...
[20:37:23] <logmsgbot>	 !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] (duration: 11m 46s)
[20:37:26] <stashbot>	 T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165
[20:37:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 182040496 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:38:32] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3815080 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:39:36] <Kemayo>	 nya_1F616EMO: Okay, should be live now.
[20:40:01] <nya_1F616EMO>	 Nice and verified the changes through prod.
[20:40:04] <nya_1F616EMO>	 Thank you for your help
[20:40:33] <wikibugs>	 (03CR) 10Cathal Mooney: "Do we have stats for RE?  Is it that much better to eqsin on average than drmrs?  From the geography it's not clear to me." [dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi)
[20:43:58] <Kemayo>	 nya_1F616EMO: np!
[20:47:18] <bwang>	 Hi, are we still able to backport my patch?
[20:47:55] <Kemayo>	 bwang: sure, I can get it if you're willing to stick around until it's done.
[20:48:11] <bwang>	 Yes of course
[20:48:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich)
[20:51:01] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11784343 (10VRiley-WMF) Hey @elukey Thanks for working on this! Is there anything I can do from my end to assist with this? Let us know...
[20:51:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11784345 (10VRiley-WMF)
[20:51:52] <wikibugs>	 (03Merged) 10jenkins-bot: Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: 10Anne Tomasevich)
[20:52:10] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]]
[20:52:13] <stashbot>	 T420490: [Logged in reader retention baseline] Launch A/A experiment - https://phabricator.wikimedia.org/T420490
[20:52:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11784348 (10VRiley-WMF)
[20:53:51] <logmsgbot>	 !log kemayo@deploy1003 annet, kemayo: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:54:13] <Kemayo>	 bwang: let me know when it's tested
[20:56:36] <bwang>	 checking now
[20:57:02] <wikibugs>	 (03PS1) 10DLynch: SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267204
[20:57:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267204 (owner: 10DLynch)
[20:58:54] <bwang>	 Looks good
[20:59:09] <Kemayo>	 Continuing, then.
[20:59:12] <logmsgbot>	 !log kemayo@deploy1003 annet, kemayo: Continuing with sync
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T2100)
[21:01:02] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784373 (10AWesterinen-WMF) Updated my email and requested wmf access.   But, I have a further problem.  I tried to ssh in...
[21:01:16] <Jdlrobson>	 Kemayo: let me know when you are done. I have a deploy but I need 15m to prep 
[21:01:46] <Kemayo>	 Jdlrobson: Sure, I just have one more patch to get out after this, so that should fit into your timing pretty okay.
[21:03:50] <logmsgbot>	 !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] (duration: 11m 40s)
[21:03:54] <stashbot>	 T420490: [Logged in reader retention baseline] Launch A/A experiment - https://phabricator.wikimedia.org/T420490
[21:04:06] <Kemayo>	 bwang: Live now.
[21:04:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267204 (owner: 10DLynch)
[21:08:09] <wikibugs>	 (03PS2) 10Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)
[21:15:33] <wikibugs>	 (03Merged) 10jenkins-bot: SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267204 (owner: 10DLynch)
[21:15:47] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]]
[21:17:26] <logmsgbot>	 !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:18:36] <logmsgbot>	 !log kemayo@deploy1003 kemayo: Continuing with sync
[21:23:09] <wikibugs>	 (03PS3) 10Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)
[21:23:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine)
[21:23:51] <wikibugs>	 (03PS4) 10Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)
[21:26:03] <wikibugs>	 06SRE, 06DBA, 07Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11784439 (10Od1n) FWIW, I'm still currently encountering this error on frwiki, and it prevents my local custom JS/CSS files from loading.  Unexpectedly not loaded: * `Special:Myp...
[21:26:25] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]]
[21:26:41] <Kemayo>	 Jdlrobson: Sorry, the k8s deploy failed, which is making everything *fun*.
[21:27:13] <Jdlrobson>	 no worries
[21:27:19] <Jdlrobson>	 I'm appreciating the extra testing time :)
[21:28:05] <logmsgbot>	 !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:28:34] <logmsgbot>	 !log kemayo@deploy1003 kemayo: Continuing with sync
[21:32:44] <logmsgbot>	 !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] (duration: 06m 18s)
[21:32:57] <Kemayo>	 Jdlrobson: okay, all yours!
[21:35:22] <Jdlrobson>	 thanks!
[21:35:45] <wikibugs>	 (03PS1) 10Jdlrobson: Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882)
[21:36:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: 10Jdlrobson)
[21:48:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: 10Jdlrobson)
[21:48:31] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[21:49:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: 10Jdlrobson)
[21:49:14] <Jdlrobson>	 Flaky Wikibase test :(
[21:50:31] <wikibugs>	 (03Merged) 10jenkins-bot: Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: 10Jdlrobson)
[21:51:01] <wikibugs>	 (03CR) 10SBassett: [C:03+1] Undeploy Extension:StopForumSpam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: 10Reedy)
[21:58:21] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]]
[21:58:24] <stashbot>	 T414882: Additional top margin for Parsoid outputs for sections with no lead - https://phabricator.wikimedia.org/T414882
[22:00:02] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:01:42] <logmsgbot>	 !log jdlrobson@deploy1003 jdlrobson: Continuing with sync
[22:03:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:05:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:05:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:05:54] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] (duration: 07m 33s)
[22:05:57] <stashbot>	 T414882: Additional top margin for Parsoid outputs for sections with no lead - https://phabricator.wikimedia.org/T414882
[22:06:51] <Jdlrobson>	 All done.
[22:08:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:10:38] <wikibugs>	 06SRE, 06ServiceOps new, 07Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11784520 (10Scott_French) Moving this into #serviceops_new, since we're probably the right team to figure out how this should b...
[22:11:34] <wikibugs>	 (03PS1) 10Eevans: Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112)
[22:17:35] <wikibugs>	 (03CR) 10Eevans: [C:03+2] Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[22:19:31] <wikibugs>	 (03Merged) 10jenkins-bot: Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[22:20:22] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
[22:20:36] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
[22:40:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:40:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:43:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:45:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:59:29] <wikibugs>	 (03PS1) 10Eevans: Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112)
[23:02:03] <wikibugs>	 (03CR) 10Eevans: [C:03+2] Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[23:03:31] <jinxer-wm>	 FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[23:04:01] <wikibugs>	 (03Merged) 10jenkins-bot: Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[23:06:01] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
[23:06:07] <logmsgbot>	 !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
[23:28:38] <zabe>	 jouncebot: nowandnext
[23:28:38] <jouncebot>	 No deployments scheduled for the next 6 hour(s) and 31 minute(s)
[23:28:38] <jouncebot>	 In 6 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260403T0600)
[23:34:22] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Start reading from new file table in dewiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe)
[23:35:16] <wikibugs>	 (03Merged) 10jenkins-bot: Start reading from new file table in dewiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe)
[23:35:42] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]]
[23:35:45] <stashbot>	 T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[23:37:19] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:37:40] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[23:38:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784707 (10Eevans) >>! In T421439#11784276, @VRiley-WMF wrote: > This ticket seems like it relates to another ticket https://phabricator.wikimedia.org/T413559 >  > @Eevans this server is out of warranty, would...
[23:38:31] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[23:39:52] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267280
[23:39:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267280 (owner: 10TrainBranchBot)
[23:41:52] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]] (duration: 06m 10s)
[23:41:55] <stashbot>	 T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[23:51:27] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1267280 (owner: 10TrainBranchBot)
[23:51:34] <logmsgbot>	 !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[23:52:58] <logmsgbot>	 !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply