[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T0000)
[00:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:07:57] (Merged) jenkins-bot: ImageListPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1247068 (https://phabricator.wikimedia.org/T418327) (owner: Zabe)
[00:08:40] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1247068|ImageListPager: Properly support file schema migration read new (T418327)]]
[00:08:44] T418327: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Specials\Pager\ImageListPager): [1054] Unk - https://phabricator.wikimedia.org/T418327
[00:10:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P89614 and previous config saved to /var/cache/conftool/dbconfig/20260303-001018-marostegui.json
[00:10:29] !log zabe@deploy2002 zabe: Backport for [[gerrit:1247068|ImageListPager: Properly support file schema migration read new (T418327)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:11:15] !log zabe@deploy2002 zabe: Continuing with sync [00:13:11] !log zabe@deploy2002 Started scap sync-world: T418327 [00:14:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T418465)', diff saved to https://phabricator.wikimedia.org/P89615 and previous config saved to /var/cache/conftool/dbconfig/20260303-001440-marostegui.json [00:14:45] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [00:14:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1234.eqiad.wmnet with reason: Maintenance [00:15:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T418465)', diff saved to https://phabricator.wikimedia.org/P89616 and previous config saved to /var/cache/conftool/dbconfig/20260303-001504-marostegui.json [00:18:12] !log zabe@deploy2002 Finished scap sync-world: T418327 (duration: 05m 01s) [00:18:16] T418327: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Specials\Pager\ImageListPager): [1054] Unk - https://phabricator.wikimedia.org/T418327 [00:18:53] !log zabe@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [00:20:09] !log zabe@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [00:24:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:25:26] !log marostegui@cumin1003 dbctl 
commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P89617 and previous config saved to /var/cache/conftool/dbconfig/20260303-002525-marostegui.json
[00:26:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T418465)', diff saved to https://phabricator.wikimedia.org/P89618 and previous config saved to /var/cache/conftool/dbconfig/20260303-002604-marostegui.json
[00:26:08] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[00:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:31:38] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwlog1003.eqiad.wmnet with OS trixie
[00:33:53] FIRING: [3x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[00:34:57] (PS1) Zabe: Revert "ImageListPager: Properly support file schema migration read new" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1247189
[00:37:46] herron@cumin1003 reimage (PID 2266653) is awaiting input
[00:39:05] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1247190
[00:39:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1247190 (owner: TrainBranchBot)
[00:39:15]
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:39:45] (CR) Zabe: [C:+2] Revert "ImageListPager: Properly support file schema migration read new" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1247189 (owner: Zabe)
[00:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:40:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T418465)', diff saved to https://phabricator.wikimedia.org/P89619 and previous config saved to /var/cache/conftool/dbconfig/20260303-004033-marostegui.json
[00:40:37] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[00:40:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2216.codfw.wmnet with reason: Maintenance
[00:40:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T418465)', diff saved to https://phabricator.wikimedia.org/P89620 and previous config saved to /var/cache/conftool/dbconfig/20260303-004056-marostegui.json
[00:41:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to
https://phabricator.wikimedia.org/P89621 and previous config saved to /var/cache/conftool/dbconfig/20260303-004112-marostegui.json
[00:46:30] PROBLEM - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1215) taken on 2026-03-03 00:39:45 is 316 KiB, but the previous one was 118 KiB, a change of +169.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:50:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:50:59] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1247190 (owner: TrainBranchBot)
[00:51:05] (Merged) jenkins-bot: Revert "ImageListPager: Properly support file schema migration read new" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1247189 (owner: Zabe)
[00:51:44] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1247189|Revert "ImageListPager: Properly support file schema migration read new"]]
[00:51:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T418465)', diff saved to https://phabricator.wikimedia.org/P89622 and previous config saved to /var/cache/conftool/dbconfig/20260303-005156-marostegui.json
[00:52:00] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[00:53:13] !log herron@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mwlog2003.codfw.wmnet with OS trixie
[00:53:34] !log zabe@deploy2002 zabe: Backport for [[gerrit:1247189|Revert "ImageListPager: Properly support file schema migration read new"]] synced to the testservers (see
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:54:02] !log herron@cumin1003 START - Cookbook sre.hosts.reimage for host mwlog2003.codfw.wmnet with OS trixie [00:56:00] !log zabe@deploy2002 zabe: Continuing with sync [00:56:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P89623 and previous config saved to /var/cache/conftool/dbconfig/20260303-005620-marostegui.json [00:59:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:59:56] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247189|Revert "ImageListPager: Properly support file schema migration read new"]] (duration: 08m 12s) [01:01:20] PROBLEM - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2026-03-03 00:37:46 is 316 KiB, but the previous one was 118 KiB, a change of +168.7 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:05:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:07:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P89624 and previous config saved to /var/cache/conftool/dbconfig/20260303-010703-marostegui.json
[01:09:04] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1247193
[01:09:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1247193 (owner: TrainBranchBot)
[01:11:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T418465)', diff saved to https://phabricator.wikimedia.org/P89625 and previous config saved to /var/cache/conftool/dbconfig/20260303-011128-marostegui.json
[01:11:32] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[01:11:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[01:11:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T418465)', diff saved to https://phabricator.wikimedia.org/P89626 and previous config saved to /var/cache/conftool/dbconfig/20260303-011151-marostegui.json
[01:11:54] !log herron@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mwlog2003.codfw.wmnet with reason: host reimage
[01:17:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:17:18] (PS1)
CDanis: haproxy: ja3n is session-scoped [puppet] - https://gerrit.wikimedia.org/r/1247194
[01:18:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:18:43] (PS2) CDanis: haproxy: ja3n is session-scoped [puppet] - https://gerrit.wikimedia.org/r/1247194
[01:19:06] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwlog2003.codfw.wmnet with reason: host reimage
[01:22:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P89627 and previous config saved to /var/cache/conftool/dbconfig/20260303-012211-marostegui.json
[01:22:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T418465)', diff saved to https://phabricator.wikimedia.org/P89628 and previous config saved to /var/cache/conftool/dbconfig/20260303-012254-marostegui.json
[01:22:58] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[01:23:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:23:53] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve -
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:26:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:28:53] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:29:06] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1247193 (owner: TrainBranchBot)
[01:31:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:32:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:32:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[01:33:16] FIRING: ErrorBudgetBurn:
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.89% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:37:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T418465)', diff saved to https://phabricator.wikimedia.org/P89629 and previous config saved to /var/cache/conftool/dbconfig/20260303-013719-marostegui.json [01:37:23] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [01:37:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:38:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P89630 and previous config saved to /var/cache/conftool/dbconfig/20260303-013802-marostegui.json [01:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:40:16] FIRING: ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:42:49] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwlog2003.codfw.wmnet with OS trixie
[01:43:53] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[01:45:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:50:11] (CR) ArielGlenn: "I am still a little confused; I thought in the case of having a sub in both the access token and the cookie, if those aren't identical, we" [deployment-charts] - https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: Daniel Kinzler)
[01:50:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:51:32] (CR) ArielGlenn: [C:+1] "Still ok from me."
[deployment-charts] - https://gerrit.wikimedia.org/r/1243807 (owner: Daniel Kinzler)
[01:53:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P89631 and previous config saved to /var/cache/conftool/dbconfig/20260303-015309-marostegui.json
[01:57:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:00:50] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:07:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[02:08:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T418465)', diff saved to https://phabricator.wikimedia.org/P89632 and previous config saved to /var/cache/conftool/dbconfig/20260303-020817-marostegui.json
[02:08:21] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[02:08:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[02:08:48] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.18 [core] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247196
(https://phabricator.wikimedia.org/T413809)
[02:08:51] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 00s)
[02:08:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.18 [core] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247196 (https://phabricator.wikimedia.org/T413809) (owner: TrainBranchBot)
[02:08:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[02:08:53] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:16:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[02:18:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[02:21:11] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.18 [core] (wmf/1.46.0-wmf.18) -
https://gerrit.wikimedia.org/r/1247196 (https://phabricator.wikimedia.org/T413809) (owner: TrainBranchBot)
[02:21:46] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:21:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[02:31:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:33:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:34:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[02:37:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50
(Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:38:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:44:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:54:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:56:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T0300) [03:02:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1251.eqiad.wmnet with reason: Maintenance [03:02:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T418465)', diff saved to https://phabricator.wikimedia.org/P89633 and previous config saved to /var/cache/conftool/dbconfig/20260303-030217-marostegui.json [03:02:21] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - 
https://phabricator.wikimedia.org/T418465 [03:06:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:07:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:11:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:12:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:12:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T418465)', diff saved to https://phabricator.wikimedia.org/P89634 and previous config saved to /var/cache/conftool/dbconfig/20260303-031224-marostegui.json [03:12:28] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [03:13:54] FIRING: [3x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:16:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:24:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:27:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P89635 and previous config saved to /var/cache/conftool/dbconfig/20260303-032731-marostegui.json
[03:29:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:34:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:35:19] (CR) ArielGlenn: [C:+1] "This seems ok to me now." [deployment-charts] - https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler)
[03:39:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:42:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P89636 and previous config saved to /var/cache/conftool/dbconfig/20260303-034239-marostegui.json
[03:49:34] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:50:34] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:56:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:57:47] !log marostegui@cumin1003
dbctl commit (dc=all): 'Repooling after maintenance db1251 (T418465)', diff saved to https://phabricator.wikimedia.org/P89637 and previous config saved to /var/cache/conftool/dbconfig/20260303-035746-marostegui.json [03:57:50] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [03:58:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T0400) [04:01:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:02:08] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247200 (https://phabricator.wikimedia.org/T413809) [04:02:10] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247200 (https://phabricator.wikimedia.org/T413809) (owner: 10TrainBranchBot) [04:03:02] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247200 (https://phabricator.wikimedia.org/T413809) (owner: 10TrainBranchBot) [04:03:34] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.18 refs T413809 [04:03:37] T413809: 1.46.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T413809 [04:17:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:22:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:24:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:28:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:31:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:41:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:43:17] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.18 refs T413809 (duration: 39m 43s) [04:43:20] T413809: 1.46.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T413809 [04:47:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:55:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:57:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T0500) [05:01:13] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.15 (duration: 01m 10s) [05:09:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-eqord and Hurricane Electric (2001:504:0:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:12:51] 10ops-eqiad, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T418825 (10phaultfinder) 03NEW [05:16:20] (03PS3) 10Ryan Kemper: wdqs: Reduce deadlock remediation cooldown to 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1247178 (https://phabricator.wikimedia.org/T242453) [05:21:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:24:13] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:26:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:28:54] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:33:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:34:15] (03PS1) 10Medelius: Create message strings for experimental checks [extensions/VisualEditor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247434 (https://phabricator.wikimedia.org/T414987) [05:34:18] (03CR) 10Ryan Kemper: [C:03+2] wdqs: Reduce deadlock remediation cooldown to 15 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1247178 (https://phabricator.wikimedia.org/T242453) (owner: 
10Ryan Kemper) [05:37:25] ^ oops, just merged this but realized i set my yubikey down somewhere and don't remember exactly. if i don't find it before someone has something else to puppet-merge, go ahead and merge it for me :) [05:42:11] * ryankemper found it [05:42:25] (03CR) 10Giuseppe Lavagetto: [C:03+1] haproxy: ja3n is session-scoped [puppet] - 10https://gerrit.wikimedia.org/r/1247194 (owner: 10CDanis) [05:42:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:43:54] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:45:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:48:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1163.eqiad.wmnet with reason: Maintenance [05:48:17] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2240 gradually with 4 steps - repool after schema change [05:48:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2212.codfw.wmnet with reason: Maintenance [05:48:46] (03PS1) 10Marostegui: Revert "db2240: Long schema change" [puppet] - 10https://gerrit.wikimedia.org/r/1247435 [05:49:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:49:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-eqord and Hurricane Electric (2001:504:0:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:49:46] (03CR) 10Marostegui: [C:03+2] Revert "db2240: Long schema change" [puppet] - 10https://gerrit.wikimedia.org/r/1247435 (owner: 10Marostegui) [05:50:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:54:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:55:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:57:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:57:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:58:27] !log 
marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2147.codfw.wmnet with reason: Maintenance [05:58:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T418465)', diff saved to https://phabricator.wikimedia.org/P89639 and previous config saved to /var/cache/conftool/dbconfig/20260303-055834-marostegui.json [05:58:38] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [05:58:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:59:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:05:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:15:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:16:41] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim [06:17:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 
- https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:18:54] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:19:45] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim [06:25:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T418465)', diff saved to https://phabricator.wikimedia.org/P89642 and previous config saved to /var/cache/conftool/dbconfig/20260303-062507-marostegui.json [06:25:11] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [06:30:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:32:02] (03PS1) 10Marostegui: mariadb: db1219[1-8].yaml remove [puppet] - 10https://gerrit.wikimedia.org/r/1247438 [06:32:38] (03CR) 10Marostegui: "This is a noop as these hosts aren't even racked." 
[puppet] - 10https://gerrit.wikimedia.org/r/1247438 (owner: 10Marostegui) [06:32:41] (03CR) 10Marostegui: [C:03+2] mariadb: db1219[1-8].yaml remove [puppet] - 10https://gerrit.wikimedia.org/r/1247438 (owner: 10Marostegui) [06:33:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2240 gradually with 4 steps - repool after schema change [06:35:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:37:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:40:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P89644 and previous config saved to /var/cache/conftool/dbconfig/20260303-064015-marostegui.json [06:43:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1190.eqiad.wmnet with reason: Maintenance [06:44:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1190 (T418465)', diff saved to https://phabricator.wikimedia.org/P89645 and previous config saved to /var/cache/conftool/dbconfig/20260303-064405-marostegui.json [06:44:09] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [06:50:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:53:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: 
cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [06:55:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:55:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P89646 and previous config saved to /var/cache/conftool/dbconfig/20260303-065523-marostegui.json [06:57:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T0700) [07:00:05] marostegui, Amir1, and federico3: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T0700). 
nyaa~ [07:04:00] (03PS1) 10Ayounsi: asw1-22-ulsfo: add ACLs and infra BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1247439 (https://phabricator.wikimedia.org/T408892) [07:05:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:05:20] (03CR) 10CI reject: [V:04-1] asw1-22-ulsfo: add ACLs and infra BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1247439 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [07:09:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T418465)', diff saved to https://phabricator.wikimedia.org/P89647 and previous config saved to /var/cache/conftool/dbconfig/20260303-070940-marostegui.json [07:09:44] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:09:55] (03PS2) 10Ayounsi: asw1-22-ulsfo: add ACLs and infra BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1247439 (https://phabricator.wikimedia.org/T408892) [07:10:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T418465)', diff saved to https://phabricator.wikimedia.org/P89648 and previous config saved to /var/cache/conftool/dbconfig/20260303-071029-marostegui.json [07:10:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2155.codfw.wmnet with reason: Maintenance [07:10:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T418465)', diff saved to https://phabricator.wikimedia.org/P89649 and previous config saved to /var/cache/conftool/dbconfig/20260303-071054-marostegui.json [07:14:13] FIRING: [2x] SLOMetricAbsent: wdqs-scholarly-availability codfw - 
https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:15:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:18:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet [07:19:04] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11666586 (10ops-monitoring-bot) Draining ganeti1051.eqiad.wmnet of running VMs [07:20:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet [07:22:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:24:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P89650 and previous config saved to /var/cache/conftool/dbconfig/20260303-072447-marostegui.json [07:26:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:27:31] FIRING: [2x] ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:29:02] (03CR) 10Muehlenhoff: [C:03+2] pcc_update_facts: Rename variables [puppet] - 10https://gerrit.wikimedia.org/r/1227734 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [07:30:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:31:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:32:50] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T418825#11666598 (10VRiley-WMF) a:03VRiley-WMF [07:32:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:35:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:37:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:37:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:37:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:38:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T418465)', diff saved to https://phabricator.wikimedia.org/P89651 and previous config saved to 
/var/cache/conftool/dbconfig/20260303-073838-marostegui.json [07:38:42] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:39:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P89652 and previous config saved to /var/cache/conftool/dbconfig/20260303-073955-marostegui.json [07:39:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11666615 (10MoritzMuehlenhoff) 05Open→03Resolved Sounds good. The maxbinderwmf account is now disabled, resolving the task [07:40:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:42:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:43:51] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11666621 (10JMeybohm) [07:44:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:45:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:47:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:53:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P89653 and previous config saved to /var/cache/conftool/dbconfig/20260303-075345-marostegui.json [07:55:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T418465)', diff saved to https://phabricator.wikimedia.org/P89654 and previous config saved to /var/cache/conftool/dbconfig/20260303-075502-marostegui.json [07:55:07] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [07:55:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:55:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1199.eqiad.wmnet with reason: Maintenance [07:55:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1199 (T418465)', diff saved to https://phabricator.wikimedia.org/P89655 and previous config saved to /var/cache/conftool/dbconfig/20260303-075526-marostegui.json [08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:03:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11666650 (10Marostegui) a:05Marostegui→03None Un-assigning this as I am no longer working on this task as I was the point of contac... [08:05:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:06:36] (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 on ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247512 (https://phabricator.wikimedia.org/T417253) [08:07:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:07:34] !log installing PAM security updates on Bookworm [08:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P89656 and previous config saved to /var/cache/conftool/dbconfig/20260303-080853-marostegui.json [08:10:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:11:16] (03CR) 10Fabfur: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1247512 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [08:14:26] (03PS2) 10Fabfur: hiera: set haproxy version to 3.0 on ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247512 (https://phabricator.wikimedia.org/T417253) [08:14:28] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247512 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [08:19:27] (03PS1) 10Brouberol: growhbook: allow WMDE engineers to self-enroll [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247513 (https://phabricator.wikimedia.org/T418665) [08:20:13] (03CR) 10Slyngshede: [C:03+1] "Exciting :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1247512 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [08:22:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T418465)', diff saved to https://phabricator.wikimedia.org/P89657 and previous config saved to /var/cache/conftool/dbconfig/20260303-082209-marostegui.json [08:22:13] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:22:55] (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 on eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247515 (https://phabricator.wikimedia.org/T417253) [08:22:58] (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 on codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247516 (https://phabricator.wikimedia.org/T417253) [08:23:00] (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 on ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247517 (https://phabricator.wikimedia.org/T417253) [08:23:17] (03PS2) 10Fabfur: hiera: set haproxy version to 3.0 on drmrs cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247517 (https://phabricator.wikimedia.org/T417253) [08:24:01] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T418465)', diff saved to https://phabricator.wikimedia.org/P89658 and previous config saved to /var/cache/conftool/dbconfig/20260303-082400-marostegui.json [08:24:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:24:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T418465)', diff saved to https://phabricator.wikimedia.org/P89659 and previous config saved to /var/cache/conftool/dbconfig/20260303-082424-marostegui.json [08:25:59] (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 on eqiad cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247518 (https://phabricator.wikimedia.org/T417253) [08:26:01] (03PS1) 10Fabfur: hiera: set haproxy version to 3.0 on esams cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247519 (https://phabricator.wikimedia.org/T417253) [08:27:32] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/ml-serve-codfw: maintenance [08:28:33] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/ml-serve-codfw: maintenance [08:29:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:30:44] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[08:31:46] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11666729 (10MoritzMuehlenhoff) [08:31:47] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:32:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:32:46] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:32:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:32:59] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[08:33:00] (03CR) 10Ayounsi: [C:03+2] asw1-22-ulsfo: add ACLs and infra BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1247439 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:34:44] (03Merged) 10jenkins-bot: asw1-22-ulsfo: add ACLs and infra BGP sessions [homer/public] - 10https://gerrit.wikimedia.org/r/1247439 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [08:37:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P89660 and previous config saved to /var/cache/conftool/dbconfig/20260303-083716-marostegui.json [08:37:22] (03CR) 10Fabfur: [C:03+2] hiera: set haproxy version to 3.0 on ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247512 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [08:37:48] !log start upgrading haproxy to 3.0 on A:cp-ulsfo (T417253) [08:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:51] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11666758 (10Jelto) [08:37:51] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [08:38:21] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11666762 (10Jelto) [08:40:45] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11666770 (10JMeybohm) [08:41:37] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp - 3.0 upgrade () [08:41:45] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp - 3.0 upgrade () [08:42:51] RESOLVED: 
SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:43:54] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:44:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [08:45:10] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11666786 (10Gehel) [08:45:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:45:32] (03PS1) 10Jelto: admin: add dtotten to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1247522 (https://phabricator.wikimedia.org/T418415) [08:46:19] (03CR) 10CI reject: [V:04-1] admin: add dtotten to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1247522 (https://phabricator.wikimedia.org/T418415) (owner: 10Jelto) [08:46:57] (03PS1) 10Tiziano Fogli: grafana::ldap_sync: disable 
systemd timer (TMP) [puppet] - 10https://gerrit.wikimedia.org/r/1247523 (https://phabricator.wikimedia.org/T418118) [08:47:54] !log powercycling lvs1013 [08:47:55] (03PS2) 10Jelto: admin: add dtotten to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1247522 (https://phabricator.wikimedia.org/T418415) [08:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1247522 (https://phabricator.wikimedia.org/T418415) (owner: 10Jelto) [08:50:01] (03CR) 10Jelto: [C:04-1] "thank you Moritz! out-of-band verification still pending" [puppet] - 10https://gerrit.wikimedia.org/r/1247522 (https://phabricator.wikimedia.org/T418415) (owner: 10Jelto) [08:50:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T418465)', diff saved to https://phabricator.wikimedia.org/P89661 and previous config saved to /var/cache/conftool/dbconfig/20260303-085019-marostegui.json [08:50:23] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [08:52:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P89662 and previous config saved to /var/cache/conftool/dbconfig/20260303-085224-marostegui.json [08:52:26] (03CR) 10JMeybohm: [C:03+2] conftool-data: Fix YAML syntax [puppet] - 10https://gerrit.wikimedia.org/r/1247099 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [08:52:32] (03CR) 10JMeybohm: [C:03+2] Add wikikube-worker[1350-1351] [puppet] - 10https://gerrit.wikimedia.org/r/1247100 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [08:53:12] (03PS4) 10Tiziano Fogli: ldap_users_sync.py: format code [puppet] - 10https://gerrit.wikimedia.org/r/1243062 
(https://phabricator.wikimedia.org/T418118) [08:53:13] (03PS2) 10Tiziano Fogli: grafana/ldap_users_sync: delete a user if it has invalid metadata [puppet] - 10https://gerrit.wikimedia.org/r/1247076 (https://phabricator.wikimedia.org/T418118) [08:53:13] (03PS2) 10Tiziano Fogli: grafana::ldap_sync: disable systemd timer (TMP) [puppet] - 10https://gerrit.wikimedia.org/r/1247523 (https://phabricator.wikimedia.org/T418118) [08:53:41] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/ml-serve-codfw: maintenance [08:54:36] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/ml-serve-codfw: maintenance [08:56:44] (03PS1) 10Slyngshede: Release version 0.1.16 [software/bitu] - 10https://gerrit.wikimedia.org/r/1247529 [08:58:04] (03CR) 10Tiziano Fogli: [C:03+2] ldap_users_sync.py: format code [puppet] - 10https://gerrit.wikimedia.org/r/1243062 (https://phabricator.wikimedia.org/T418118) (owner: 10Tiziano Fogli) [08:58:14] (03CR) 10Tiziano Fogli: [C:03+2] grafana/ldap_users_sync: delete a user if it has invalid metadata [puppet] - 10https://gerrit.wikimedia.org/r/1247076 (https://phabricator.wikimedia.org/T418118) (owner: 10Tiziano Fogli) [09:01:04] (03PS3) 10Tiziano Fogli: grafana/ldap_users_sync: delete a user if it has invalid metadata [puppet] - 10https://gerrit.wikimedia.org/r/1247076 (https://phabricator.wikimedia.org/T418118) [09:01:04] (03PS3) 10Tiziano Fogli: grafana::ldap_sync: disable systemd timer (TMP) [puppet] - 10https://gerrit.wikimedia.org/r/1247523 (https://phabricator.wikimedia.org/T418118) [09:02:06] (03CR) 10Tiziano Fogli: [C:03+2] grafana/ldap_users_sync: delete a user if it has invalid metadata [puppet] - 10https://gerrit.wikimedia.org/r/1247076 (https://phabricator.wikimedia.org/T418118) (owner: 10Tiziano Fogli) [09:02:09] (03CR) 10JMeybohm: [C:04-1] "AIUI this has not been deployed, I would refrain from adding a service to the catalog 
before it's actually running." [puppet] - 10https://gerrit.wikimedia.org/r/1247175 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [09:02:17] (03CR) 10Tiziano Fogli: [C:03+2] grafana::ldap_sync: disable systemd timer (TMP) [puppet] - 10https://gerrit.wikimedia.org/r/1247523 (https://phabricator.wikimedia.org/T418118) (owner: 10Tiziano Fogli) [09:02:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:04:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:05:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P89663 and previous config saved to /var/cache/conftool/dbconfig/20260303-090526-marostegui.json [09:05:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:07:22] (03CR) 10JMeybohm: [C:04-1] service: add linked-artifact service (k8s ingress) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1247175 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [09:07:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T418465)', diff saved to 
https://phabricator.wikimedia.org/P89664 and previous config saved to /var/cache/conftool/dbconfig/20260303-090731-marostegui.json [09:07:33] FIRING: KubernetesCalicoDown: wikikube-worker1350.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1350.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:07:36] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:07:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1221.eqiad.wmnet with reason: Maintenance [09:08:10] (03PS1) 10Tiziano Fogli: grafana/ldap_users_sync.py: add missing parameter to sync_ldap_users() [puppet] - 10https://gerrit.wikimedia.org/r/1247532 (https://phabricator.wikimedia.org/T418118) [09:08:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 6 hosts with reason: Maintenance [09:08:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T418465)', diff saved to https://phabricator.wikimedia.org/P89665 and previous config saved to /var/cache/conftool/dbconfig/20260303-090818-marostegui.json [09:08:59] (03CR) 10Tiziano Fogli: [C:03+2] grafana/ldap_users_sync.py: add missing parameter to sync_ldap_users() [puppet] - 10https://gerrit.wikimedia.org/r/1247532 (https://phabricator.wikimedia.org/T418118) (owner: 10Tiziano Fogli) [09:09:59] (03CR) 10Hashar: "@cdanis@wikimedia.org & @vgutierrez@wikimedia.org may you review this change which is about aligning timeout between Apache mod_proxy and " [puppet] - 10https://gerrit.wikimedia.org/r/1241048 (https://phabricator.wikimedia.org/T246763) (owner: 10Hashar) [09:10:16] FIRING: ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:12:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:13:26] (03PS1) 10Arnaudb: gerrit: add gerrit-spare to discovery records [dns] - 10https://gerrit.wikimedia.org/r/1247524 (https://phabricator.wikimedia.org/T418361) [09:13:54] (03PS1) 10Arnaudb: cache:text: add gerrit-spare to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247528 (https://phabricator.wikimedia.org/T418361) [09:14:13] (03PS1) 10Arnaudb: trafficserver: Add gerrit-spare backend [puppet] - 10https://gerrit.wikimedia.org/r/1247527 (https://phabricator.wikimedia.org/T418361) [09:14:30] (03PS1) 10Arnaudb: gerrit: move gerrit-spare behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1247531 (https://phabricator.wikimedia.org/T418361) [09:16:58] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1247527 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [09:17:37] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1247524 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [09:17:40] !log installing libbpf updates from Bookworm point release [09:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:35] (03CR) 10Arnaudb: [C:03+2] gerrit: add gerrit-spare to discovery records [dns] - 10https://gerrit.wikimedia.org/r/1247524 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [09:18:40] !log arnaudb@dns1004 START - running authdns-update [09:19:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1247529 (owner: 10Slyngshede) [09:19:49] !log arnaudb@dns1004 END - running 
authdns-update [09:19:50] (03CR) 10JMeybohm: [C:04-1] "Tests fail for me with:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [09:20:06] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:20:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P89666 and previous config saved to /var/cache/conftool/dbconfig/20260303-092034-marostegui.json [09:20:38] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp - 3.0 upgrade () [09:20:39] !log fceratto@cumin1003 START - Cookbook sre.mysql.update-replication [09:21:18] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.update-replication (exit_code=0) [09:22:33] RESOLVED: KubernetesCalicoDown: wikikube-worker1350.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1350.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:22:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:23:53] !log 
fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db1176.eqiad.wmnet with OS trixie [09:23:59] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp - 3.0 upgrade () [09:25:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:26:20] (03PS5) 10Daniel Kinzler: rest-gateway: use rlc claim from cookie with bearer token [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) [09:26:24] (03CR) 10Daniel Kinzler: "The problem is that the sub claims wouldn't be identical - they would refer to the same user, and contain the same global user ID, but the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [09:29:16] (03PS1) 10Arnaudb: gerrit: dns cache wipe update [cookbooks] - 10https://gerrit.wikimedia.org/r/1247534 (https://phabricator.wikimedia.org/T418108) [09:29:16] (03CR) 10Arnaudb: [C:04-1] "this should be merged after both remaining gerrit instances are moved behind CDN" [cookbooks] - 10https://gerrit.wikimedia.org/r/1247534 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [09:29:24] (03CR) 10Daniel Kinzler: "Filed T418835" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: 10Daniel Kinzler) [09:31:18] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11666968 (10Jelto) [09:32:21] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[09:32:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T418465)', diff saved to https://phabricator.wikimedia.org/P89667 and previous config saved to /var/cache/conftool/dbconfig/20260303-093224-marostegui.json [09:32:28] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:33:54] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:41] (03PS1) 10Arnaudb: gerrit: move gerrit-replica behind CDN [dns] - 10https://gerrit.wikimedia.org/r/1247530 (https://phabricator.wikimedia.org/T418108) [09:34:41] (03CR) 10Arnaudb: [C:04-1] "to be merged once we're done with T417998" [dns] - 10https://gerrit.wikimedia.org/r/1247530 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [09:35:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T418465)', diff saved to https://phabricator.wikimedia.org/P89668 and previous config saved to /var/cache/conftool/dbconfig/20260303-093542-marostegui.json [09:35:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2199.codfw.wmnet with reason: Maintenance [09:36:21] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11666990 (10MoritzMuehlenhoff) [09:37:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:38:02] !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage [09:38:42] (03PS1) 10Tiziano Fogli: Revert "grafana::ldap_sync: disable systemd timer (TMP)" [puppet] - 10https://gerrit.wikimedia.org/r/1247536 [09:39:31] (03CR) 10Tiziano Fogli: [C:03+2] Revert "grafana::ldap_sync: disable systemd timer (TMP)" [puppet] - 10https://gerrit.wikimedia.org/r/1247536 (owner: 10Tiziano Fogli) [09:39:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11667003 (10MatthewVernon) 05Resolved→03Open @Jclark-ctr can you take another look at these, please? In neither system can the OS see any of the spinning disks, which... [09:40:13] (03PS1) 10Jelto: admin: add milimetric to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/1247537 (https://phabricator.wikimedia.org/T417906) [09:40:32] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [09:42:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:43:45] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[09:44:05] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage [09:44:13] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:44:22] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [09:44:31] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . [09:44:40] !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [09:45:10] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1247539 [09:45:21] !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [09:45:57] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [09:47:07] !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [09:47:12] !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [09:47:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P89669 and previous config saved to /var/cache/conftool/dbconfig/20260303-094732-marostegui.json [09:48:41] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [09:48:59] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[09:50:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1247537 (https://phabricator.wikimedia.org/T417906) (owner: 10Jelto) [09:51:19] !log installing qemu security updates [09:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:22] (03CR) 10Arnaudb: [C:03+2] trafficserver: Add gerrit-spare backend [puppet] - 10https://gerrit.wikimedia.org/r/1247527 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [09:52:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:53:13] (03CR) 10Btullis: [C:03+1] admin: add milimetric to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/1247537 (https://phabricator.wikimedia.org/T417906) (owner: 10Jelto) [09:53:37] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:54:26] (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - 10https://gerrit.wikimedia.org/r/1247539 (owner: 10Muehlenhoff) [09:54:34] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11667054 (10MatthewVernon) Perhaps instead for the odd non-web format (which seem... 
[09:55:35] (03PS1) 10Hashar: wm-checks-api: add tag for Selenium jobs [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1247540 [09:55:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:56:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2206.codfw.wmnet with reason: Maintenance [09:56:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T418465)', diff saved to https://phabricator.wikimedia.org/P89670 and previous config saved to /var/cache/conftool/dbconfig/20260303-095655-marostegui.json [09:56:59] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [09:57:11] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:57:16] 06SRE, 06Infrastructure-Foundations, 10netops, 10Prod-Kubernetes, 06ServiceOps new: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11667074 (10BTullis) [09:59:11] PROBLEM - MariaDB Replica SQL: s7 on db1171 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:11] PROBLEM - MariaDB Replica SQL: s8 on db1171 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:12] PROBLEM - MariaDB Replica SQL: s3 #page on db1175 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:12] PROBLEM - MariaDB Replica SQL: s7 #page on db1170 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:13] PROBLEM - MariaDB Replica SQL: s5 #page on db1161 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:14] PROBLEM - MariaDB Replica SQL: s7 #page on db1158 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:15] PROBLEM - MariaDB Replica SQL: s3 #page on db1157 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. 
Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:16] PROBLEM - MariaDB Replica SQL: s6 #page on db1168 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:16] PROBLEM - MariaDB Replica SQL: s2 #page on db1162 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:17] PROBLEM - MariaDB Replica SQL: s7 #page on db1174 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:18] PROBLEM - MariaDB Replica SQL: s8 #page on db1172 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:18] PROBLEM - MariaDB Replica SQL: s5 #page on db1185 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:19] PROBLEM - MariaDB Replica SQL: x1 #page on db1179 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:21] PROBLEM - MariaDB Replica SQL: s6 #page on db1165 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:22] PROBLEM - MariaDB Replica SQL: s2 #page on db1156 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:23] PROBLEM - MariaDB Replica SQL: s2 #page on db1188 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:24] PROBLEM - MariaDB Replica SQL: s5 #page on db1159 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:24] PROBLEM - MariaDB Replica SQL: s8 #page on db1167 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. 
Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:25] PROBLEM - MariaDB Replica SQL: s3 #page on db1166 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:26] PROBLEM - MariaDB Replica SQL: s8 #page on db1193 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:27] PROBLEM - MariaDB Replica SQL: s8 #page on db1192 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:27] PROBLEM - MariaDB Replica SQL: s2 #page on db1197 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:28] PROBLEM - MariaDB Replica SQL: s5 #page on db1200 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:29] PROBLEM - MariaDB Replica SQL: s8 #page on db1203 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:30] PROBLEM - MariaDB Replica SQL: s8 #page on db1209 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:30] PROBLEM - MariaDB Replica SQL: s5 #page on db1207 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:31] PROBLEM - MariaDB Replica SQL: x3 #page on db1211 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:32] PROBLEM - MariaDB Replica SQL: s5 #page on db1210 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:32] PROBLEM - MariaDB Replica SQL: x3 on db1216 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. 
Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:33] PROBLEM - MariaDB Replica SQL: x1 #page on db1224 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:33] PROBLEM - MariaDB Replica SQL: x1 on db1216 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:33] PROBLEM - MariaDB Replica SQL: x1 on db1225 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:33] PROBLEM - MariaDB Replica SQL: s5 on db1216 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:34] PROBLEM - MariaDB Replica SQL: x1 #page on db1220 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:36] PROBLEM - MariaDB Replica SQL: m2 on db1217 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. 
Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:36] PROBLEM - MariaDB Replica SQL: m3 on db1217 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:41] PROBLEM - MariaDB Replica SQL: s8 #page on db2152 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:41] PROBLEM - MariaDB Replica SQL: s8 #page on db2154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:42] PROBLEM - MariaDB Replica SQL: s8 #page on db2161 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:43] PROBLEM - MariaDB Replica SQL: s8 #page on db2164 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:44] PROBLEM - MariaDB Replica SQL: s8 #page on db2163 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:59:46] ugh [09:59:49] that's unexpected. Anyhow i stopped the script [10:00:03] PROBLEM - MariaDB Replica SQL: s8 #page on db1178 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:03] PROBLEM - MariaDB Replica SQL: s2 #page on db1182 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:04] PROBLEM - MariaDB Replica SQL: s8 #page on db1177 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:05] PROBLEM - MariaDB Replica SQL: s7 #page on db2208 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:07] PROBLEM - MariaDB Replica SQL: s7 on db2200 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:07] PROBLEM - MariaDB Replica SQL: s5 on db2201 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:07] PROBLEM - MariaDB Replica SQL: s1 on db2141 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:08] PROBLEM - MariaDB Replica SQL: s5 #page on db2211 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:08] PROBLEM - MariaDB Replica SQL: s5 #page on db2178 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:09] PROBLEM - MariaDB Replica SQL: s7 #page on db2182 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:10] PROBLEM - MariaDB Replica SQL: s1 #page on db2176 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:10] !ack [10:00:10] PROBLEM - MariaDB Replica SQL: s1 #page on db2174 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:11] PROBLEM - MariaDB Replica SQL: s2 #page on db2175 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:12] PROBLEM - MariaDB Replica SQL: s3 #page on db2177 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:12] PROBLEM - MariaDB Replica SQL: s5 #page on db2192 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:13] PROBLEM - MariaDB Replica SQL: s3 #page on db2194 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:14] PROBLEM - MariaDB Replica SQL: s1 #page on db2188 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:15] PROBLEM - MariaDB Replica SQL: x1 #page on db2186 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:15] PROBLEM - MariaDB Replica SQL: s3 #page on db2190 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:19] PROBLEM - MariaDB Replica SQL: s7 on db2198 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:19] PROBLEM - MariaDB Replica SQL: s1 #page on db2203 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. 
Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:22] PROBLEM - MariaDB Replica SQL: s7 #page on db1181 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:22] it ran ok on test hosts [10:00:23] PROBLEM - MariaDB Replica SQL: s4 #page on db1160 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:24] PROBLEM - MariaDB Replica SQL: s3 #page on db1189 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:25] <_joe_> lol poor sirenbot [10:00:28] PROBLEM - MariaDB Replica SQL: s2 #page on db1222 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:28] PROBLEM - MariaDB Replica SQL: s5 #page on db1230 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:33] <_joe_> marostegui: need help? 
[10:00:36] PROBLEM - MariaDB Replica SQL: x1 #page on db1237 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:40] PROBLEM - MariaDB Replica SQL: s7 #page on db2150 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:40] PROBLEM - MariaDB Replica SQL: s1 #page on db2145 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:41] PROBLEM - MariaDB Replica SQL: s5 #page on db2157 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:42] PROBLEM - MariaDB Replica SQL: s2 #page on db2148 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:43] PROBLEM - MariaDB Replica SQL: s7 #page on db2159 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:43] PROBLEM - MariaDB Replica SQL: s1 #page on db2146 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:44] PROBLEM - MariaDB Replica SQL: s3 #page on db2149 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:45] PROBLEM - MariaDB Replica SQL: s3 #page on db2156 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:46] PROBLEM - MariaDB Replica SQL: s1 #page on db2170 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:46] PROBLEM - MariaDB Replica SQL: s1 #page on db2153 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:47] <_joe_> I guess just skipping that specific query should be enough right? 
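[Note: the suggestion above, skipping the failing DROP USER event on each broken replica, can be sketched as follows. This is a minimal illustration and not necessarily what was run during the incident. The helper name `skip_one_event_statements` is hypothetical. It assumes a single replication connection per instance and that MariaDB accepts `sql_slave_skip_counter` in the replica's GTID and parallel-replication configuration, which is not guaranteed.]

```python
def skip_one_event_statements(skip_count: int = 1) -> list[str]:
    """Build the SQL statements that skip `skip_count` replicated events.

    Hypothetical helper: it only builds the statements and does not
    connect to any database. Run the output by hand on one replica.
    """
    if skip_count < 1:
        raise ValueError("skip_count must be at least 1")
    return [
        # The replication threads must be stopped before the counter is set.
        "STOP SLAVE;",
        # Skip the next N events from the primary (here, the failing DROP USER).
        f"SET GLOBAL sql_slave_skip_counter = {skip_count};",
        # Resume replication from the event after the skipped one.
        "START SLAVE;",
    ]


if __name__ == "__main__":
    for statement in skip_one_event_statements():
        print(statement)
```

[After running the statements, SHOW SLAVE STATUS on the replica should report Slave_SQL_Running: Yes if the skipped event was the only failing one.]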
[10:00:47] PROBLEM - MariaDB Replica SQL: s5 #page on db2171 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:48] PROBLEM - MariaDB Replica SQL: s7 #page on db2168 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:49] PROBLEM - MariaDB Replica SQL: s1 #page on db2173 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:49] PROBLEM - MariaDB Replica SQL: s3 #page on db2205 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:52] PROBLEM - MariaDB Replica SQL: s2 #page on db2189 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:55] Probably [10:00:55] PROBLEM - MariaDB Replica SQL: s2 on db2197 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:00:57] let's switch to -sre [10:00:58] there's a backlog of alerts [10:01:00] PROBLEM - MariaDB Replica SQL: s7 #page on db2218 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:22] PROBLEM - MariaDB Replica SQL: s6 #page on db1173 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:36] PROBLEM - MariaDB Replica SQL: x3 #page on db1255 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:37] PROBLEM - MariaDB Replica SQL: es7 #page on es1035 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:42] PROBLEM - MariaDB Replica SQL: s6 #page on db2158 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:43] PROBLEM - MariaDB Replica SQL: s6 #page on db2151 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:43] PROBLEM - MariaDB Replica SQL: s6 #page on db2169 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:44] PROBLEM - MariaDB Replica SQL: x3 #page on db2162 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:50] PROBLEM - MariaDB Replica SQL: s6 #page on db2224 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:59] PROBLEM - MariaDB Replica SQL: es6 #page on es1037 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:59] PROBLEM - MariaDB Replica SQL: es6 #page on es1036 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. 
Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:00] PROBLEM - MariaDB Replica SQL: es6 #page on es1038 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:03] PROBLEM - MariaDB Replica SQL: s6 #page on db2214 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:05] !ack [10:02:06] 7597 (ACKED) db1173 (paged)/MariaDB Replica SQL: s6 (paged) [10:02:06] 7598 (ACKED) db1255 (paged)/MariaDB Replica SQL: x3 (paged) [10:02:06] 7599 (ACKED) es1035 (paged)/MariaDB Replica SQL: es7 (paged) [10:02:06] 7600 (ACKED) db2151 (paged)/MariaDB Replica SQL: s6 (paged) [10:02:07] 7601 (ACKED) db2158 (paged)/MariaDB Replica SQL: s6 (paged) [10:02:07] 7602 (ACKED) db2169 (paged)/MariaDB Replica SQL: s6 (paged) [10:02:07] 7603 (ACKED) db2224 (paged)/MariaDB Replica SQL: s6 (paged) [10:02:08] 7604 (ACKED) db2162 (paged)/MariaDB Replica SQL: x3 (paged) [10:02:08] 7605 (ACKED) es1038 (paged)/MariaDB Replica SQL: es6 (paged) [10:02:09] 7606 (ACKED) es1037 (paged)/MariaDB Replica SQL: es6 (paged) [10:02:09] 7607 (ACKED) es1036 (paged)/MariaDB Replica SQL: es6 (paged) [10:02:14] !ack [10:02:14] 7608 (ACKED) db2214 (paged)/MariaDB Replica SQL: s6 (paged) [10:02:15] PROBLEM - MariaDB Replica SQL: s6 #page on db2217 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:15] PROBLEM - MariaDB Replica SQL: s6 #page on db2180 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:16] PROBLEM - MariaDB Replica SQL: s6 #page on db2193 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:17] PROBLEM - MariaDB Replica SQL: x3 #page on db2187 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:19] PROBLEM - MariaDB Replica SQL: s6 on db2197 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:19] PROBLEM - MariaDB Replica SQL: x3 on db2200 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:02:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P89671 and previous config saved to /var/cache/conftool/dbconfig/20260303-100240-marostegui.json [10:02:41] marostegui@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [10:03:01] !acj [10:03:03] !ack [10:03:04] 7609 (ACKED) db2180 (paged)/MariaDB Replica SQL: s6 (paged) [10:03:04] 7610 (ACKED) db2193 (paged)/MariaDB Replica SQL: s6 (paged) [10:03:04] 7611 (ACKED) db2217 (paged)/MariaDB Replica SQL: s6 (paged) [10:03:04] 7612 (ACKED) db2187 (paged)/MariaDB Replica SQL: x3 (paged) [10:03:41] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:03:42] dpogorzelski@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [10:05:43] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:05:44] dpogorzelski@deploy2002: Failed to log message to wiki. Somebody should check the error logs. 
[10:07:17] SRE, Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11667122 (MoritzMuehlenhoff) [10:07:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1176.eqiad.wmnet with OS trixie [10:07:21] PROBLEM - MariaDB Replica Lag: x3 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:21] PROBLEM - MariaDB Replica Lag: s5 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:21] PROBLEM - MariaDB Replica Lag: s8 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:21] fceratto@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[10:07:24] PROBLEM - MariaDB Replica Lag: s2 #page on db1182 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:24] PROBLEM - MariaDB Replica Lag: s8 #page on db1178 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:25] PROBLEM - MariaDB Replica Lag: s7 #page on db1158 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:26] PROBLEM - MariaDB Replica Lag: s5 #page on db1161 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:27] PROBLEM - MariaDB Replica Lag: s3 #page on db1175 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:27] PROBLEM - MariaDB Replica Lag: s2 #page on db1162 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:28] PROBLEM - MariaDB Replica Lag: s3 #page on db1157 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:29] PROBLEM - MariaDB Replica Lag: s3 #page on db1166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:30] PROBLEM - MariaDB Replica Lag: s2 #page on db1156 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:30] PROBLEM - MariaDB Replica Lag: s6 #page on db1165 is CRITICAL: CRITICAL slave_sql_lag 
Replication lag: 630.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:31] PROBLEM - MariaDB Replica Lag: s7 #page on db1174 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:32] PROBLEM - MariaDB Replica Lag: s8 #page on db1172 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:33] PROBLEM - MariaDB Replica Lag: s8 #page on db1177 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:33] !ack [10:07:33] PROBLEM - MariaDB Replica Lag: s7 #page on db1170 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:34] 7613 (ACKED) db1182 (paged)/MariaDB Replica Lag: s2 (paged) [10:07:34] 7614 (ACKED) db1178 (paged)/MariaDB Replica Lag: s8 (paged) [10:07:34] 7615 (ACKED) db1158 (paged)/MariaDB Replica Lag: s7 (paged) [10:07:34] PROBLEM - MariaDB Replica Lag: s8 #page on db1193 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:34] 7616 (ACKED) db1161 (paged)/MariaDB Replica Lag: s5 (paged) [10:07:35] 7617 (ACKED) db1162 (paged)/MariaDB Replica Lag: s2 (paged) [10:07:35] 7618 (ACKED) db1175 (paged)/MariaDB Replica Lag: s3 (paged) [10:07:35] 7619 (ACKED) db1166 (paged)/MariaDB Replica Lag: s3 (paged) [10:07:35] PROBLEM - MariaDB Replica Lag: x1 #page on db1179 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:36] 7620 (ACKED) db1157 (paged)/MariaDB Replica Lag: s3 (paged) [10:07:36] 7621 (ACKED) db1165 
(paged)/MariaDB Replica Lag: s6 (paged) [10:07:36] PROBLEM - MariaDB Replica Lag: s5 #page on db1159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:37] PROBLEM - MariaDB Replica Lag: s8 #page on db1167 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:37] 7622 (ACKED) db1156 (paged)/MariaDB Replica Lag: s2 (paged) [10:07:37] PROBLEM - MariaDB Replica Lag: s5 #page on db1185 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:38] PROBLEM - MariaDB Replica Lag: s2 #page on db1197 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:39] PROBLEM - MariaDB Replica Lag: s2 #page on db1188 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:40] PROBLEM - MariaDB Replica Lag: s6 #page on db1168 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:40] PROBLEM - MariaDB Replica Lag: s8 #page on db1192 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:43] PROBLEM - MariaDB Replica Lag: s5 #page on db1200 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:43] PROBLEM - MariaDB Replica Lag: s8 #page on db1203 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 647.51 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:43] <_joe_> marostegui: don't ack alerts, we can ack them [10:07:44] PROBLEM - MariaDB Replica Lag: s5 #page on db1207 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:45] PROBLEM - MariaDB Replica Lag: s8 #page on db1209 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:46] PROBLEM - MariaDB Replica Lag: x3 #page on db1211 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.72 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:46] PROBLEM - MariaDB Replica Lag: s5 #page on db1210 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:46] <_joe_> !ack [10:07:47] PROBLEM - MariaDB Replica Lag: x1 #page on db1224 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 644.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:47] 7623 (ACKED) db1172 (paged)/MariaDB Replica Lag: s8 (paged) [10:07:48] 7624 (ACKED) db1174 (paged)/MariaDB Replica Lag: s7 (paged) [10:07:48] 7625 (ACKED) db1177 (paged)/MariaDB Replica Lag: s8 (paged) [10:07:48] PROBLEM - MariaDB Replica Lag: x1 #page on db1220 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 644.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:48] 7626 (ACKED) db1193 (paged)/MariaDB Replica Lag: s8 (paged) [10:07:48] 7627 (ACKED) db1170 (paged)/MariaDB Replica Lag: s7 (paged) [10:07:48] 7628 (ACKED) db1159 (paged)/MariaDB Replica Lag: s5 (paged) [10:07:49] 7629 (ACKED) db1179 (paged)/MariaDB Replica Lag: x1 (paged) [10:07:49] 7630 (ACKED) db1167 (paged)/MariaDB Replica Lag: s8 
(paged) [10:07:49] 7631 (ACKED) db1188 (paged)/MariaDB Replica Lag: s2 (paged) [10:07:50] 7632 (ACKED) db1197 (paged)/MariaDB Replica Lag: s2 (paged) [10:07:50] 7633 (ACKED) db1185 (paged)/MariaDB Replica Lag: s5 (paged) [10:07:50] PROBLEM - MariaDB Replica Lag: s8 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:51] 7634 (ACKED) db1168 (paged)/MariaDB Replica Lag: s6 (paged) [10:07:51] 7635 (ACKED) db1192 (paged)/MariaDB Replica Lag: s8 (paged) [10:07:51] !ack [10:07:52] 7636 (ACKED) db1200 (paged)/MariaDB Replica Lag: s5 (paged) [10:07:52] 7637 (ACKED) db1207 (paged)/MariaDB Replica Lag: s5 (paged) [10:07:53] 7638 (ACKED) db1209 (paged)/MariaDB Replica Lag: s8 (paged) [10:07:53] 7639 (ACKED) db1203 (paged)/MariaDB Replica Lag: s8 (paged) [10:07:53] PROBLEM - MariaDB Replica Lag: s8 #page on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:54] 7640 (ACKED) db1211 (paged)/MariaDB Replica Lag: x3 (paged) [10:07:54] 7641 (ACKED) db1224 (paged)/MariaDB Replica Lag: x1 (paged) [10:07:54] PROBLEM - MariaDB Replica Lag: s8 #page on db2164 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:55] 7642 (ACKED) db1210 (paged)/MariaDB Replica Lag: s5 (paged) [10:07:55] PROBLEM - MariaDB Replica Lag: s8 #page on db2154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:56] PROBLEM - MariaDB Replica Lag: s8 #page on db2163 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:57] PROBLEM - MariaDB Replica Lag: s8 #page on db2161 is 
CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:01] PROBLEM - MariaDB Replica Lag: x1 #page on db2186 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.71 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:10] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 673.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:10] PROBLEM - MariaDB Replica Lag: s1 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:14] PROBLEM - MariaDB Replica Lag: x3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 675.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:14] PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:14] PROBLEM - MariaDB Replica Lag: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 679.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:14] PROBLEM - MariaDB Replica Lag: s7 on dbstore1008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:14] PROBLEM - MariaDB Replica Lag: x3 on clouddb1023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 675.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:15] PROBLEM - MariaDB Replica Lag: x3 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 675.57 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:15] PROBLEM - MariaDB Replica Lag: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:16] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 678.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:16] PROBLEM - MariaDB Replica Lag: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:17] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 677.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:17] PROBLEM - MariaDB Replica Lag: x3 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 675.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:18] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.68 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:18] PROBLEM - MariaDB Replica Lag: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.68 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:19] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:19] PROBLEM - MariaDB Replica Lag: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 679.72 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:20] PROBLEM - 
MariaDB Replica Lag: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 677.74 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:20] PROBLEM - MariaDB Replica Lag: s3 on clouddb1023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:21] PROBLEM - MariaDB Replica Lag: s2 #page on db2175 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 646.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:21] PROBLEM - MariaDB Replica Lag: s1 #page on db2173 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:22] PROBLEM - MariaDB Replica Lag: s7 #page on db2182 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:22] PROBLEM - MariaDB Replica Lag: s5 #page on db2178 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:23] PROBLEM - MariaDB Replica Lag: s5 #page on db2192 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:23] PROBLEM - MariaDB Replica Lag: s3 #page on db2194 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:25] PROBLEM - MariaDB Replica Lag: s3 #page on db2205 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 652.77 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:25] PROBLEM - MariaDB Replica Lag: s1 #page on db2203 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 650.88 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:26] PROBLEM - MariaDB Replica Lag: s5 #page on db2211 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 650.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:27] !ack [10:08:27] PROBLEM - MariaDB Replica Lag: s7 #page on db2208 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 647.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:28] 7643 (ACKED) db1220 (paged)/MariaDB Replica Lag: x1 (paged) [10:08:28] 7644 (ACKED) db2152 (paged)/MariaDB Replica Lag: s8 (paged) [10:08:28] 7645 (ACKED) db2161 (paged)/MariaDB Replica Lag: s8 (paged) [10:08:28] 7646 (ACKED) db2154 (paged)/MariaDB Replica Lag: s8 (paged) [10:08:28] 7647 (ACKED) db2164 (paged)/MariaDB Replica Lag: s8 (paged) [10:08:29] 7648 (ACKED) db2163 (paged)/MariaDB Replica Lag: s8 (paged) [10:08:29] PROBLEM - MariaDB Replica Lag: s1 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:29] 7649 (ACKED) db2186 (paged)/MariaDB Replica Lag: x1 (paged) [10:08:29] PROBLEM - MariaDB Replica Lag: s3 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 657.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:29] 7650 (ACKED) db2175 (paged)/MariaDB Replica Lag: s2 (paged) [10:08:30] PROBLEM - MariaDB Replica Lag: s4 #page on db1160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 672.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:30] 7651 (ACKED) db2173 (paged)/MariaDB Replica Lag: s1 (paged) [10:08:30] 7652 (ACKED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [10:08:30] PROBLEM - MariaDB Replica Lag: s3 #page on db1189 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 658.06 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:31] 7653 (ACKED) db2178 (paged)/MariaDB Replica Lag: s5 (paged) [10:08:31] 7654 (ACKED) db2192 (paged)/MariaDB Replica Lag: s5 (paged) [10:08:31] PROBLEM - MariaDB Replica Lag: s7 #page on db1181 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 653.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:32] 7655 (ACKED) db2194 (paged)/MariaDB Replica Lag: s3 (paged) [10:08:41] PROBLEM - MariaDB Replica Lag: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 670.72 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:41] PROBLEM - MariaDB Replica Lag: s3 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 669.81 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:43] PROBLEM - MariaDB Replica Lag: s4 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 685.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:50] PROBLEM - MariaDB Replica Lag: s5 #page on db1230 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:50] PROBLEM - MariaDB Replica Lag: s2 #page on db1222 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 679.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:51] PROBLEM - MariaDB Replica Lag: x1 #page on db1237 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 688.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:51] PROBLEM - MariaDB Replica Lag: s1 on dbstore1008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:51] PROBLEM - 
MariaDB Replica Lag: s5 on dbstore1008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:51] PROBLEM - MariaDB Replica Lag: x1 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 688.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:56] PROBLEM - MariaDB Replica Lag: s1 #page on db2146 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:57] PROBLEM - MariaDB Replica Lag: s7 #page on db2150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 678.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:57] PROBLEM - MariaDB Replica Lag: s3 #page on db2149 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 683.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:58] PROBLEM - MariaDB Replica Lag: s2 #page on db2148 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 684.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:59] PROBLEM - MariaDB Replica Lag: s1 #page on db2145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:59] PROBLEM - MariaDB Replica Lag: s1 #page on db2153 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:00] PROBLEM - MariaDB Replica Lag: s5 #page on db2157 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:01] PROBLEM - MariaDB Replica Lag: s7 #page on db2159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 679.23 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:02] PROBLEM - MariaDB Replica Lag: s1 #page on db2176 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:03] PROBLEM - MariaDB Replica Lag: s5 #page on db2171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:03] PROBLEM - MariaDB Replica Lag: s1 #page on db2174 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:04] PROBLEM - MariaDB Replica Lag: s7 #page on db2168 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 679.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:05] PROBLEM - MariaDB Replica Lag: s1 #page on db2170 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 682.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:06] PROBLEM - MariaDB Replica Lag: s3 #page on db2177 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 684.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:06] PROBLEM - MariaDB Replica Lag: s3 #page on db2156 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 684.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:11] PROBLEM - MariaDB Replica Lag: s1 #page on db2188 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 695.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:12] PROBLEM - MariaDB Replica Lag: s2 #page on db2189 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 698.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:13] 
PROBLEM - MariaDB Replica Lag: s3 #page on db2190 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 697.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:20] PROBLEM - MariaDB Replica Lag: s7 #page on db2218 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 701.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:31] PROBLEM - MariaDB Replica Lag: s6 #page on db2180 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.68 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:32] PROBLEM - MariaDB Replica Lag: s6 #page on db2193 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 624.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:35] PROBLEM - MariaDB Replica Lag: s6 #page on db2217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:37] PROBLEM - MariaDB Replica Lag: s6 #page on db1173 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:49] PROBLEM - MariaDB Replica Lag: s6 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:53] PROBLEM - MariaDB Replica Lag: x3 #page on db1255 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:54] PROBLEM - MariaDB Replica Lag: es7 #page on es1035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 633.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:55] PROBLEM - MariaDB Replica Lag: es6 #page on es1038 is CRITICAL: CRITICAL slave_sql_lag Replication 
lag: 634.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:09:58] PROBLEM - MariaDB Replica Lag: s6 #page on db2151 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 650.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:00] PROBLEM - MariaDB Replica Lag: s6 #page on db2169 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 651.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:00] PROBLEM - MariaDB Replica Lag: x3 #page on db2162 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:01] PROBLEM - MariaDB Replica Lag: s6 #page on db2158 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 651.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:05] !ack [10:10:06] 7694 (ACKED) db2151 (paged)/MariaDB Replica Lag: s6 (paged) [10:10:08] PROBLEM - MariaDB Replica Lag: s6 #page on db2224 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 660.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:18] PROBLEM - MariaDB Replica Lag: s6 #page on db2214 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 671.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:28] PROBLEM - MariaDB Replica Lag: es6 #page on es1037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 670.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:28] PROBLEM - MariaDB Replica Lag: es6 #page on es1036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 671.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:33] PROBLEM - MariaDB Replica Lag: x3 #page on db2187 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 680.05 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:10:40] RECOVERY - MariaDB Replica SQL: s4 #page on db1160 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:07] !ack [10:11:08] no value provided for parameter incident and no default available [10:11:08] All incidents are already acked. [10:11:10] SRE, sre-alert-triage, Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667148 (Samwalton9-WMF) [10:11:12] RECOVERY - MariaDB Replica Lag: x1 #page on db2186 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:13] <_joe_> godog: I'm doing that [10:11:17] <_joe_> now it's recoveries [10:11:18] _joe_: thank you [10:11:31] RECOVERY - MariaDB Replica SQL: s1 on db2141 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:32] RECOVERY - MariaDB Replica SQL: s1 #page on db2176 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:32] RECOVERY - MariaDB Replica SQL: s1 #page on db2174 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:33] RECOVERY - MariaDB Replica SQL: x1 #page on db2186 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:35] RECOVERY - MariaDB Replica SQL: s1 #page on db2188 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:40] RECOVERY - MariaDB Replica SQL: s1 #page on db2203 is OK: OK slave_sql_state Slave_SQL_Running: Yes
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:42] RECOVERY - MariaDB Replica Lag: s4 #page on db1160 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:43] RECOVERY - MariaDB Replica SQL: s2 #page on db1188 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:44] RECOVERY - MariaDB Replica SQL: s2 #page on db1197 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:45] RECOVERY - MariaDB Replica Lag: x1 #page on db1179 is OK: OK slave_sql_lag Replication lag: 42.59 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:46] RECOVERY - MariaDB Replica Lag: s2 #page on db1188 is OK: OK slave_sql_lag Replication lag: 0.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:46] RECOVERY - MariaDB Replica Lag: s2 #page on db1197 is OK: OK slave_sql_lag Replication lag: 0.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:46] RECOVERY - MariaDB Replica Lag: s4 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:48] RECOVERY - MariaDB Replica Lag: x1 #page on db1224 is OK: OK slave_sql_lag Replication lag: 48.54 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:49] RECOVERY - MariaDB Replica Lag: x1 #page on db1220 is OK: OK slave_sql_lag Replication lag: 48.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:50] RECOVERY - MariaDB Replica SQL: s2 #page on db1222 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:53] RECOVERY - MariaDB Replica SQL: x1 #page on db1237 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:54] RECOVERY - MariaDB Replica Lag: x1 #page on db1237 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:54] RECOVERY - MariaDB Replica Lag: x1 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:59] RECOVERY - MariaDB Replica Lag: s1 #page on db2146 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:11:59] RECOVERY - MariaDB Replica Lag: s2 #page on db2148 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:00] RECOVERY - MariaDB Replica SQL: s2 #page on db2148 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:01] RECOVERY - MariaDB Replica SQL: s1 #page on db2145 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:02] RECOVERY - MariaDB Replica Lag: s1 #page on db2145 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:03] RECOVERY - MariaDB Replica SQL: s1 #page on db2170 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:03] RECOVERY - MariaDB Replica Lag: s1 #page on db2153 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:04] RECOVERY - MariaDB Replica SQL: s1 #page on db2146 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:05] RECOVERY - MariaDB Replica SQL: s3 #page on db2149 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:05] RECOVERY - MariaDB Replica SQL: s1 #page on db2153 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:06] RECOVERY - MariaDB Replica SQL: s3 #page on db2156 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:07] RECOVERY - MariaDB Replica Lag: s1 #page on db2176 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:08] RECOVERY - MariaDB Replica Lag: s1 #page on db2174 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:08] RECOVERY - MariaDB Replica Lag: s1 #page on db2170 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:09] RECOVERY - MariaDB Replica SQL: s1 #page on db2173 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:11] RECOVERY - MariaDB Replica SQL: s3 #page on db2205 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:12] RECOVERY - MariaDB Replica SQL: s2 #page on db2189 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:13] RECOVERY - MariaDB Replica Lag: s1 #page on db2188 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:14] RECOVERY - MariaDB Replica SQL: s2 on db2197 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:14] RECOVERY - MariaDB Replica Lag: s2 #page on db2189 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:15] RECOVERY - MariaDB Replica Lag: s3 #page on db2190 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:22] RECOVERY - MariaDB Replica SQL: s6 #page on db2214 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:25] RECOVERY - MariaDB Replica Lag: s1 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:31] RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:31] RECOVERY - MariaDB Replica Lag: s3 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:31] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:31] RECOVERY - MariaDB Replica Lag: s3 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.23 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:32] RECOVERY - MariaDB Replica SQL: s7 #page on db2208 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:32] RECOVERY - MariaDB Replica Lag: s3 on clouddb1023 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:33] RECOVERY - MariaDB Replica SQL: s7 on db2200 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:33] RECOVERY - MariaDB Replica SQL: s5 on db2201 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:36] RECOVERY - MariaDB Replica Lag: s2 #page on db2175 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:36] RECOVERY - MariaDB Replica SQL: s5 #page on db2178 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:37] RECOVERY - MariaDB Replica SQL: s7 #page on db2182 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:38] RECOVERY - MariaDB Replica SQL: s3 #page on db2177 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:38] RECOVERY - MariaDB Replica SQL: s5 #page on db2211 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:39] RECOVERY - MariaDB Replica Lag: s6 #page on db2180 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:40] RECOVERY - MariaDB Replica Lag: s5 #page on db2178 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:40] RECOVERY - MariaDB Replica SQL: s2 #page on db2175 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:41] RECOVERY - MariaDB Replica Lag: s1 #page on db2173 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:42] RECOVERY - MariaDB Replica SQL: s6 #page on db2217 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:43] RECOVERY - MariaDB Replica SQL: s6 #page on db2180 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:43] RECOVERY - MariaDB Replica SQL: s6 #page on db2193 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:44] RECOVERY - MariaDB Replica SQL: s3 #page on db2190 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:45] RECOVERY - MariaDB Replica Lag: s5 #page on db2192 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:46] RECOVERY - MariaDB Replica SQL: s5 #page on db2192 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:46] RECOVERY - MariaDB Replica Lag: s3 #page on db2194 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:47] RECOVERY - MariaDB Replica SQL: s3 #page on db2194 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:48] RECOVERY - MariaDB Replica Lag: s6 #page on db2193 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:50] RECOVERY - MariaDB Replica SQL: s7 on db2198 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:50] RECOVERY - MariaDB Replica SQL: s6 on db2197 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:53] RECOVERY - MariaDB Replica Lag: s3 #page on db2205 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:53] RECOVERY - MariaDB Replica Lag: s1 #page on db2203 is OK: OK slave_sql_lag Replication lag: 0.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:54] RECOVERY - MariaDB Replica Lag: s5 #page on db2211 is OK: OK slave_sql_lag Replication lag: 0.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:55] RECOVERY - MariaDB Replica Lag: s6 #page on db2217 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:56] RECOVERY - MariaDB Replica Lag: s7 #page on db2208 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:56] RECOVERY - MariaDB Replica Lag: s1 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:56] RECOVERY - MariaDB Replica Lag: s3 on db1154 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:57] RECOVERY - MariaDB Replica SQL: s7 #page on db1181 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:57] RECOVERY - MariaDB Replica SQL: s6 #page on db1173 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:58] RECOVERY - MariaDB Replica Lag: s6 #page on db1173 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:12:59] RECOVERY - MariaDB Replica Lag: s3 #page on db1175 is OK: OK slave_sql_lag Replication lag: 59.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:00] RECOVERY - MariaDB Replica Lag: s7 #page on db1158 is OK: OK slave_sql_lag Replication lag: 20.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:00] RECOVERY - MariaDB Replica Lag: s8 #page on db1172 is OK: OK slave_sql_lag Replication lag: 3.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:01] RECOVERY - MariaDB Replica SQL: s3 #page on db1189 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:02] RECOVERY - MariaDB Replica Lag: s6 #page on db1165 is OK: OK slave_sql_lag Replication lag: 31.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:03] RECOVERY - MariaDB Replica Lag: s3 #page on db1189 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:03] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [10:13:04] RECOVERY - MariaDB Replica Lag: s5 #page on db1161 is OK: OK slave_sql_lag Replication lag: 56.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:05] RECOVERY - MariaDB Replica Lag: s7 #page on db1181 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:06] RECOVERY - MariaDB Replica Lag: s7 #page on db1170 is OK: OK slave_sql_lag Replication lag: 31.74 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:07] RECOVERY - MariaDB Replica Lag: s8 #page on db1177 is OK: OK slave_sql_lag Replication lag: 14.78 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:07] RECOVERY - MariaDB Replica Lag: s7 #page on db1174 is OK: OK slave_sql_lag Replication lag: 31.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:08] RECOVERY - MariaDB Replica Lag: s8 #page on db1193 is OK: OK slave_sql_lag Replication lag: 14.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:09] RECOVERY - MariaDB Replica Lag: s5 #page on db1159 is OK: OK slave_sql_lag Replication lag: 57.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:09] RECOVERY - MariaDB Replica Lag: s5 #page on db1185 is OK: OK slave_sql_lag Replication lag: 57.04 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:10] RECOVERY - MariaDB Replica Lag: s8 #page on db1167 is OK: OK slave_sql_lag Replication lag: 15.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:11] RECOVERY - MariaDB Replica Lag: s6 #page on db1168 is OK: OK slave_sql_lag Replication lag: 43.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:12] RECOVERY - MariaDB Replica Lag: s8 #page on db1192 is OK: OK slave_sql_lag Replication lag: 15.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:12] RECOVERY - MariaDB Replica Lag: s2 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:12] RECOVERY - MariaDB Replica Lag: s3 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:13] RECOVERY - MariaDB Replica Lag: s5 #page on db1200 is OK: OK slave_sql_lag Replication lag: 57.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:13] RECOVERY - MariaDB Replica Lag: s8 #page on db1203 is OK: OK slave_sql_lag Replication lag: 15.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:16] RECOVERY - MariaDB Replica Lag: s5 #page on db1207 is OK: OK slave_sql_lag Replication lag: 54.84 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:17] RECOVERY - MariaDB Replica SQL: s8 #page on db1209 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:18] RECOVERY - MariaDB Replica SQL: s5 #page on db1207 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:19] RECOVERY - MariaDB Replica Lag: s8 #page on db1209 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:19] RECOVERY - MariaDB Replica SQL: s5 #page on db1210 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:20] RECOVERY - MariaDB Replica Lag: s5 #page on db1210 is OK: OK slave_sql_lag Replication lag: 58.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:20] RECOVERY - MariaDB Replica SQL: s5 on db1216 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:21] RECOVERY - MariaDB Replica Lag: x3 #page on db1211 is OK: OK slave_sql_lag Replication lag: 15.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:22] RECOVERY - MariaDB Replica SQL: s5 #page on db1230 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:22] RECOVERY - MariaDB Replica Lag: s6 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:22] RECOVERY - MariaDB Replica Lag: s5 #page on db1230 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:23] RECOVERY - MariaDB Replica Lag: s2 #page on db1222 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:23] RECOVERY - MariaDB Replica Lag: s1 on dbstore1008 is OK: OK slave_sql_lag Replication lag: 0.08 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:23] RECOVERY - MariaDB Replica Lag: s5 on dbstore1008 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:23] RECOVERY - MariaDB Replica Lag: s8 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:24] RECOVERY - MariaDB Replica SQL: x3 #page on db1255 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:25] RECOVERY - MariaDB Replica Lag: x3 #page on db1255 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:29] RECOVERY - MariaDB Replica Lag: s8 #page on db2152 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:29] RECOVERY - MariaDB Replica Lag: s7 #page on db2150 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:30] RECOVERY - MariaDB Replica Lag: s6 #page on db2151 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:31] RECOVERY - MariaDB Replica SQL: s7 #page on db2150 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:32] RECOVERY - MariaDB Replica Lag: s3 #page on db2149 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:32] RECOVERY - MariaDB Replica SQL: s7 #page on db2159 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:33] RECOVERY - MariaDB Replica SQL: s5 #page on db2157 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:34] RECOVERY - MariaDB Replica SQL: s8 #page on db2154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:35] RECOVERY - MariaDB Replica SQL: s6 #page on db2158 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:35] SRE, Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667150 (Peachey88) [10:13:35] RECOVERY - MariaDB Replica SQL: s8 #page on db2152 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:36] RECOVERY - MariaDB Replica Lag: s5 #page on db2157 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:37] RECOVERY - MariaDB Replica SQL: s6 #page on db2151 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:38] RECOVERY - MariaDB Replica SQL: x3 #page on db2162 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:39] RECOVERY - MariaDB Replica SQL: s5 #page on db2171 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:39] RECOVERY - MariaDB Replica Lag: s8 #page on db2154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:40] RECOVERY - MariaDB 
Replica Lag: s7 #page on db2159 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:41] RECOVERY - MariaDB Replica Lag: s8 #page on db2164 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:41] RECOVERY - MariaDB Replica Lag: s8 #page on db2163 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:42] RECOVERY - MariaDB Replica SQL: s8 #page on db2161 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:43] RECOVERY - MariaDB Replica Lag: s6 #page on db2169 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:44] RECOVERY - MariaDB Replica SQL: s6 #page on db2169 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:44] RECOVERY - MariaDB Replica Lag: s3 #page on db2156 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:45] RECOVERY - MariaDB Replica Lag: s3 #page on db2177 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:46] RECOVERY - MariaDB Replica Lag: s8 #page on db2161 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:46] RECOVERY - MariaDB Replica SQL: s8 #page on db2164 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:47] RECOVERY - MariaDB Replica SQL: s8 #page on db2163 is 
OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:48] RECOVERY - MariaDB Replica SQL: s7 #page on db2168 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:49] RECOVERY - MariaDB Replica Lag: s7 #page on db2168 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:49] !ack [10:13:49] RECOVERY - MariaDB Replica Lag: s5 #page on db2171 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:50] no value provided for parameter incident and no default available [10:13:50] All incidents are already acked. [10:13:50] RECOVERY - MariaDB Replica Lag: x3 #page on db2162 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:51] RECOVERY - MariaDB Replica Lag: s6 #page on db2158 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:53] RECOVERY - MariaDB Replica Lag: s6 #page on db2224 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:54] RECOVERY - MariaDB Replica SQL: s6 #page on db2224 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:02] RECOVERY - MariaDB Replica Lag: s7 #page on db2218 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:02] RECOVERY - MariaDB Replica Lag: s6 #page on db2214 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:03] RECOVERY - MariaDB Replica SQL: s7 #page on db2218 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:11] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:11] RECOVERY - MariaDB Replica Lag: s7 on dbstore1008 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:11] RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:15] RECOVERY - MariaDB Replica Lag: s7 #page on db2182 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:19] RECOVERY - MariaDB Replica SQL: x3 on db2200 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:19] RECOVERY - MariaDB Replica Lag: s5 on db1154 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:20] RECOVERY - MariaDB Replica SQL: s3 #page on db1175 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:20] RECOVERY - MariaDB Replica SQL: s5 #page on db1161 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:21] RECOVERY - MariaDB Replica SQL: s3 #page on db1157 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:22] RECOVERY - MariaDB Replica Lag: s3 #page on db1166 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:22] RECOVERY - MariaDB Replica SQL: s6 #page on db1168 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:23] RECOVERY - MariaDB Replica SQL: s5 #page on db1185 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:24] RECOVERY - MariaDB Replica Lag: s3 #page on db1157 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:26] RECOVERY - MariaDB Replica SQL: s6 #page on db1165 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:27] RECOVERY - MariaDB Replica SQL: s5 #page on db1159 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:28] RECOVERY - MariaDB Replica SQL: s3 #page on db1166 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:29] RECOVERY - MariaDB Replica SQL: s5 #page on db1200 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:35] SRE, Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667158 (fgiunchedi) Related to ongoing incident https://www.wikimediastatus.net/incidents/ncw3k9b4ynz6 [10:15:03] SRE, Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667165 (mszwarc) I can 
see recent changes from the last 10-20 minutes on plwiki: https://pl.wikipedia.org/w/index.php?hidebots=1&hidecategorization=1&hideWikibase=1&limit=500&days=30&enhanced=1&title=Spec... [10:15:12] RECOVERY - MariaDB Replica SQL: s8 #page on db1178 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:13] RECOVERY - MariaDB Replica SQL: s8 #page on db1177 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:21] RECOVERY - MariaDB Replica Lag: s8 on db1154 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:21] RECOVERY - MariaDB Replica SQL: s8 on db1171 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:22] RECOVERY - MariaDB Replica Lag: s8 #page on db1178 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:22] RECOVERY - MariaDB Replica SQL: s8 #page on db1172 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:25] PROBLEM - MariaDB Replica SQL: s2 #page on db1188 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:26] RECOVERY - MariaDB Replica SQL: s8 #page on db1192 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:27] RECOVERY - MariaDB Replica SQL: s8 #page on db1193 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:28] PROBLEM - MariaDB Replica SQL: s2 #page on db1197 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1396, Errmsg: Error Operation DROP USER failed for orchestrator@208.80.155.103 on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:28] RECOVERY - MariaDB Replica SQL: s8 #page on db1167 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:29] RECOVERY - MariaDB Replica SQL: x1 #page on db1179 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:30] RECOVERY - MariaDB Replica SQL: s8 #page on db1203 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:33] RECOVERY - MariaDB Replica SQL: x1 #page on db1224 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:33] RECOVERY - MariaDB Replica SQL: x1 on db1216 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:33] RECOVERY - MariaDB Replica SQL: x1 on db1225 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:33] RECOVERY - MariaDB 
Replica SQL: x1 #page on db1220 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:38] !ack [10:15:39] 7701 (ACKED) db1188 (paged)/MariaDB Replica SQL: s2 (paged) [10:15:39] 7702 (ACKED) db1197 (paged)/MariaDB Replica SQL: s2 (paged) [10:15:50] 06SRE, 07Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667171 (10Samwalton9-WMF) p:05Unbreak!→03High I'm seeing edits again, the one that I made that disappeared has re-appeared and RecentChanges is moving again. [10:15:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:16:12] RECOVERY - MariaDB Replica Lag: s8 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:16:12] RECOVERY - MariaDB Replica Lag: s8 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:16:37] RECOVERY - MariaDB Replica Lag: es7 #page on es1035 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:16:37] RECOVERY - MariaDB Replica SQL: es7 #page on es1035 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:16:38] RECOVERY - MariaDB Replica Lag: es6 #page on es1038 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[10:17:01] RECOVERY - MariaDB Replica SQL: es6 #page on es1038 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:17:13] RECOVERY - MariaDB Replica Lag: es6 #page on es1037 is OK: OK slave_sql_lag Replication lag: 53.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:17:13] RECOVERY - MariaDB Replica Lag: es6 #page on es1036 is OK: OK slave_sql_lag Replication lag: 53.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:17:23] PROBLEM - MariaDB Replica Lag: s2 #page on db1162 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:17:23] PROBLEM - MariaDB Replica Lag: s2 #page on db1156 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:17:24] PROBLEM - MariaDB Replica Lag: s2 #page on db1182 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:17:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T418465)', diff saved to https://phabricator.wikimedia.org/P89672 and previous config saved to /var/cache/conftool/dbconfig/20260303-101747-marostegui.json [10:17:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1238.eqiad.wmnet with reason: Maintenance [10:17:55] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:18:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T418465)', diff saved to https://phabricator.wikimedia.org/P89673 and previous 
config saved to /var/cache/conftool/dbconfig/20260303-101800-marostegui.json [10:18:03] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [10:18:12] RECOVERY - MariaDB Replica Lag: x3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:12] RECOVERY - MariaDB Replica Lag: x3 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:12] RECOVERY - MariaDB Replica Lag: x3 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:12] RECOVERY - MariaDB Replica Lag: x3 on clouddb1023 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:17] RECOVERY - MariaDB Replica SQL: x3 #page on db2187 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:17] RECOVERY - MariaDB Replica Lag: x3 #page on db2187 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:22] RECOVERY - MariaDB Replica Lag: x3 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:23] RECOVERY - MariaDB Replica SQL: s7 #page on db1158 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:32] 
RECOVERY - MariaDB Replica SQL: x3 on db1216 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:33] RECOVERY - MariaDB Replica SQL: x3 #page on db1211 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:18:45] (03PS1) 10Santiago Faci: TestKitchen renaming (MetricsPlatform => TestKitchen) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247547 (https://phabricator.wikimedia.org/T416865) [10:18:45] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:18:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [10:19:12] RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:19:12] RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:19:22] RECOVERY - MariaDB Replica SQL: s7 on db1171 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:19:23] RECOVERY - MariaDB Replica SQL: s7 #page on db1170 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:19:24] RECOVERY - MariaDB Replica SQL: s7 #page on db1174 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:19:29] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:20:01] RECOVERY - MariaDB Replica SQL: es6 #page on es1036 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:20:01] RECOVERY - MariaDB Replica SQL: es6 #page on es1037 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:20:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T418465)', diff saved to https://phabricator.wikimedia.org/P89674 and previous config saved to /var/cache/conftool/dbconfig/20260303-102004-marostegui.json [10:20:18] (03PS1) 10Brouberol: dse-k8s: increase mem/cpu quotas for the opensearch-semantic-search ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247548 (https://phabricator.wikimedia.org/T414703) [10:20:32] RECOVERY - MariaDB Replica SQL: m3 on db1217 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:20:32] PROBLEM - MariaDB Replica Lag: m3 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 589.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:20:34] RECOVERY - MariaDB Replica SQL: m2 on db1217 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:20:46] (03PS1) 10Btullis: Add x1 section to an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1247549 (https://phabricator.wikimedia.org/T407485) [10:21:06] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:11] (03CR) 10Btullis: [C:03+1] "Nice. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247548 (https://phabricator.wikimedia.org/T414703) (owner: 10Brouberol) [10:21:12] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:13] RECOVERY - MariaDB Replica SQL: s2 #page on db1182 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:25] RECOVERY - MariaDB Replica SQL: s2 #page on db1162 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:25] RECOVERY - MariaDB Replica Lag: s2 #page on db1156 is OK: OK slave_sql_lag Replication lag: 0.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:26] RECOVERY - MariaDB Replica Lag: s2 #page on db1162 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:27] RECOVERY - MariaDB Replica Lag: s2 #page on db1182 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:30] RECOVERY - MariaDB Replica SQL: s2 #page on db1188 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[10:21:30] RECOVERY - MariaDB Replica SQL: s2 #page on db1156 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:31] RECOVERY - MariaDB Replica SQL: s2 #page on db1197 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:33] RECOVERY - MariaDB Replica Lag: m3 on db1217 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:45] RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:23:46] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8181/co" [puppet] - 10https://gerrit.wikimedia.org/r/1247549 (https://phabricator.wikimedia.org/T407485) (owner: 10Btullis) [10:24:05] (03CR) 10Slyngshede: [C:03+2] Release version 0.1.16 [software/bitu] - 10https://gerrit.wikimedia.org/r/1247529 (owner: 10Slyngshede) [10:25:10] (03CR) 10Btullis: Add x1 section to an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1247549 (https://phabricator.wikimedia.org/T407485) (owner: 10Btullis) [10:25:12] (03CR) 10Jelto: [C:03+2] admin: add milimetric to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/1247537 (https://phabricator.wikimedia.org/T417906) (owner: 10Jelto) [10:25:45] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[10:26:38] (Merged) jenkins-bot: Release version 0.1.16 [software/bitu] - https://gerrit.wikimedia.org/r/1247529 (owner: Slyngshede) [10:28:01] (PS1) Santiago Faci: TestKitchen remaing (MetricsPlatform => TestKitchen) [puppet] - https://gerrit.wikimedia.org/r/1247551 (https://phabricator.wikimedia.org/T416865) [10:29:25] SRE, Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667246 (fgiunchedi) Indeed we should be back, incident has been resolved. Please confirm and resolve the task as needed [10:29:36] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, and 2 others: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11667247 (Jelto) Open→Resolved a:Jelto The access should be available in roughly 30 minut... [10:29:46] (CR) Brouberol: [C:+1] Add x1 section to an-redacteddb1001 [puppet] - https://gerrit.wikimedia.org/r/1247549 (https://phabricator.wikimedia.org/T407485) (owner: Btullis) [10:30:53] (CR) Brouberol: [C:+2] dse-k8s: increase mem/cpu quotas for the opensearch-semantic-search ns [deployment-charts] - https://gerrit.wikimedia.org/r/1247548 (https://phabricator.wikimedia.org/T414703) (owner: Brouberol) [10:31:54] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:32:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:33:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:33:54] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [10:34:35] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[10:35:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P89675 and previous config saved to /var/cache/conftool/dbconfig/20260303-103512-marostegui.json [10:36:36] (03PS1) 10JMeybohm: BGPPeers: Add missing lsw1-f8-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247552 (https://phabricator.wikimedia.org/T418259) [10:37:16] 06SRE, 07Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667280 (10Ladsgroup) >>! In T418839#11667171, @Samwalton9-WMF wrote: > I'm seeing edits again, the one that I made that disappeared has re-appeared and RecentChanges is moving again. these are usually beca... [10:38:01] 06SRE, 07Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667283 (10jcrespo) For context, replication broke on databases, edits were not lost during the incident, but it took an abnormal number of minutes to appear as applied everywhere. Sorry for the disruption,... 
[10:38:54] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:38:58] 06SRE, 06DBA, 07Wikimedia-Incident: Edits aren't saving correctly - https://phabricator.wikimedia.org/T418839#11667285 (10Ladsgroup) 05Open→03Resolved a:03Marostegui [10:39:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T418465)', diff saved to https://phabricator.wikimedia.org/P89676 and previous config saved to /var/cache/conftool/dbconfig/20260303-103947-marostegui.json [10:39:51] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [10:41:20] !log installing Django security updates [10:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:22] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247515 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [10:42:46] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1247515 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [10:43:35] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[10:44:02] (03CR) 10Fabfur: [C:03+2] hiera: set haproxy version to 3.0 on eqsin cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1247515 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [10:45:30] !log start upgrading haproxy to 3.0 on A:cp-eqsin (T417253) [10:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:33] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [10:46:02] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1170.eqiad.wmnet [10:46:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667312 (10ops-monitoring-bot) Host an-worker1170.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:47:35] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:50:11] (03CR) 10Btullis: [C:03+2] Add x1 section to an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1247549 (https://phabricator.wikimedia.org/T407485) (owner: 10Btullis) [10:50:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P89677 and previous config saved to /var/cache/conftool/dbconfig/20260303-105020-marostegui.json [10:50:53] (03CR) 10Kevin Bazira: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. 
(2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis) [10:51:28] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - 3.0 upgrade () [10:51:38] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp - 3.0 upgrade () [10:52:08] (PS1) Slyngshede: IDM: Failover to updated host [dns] - https://gerrit.wikimedia.org/r/1247557 [10:52:59] (CR) Hashar: [C:+2] wm-checks-api: add tag for Selenium jobs [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1247540 (owner: Hashar) [10:53:53] (Merged) jenkins-bot: wm-checks-api: add tag for Selenium jobs [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1247540 (owner: Hashar) [10:54:02] (CR) JMeybohm: [C:+2] BGPPeers: Add missing lsw1-f8-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1247552 (https://phabricator.wikimedia.org/T418259) (owner: JMeybohm) [10:54:13] !log hashar@deploy2002 Started deploy [gerrit/gerrit@12177b1]: wm-checks-api: add tag for Selenium jobs [10:54:19] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1247557 (owner: Slyngshede) [10:54:26] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@12177b1]: wm-checks-api: add tag for Selenium jobs (duration: 00m 13s) [10:54:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P89678 and previous config saved to /var/cache/conftool/dbconfig/20260303-105455-marostegui.json [10:55:28] (PS3) Gkyziridis: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration.
[deployment-charts] - https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) [10:55:36] (CR) Gkyziridis: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis) [10:55:42] (CR) Slyngshede: [C:+2] IDM: Failover to updated host [dns] - https://gerrit.wikimedia.org/r/1247557 (owner: Slyngshede) [10:55:59] !log slyngshede@dns1004 START - running authdns-update [10:56:53] (CR) Hashar: [C:+2] "Deployed. I have verified on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1224686 that `wikibase-selenium` job is tagg" [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1247540 (owner: Hashar) [10:57:11] !log slyngshede@dns1004 END - running authdns-update [10:57:55] (CR) Kevin Bazira: changeprop: Add revertrisk-multilingual model ti changeprop staging configuration. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis) [10:58:52] (PS4) Gkyziridis: changeprop: Add revertrisk-multilingual model to changeprop staging configuration. [deployment-charts] - https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) [10:59:08] (CR) Gkyziridis: changeprop: Add revertrisk-multilingual model to changeprop staging configuration.
(1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis) [10:59:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1170.eqiad.wmnet [10:59:41] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1171.eqiad.wmnet [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1100) [11:00:06] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667363 (ops-monitoring-bot) Host an-worker1171.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [11:02:13] (Merged) jenkins-bot: BGPPeers: Add missing lsw1-f8-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1247552 (https://phabricator.wikimedia.org/T418259) (owner: JMeybohm) [11:05:03] (PS1) Ladsgroup: Enable thumb steps on private wikis too [mediawiki-config] - https://gerrit.wikimedia.org/r/1247559 (https://phabricator.wikimedia.org/T414805) [11:05:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T418465)', diff saved to https://phabricator.wikimedia.org/P89679 and previous config saved to /var/cache/conftool/dbconfig/20260303-110527-marostegui.json [11:05:31] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:05:36] (PS1) JMeybohm: BGPPeers: Add comment for eqiad E4 [deployment-charts] - https://gerrit.wikimedia.org/r/1247561 (https://phabricator.wikimedia.org/T418259) [11:05:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2210.codfw.wmnet with reason:
Maintenance [11:05:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T418465)', diff saved to https://phabricator.wikimedia.org/P89680 and previous config saved to /var/cache/conftool/dbconfig/20260303-110551-marostegui.json [11:06:39] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:07:12] (03PS5) 10Gkyziridis: changeprop: Add revertrisk-multilingual model to changeprop staging configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) [11:07:22] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:07:34] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:08:13] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:08:41] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:09:21] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:10:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P89681 and previous config saved to /var/cache/conftool/dbconfig/20260303-111003-marostegui.json [11:10:46] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1247530 (https://phabricator.wikimedia.org/T418108) (owner: 10Arnaudb) [11:11:27] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1247531 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [11:11:55] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:12:45] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:13:18] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:13:23] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[11:13:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1171.eqiad.wmnet [11:13:44] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1172.eqiad.wmnet [11:13:48] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:13:52] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:14:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667441 (10ops-monitoring-bot) Host an-worker1172.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [11:14:13] FIRING: [2x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:14:35] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:14:39] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:15:24] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:15:29] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:15:47] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:15:51] !log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[11:16:43] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1350-1351].eqiad.wmnet [11:16:45] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1350-1351].eqiad.wmnet [11:16:53] (03CR) 10JMeybohm: [C:03+2] BGPPeers: Add comment for eqiad E4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247561 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [11:17:09] !log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:17:12] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:18:31] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:18:34] !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [11:21:28] !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:21:29] (03CR) 10Hnowlan: [C:03+1] rotate large (>50G/day) logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1245514 (https://phabricator.wikimedia.org/T418612) (owner: 10Herron) [11:22:02] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:23:51] (03CR) 10Kevin Bazira: changeprop: Add revertrisk-multilingual model to changeprop staging configuration. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [11:23:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:24:47] (03Merged) 10jenkins-bot: BGPPeers: Add comment for eqiad E4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247561 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [11:25:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T418465)', diff saved to https://phabricator.wikimedia.org/P89682 and previous config saved to /var/cache/conftool/dbconfig/20260303-112511-marostegui.json [11:25:15] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:25:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1241.eqiad.wmnet with reason: Maintenance [11:25:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T418465)', diff saved to https://phabricator.wikimedia.org/P89683 and previous config saved to /var/cache/conftool/dbconfig/20260303-112535-marostegui.json [11:26:31] (03PS1) 10Michael Große: Enable new HTML confirmation emails for all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247566 (https://phabricator.wikimedia.org/T416748) [11:28:00] (03CR) 10Kevin Bazira: [C:03+1] "LGTM!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis) [11:28:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T418465)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260303-112828-marostegui.json [11:30:14] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1352.eqiad.wmnet with OS trixie [11:30:50] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1353.eqiad.wmnet with OS trixie [11:30:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:31:34] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1354.eqiad.wmnet with OS trixie [11:31:43] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1355.eqiad.wmnet with OS trixie [11:32:02] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:34:04] PROBLEM - Host an-worker1172 is DOWN: PING CRITICAL - Packet loss = 100% [11:36:57] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp - 3.0 upgrade () [11:38:52] jouncebot: nowandnext [11:38:52] For the next 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1100) [11:38:52] In 1 hour(s) and 21 
minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1300) [11:39:56] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1247516 (https://phabricator.wikimedia.org/T417253) (owner: Fabfur) [11:40:10] (PS1) Zabe: ImageListPager: Use correct name field for batch lookups [core] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247569 (https://phabricator.wikimedia.org/T418327) [11:40:41] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - 3.0 upgrade () [11:42:21] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1352.eqiad.wmnet with reason: host reimage [11:43:00] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1353.eqiad.wmnet with reason: host reimage [11:43:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P89684 and previous config saved to /var/cache/conftool/dbconfig/20260303-114341-marostegui.json [11:43:42] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1354.eqiad.wmnet with reason: host reimage [11:44:00] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1355.eqiad.wmnet with reason: host reimage [11:45:06] (PS1) Aqu: dse-k8s airflow-analytics-test: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1247570 (https://phabricator.wikimedia.org/T415874) [11:48:18] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1352.eqiad.wmnet with reason: host reimage [11:50:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T418465)', diff saved to https://phabricator.wikimedia.org/P89685 and previous config saved to 
/var/cache/conftool/dbconfig/20260303-115057-marostegui.json [11:51:02] (PS2) Aqu: dse-k8s airflow-analytics-test: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1247570 (https://phabricator.wikimedia.org/T415874) [11:51:02] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [11:52:01] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1354.eqiad.wmnet with reason: host reimage [11:54:53] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11667612 (Ladsgroup) >>! In T418745#11665357, @Tacsipacsi wrote: > TIFFs (e.g.... [11:58:18] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1355.eqiad.wmnet with reason: host reimage [11:58:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P89686 and previous config saved to /var/cache/conftool/dbconfig/20260303-115847-marostegui.json [11:58:54] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:59:58] SRE, Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11667619 (MoritzMuehlenhoff) [12:02:27] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1353.eqiad.wmnet with reason: host reimage [12:04:46] !log jayme@cumin1003 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host wikikube-worker1352.eqiad.wmnet with OS trixie [12:06:00] RECOVERY - Host an-worker1172 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [12:06:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P89687 and previous config saved to /var/cache/conftool/dbconfig/20260303-120604-marostegui.json [12:07:12] (CR) Muehlenhoff: [C:+2] Unconditionally assume Facter 4 [puppet] - https://gerrit.wikimedia.org/r/1244725 (owner: Muehlenhoff) [12:08:24] (Abandoned) Phuedx: EventStreamConfig: Remove wikibase.client.interaction stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1198029 (https://phabricator.wikimedia.org/T370045) (owner: Phuedx) [12:08:29] (CR) Phuedx: "Of course!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198029 (https://phabricator.wikimedia.org/T370045) (owner: Phuedx) [12:08:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1172.eqiad.wmnet [12:08:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1173.eqiad.wmnet [12:09:21] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667673 (ops-monitoring-bot) Host an-worker1173.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[12:09:23] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1354.eqiad.wmnet with OS trixie [12:10:34] (PS2) Muehlenhoff: profile::dns::recursor: Unconditionally enable the webserver [puppet] - https://gerrit.wikimedia.org/r/1243751 [12:12:08] (CR) Muehlenhoff: [C:+2] etcd::client::globalconfig: Remove inactive check [puppet] - https://gerrit.wikimedia.org/r/1243787 (owner: Muehlenhoff) [12:12:30] (CR) Daniel Kinzler: [C:+2] rest-gateway: use rlc claim from cookie with bearer token [deployment-charts] - https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: Daniel Kinzler) [12:13:22] (CR) Muehlenhoff: [C:+2] ldap::client::sssd: Only support socket activation [puppet] - https://gerrit.wikimedia.org/r/1243795 (owner: Muehlenhoff) [12:13:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T418465)', diff saved to https://phabricator.wikimedia.org/P89688 and previous config saved to /var/cache/conftool/dbconfig/20260303-121355-marostegui.json [12:13:59] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:14:10] (PS2) Muehlenhoff: openstack: Remove two buster checks [puppet] - https://gerrit.wikimedia.org/r/1243788 [12:14:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2219.codfw.wmnet with reason: Maintenance [12:14:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T418465)', diff saved to https://phabricator.wikimedia.org/P89689 and previous config saved to /var/cache/conftool/dbconfig/20260303-121420-marostegui.json [12:14:39] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1355.eqiad.wmnet with OS trixie [12:14:52] (Merged) 
jenkins-bot: rest-gateway: use rlc claim from cookie with bearer token [deployment-charts] - https://gerrit.wikimedia.org/r/1241581 (https://phabricator.wikimedia.org/T418042) (owner: Daniel Kinzler) [12:15:06] (CR) Muehlenhoff: [C:+2] standard_packages: Remove more obsolete packages after bullseye->bookworm update [puppet] - https://gerrit.wikimedia.org/r/1247539 (owner: Muehlenhoff) [12:15:18] (PS1) Jakob: Enable Wikibase GraphQL on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1247576 (https://phabricator.wikimedia.org/T417619) [12:15:26] (CR) Arnaudb: "my idea was to avoid service disruption for those who use the https interface to clone from the replica. OTOH since the config issue are a" [dns] - https://gerrit.wikimedia.org/r/1247530 (https://phabricator.wikimedia.org/T418108) (owner: Arnaudb) [12:15:28] (PS1) Jakob: Enable Wikibase GraphQL on production wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1247577 (https://phabricator.wikimedia.org/T417619) [12:15:50] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1356.eqiad.wmnet with OS trixie [12:15:52] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247576 (https://phabricator.wikimedia.org/T417619) (owner: Jakob) [12:16:09] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1357.eqiad.wmnet with OS trixie [12:16:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247577 (https://phabricator.wikimedia.org/T417619) (owner: Jakob) [12:16:32] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1358.eqiad.wmnet with OS trixie [12:17:45] (PS1) Esanders: PasteCheck: Enable by default [extensions/VisualEditor] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247578 (https://phabricator.wikimedia.org/T405127) [12:19:03] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1353.eqiad.wmnet with OS trixie [12:19:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/VisualEditor] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247578 (https://phabricator.wikimedia.org/T405127) (owner: Esanders) [12:20:00] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:20:12] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:21:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P89690 and previous config saved to /var/cache/conftool/dbconfig/20260303-122112-marostegui.json [12:21:13] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:21:24] jouncebot: nowandnext [12:21:25] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [12:21:25] In 0 hour(s) and 38 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1300) [12:21:29] cool cool [12:22:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1173.eqiad.wmnet [12:22:54] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1174.eqiad.wmnet [12:23:16] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667753 
(ops-monitoring-bot) Host an-worker1174.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:24:07] (CR) Joal: [C:+1] varnish: add headers to x-analytics [puppet] - https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) (owner: Fabfur) [12:25:21] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247559 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup) [12:26:21] (Merged) jenkins-bot: Enable thumb steps on private wikis too [mediawiki-config] - https://gerrit.wikimedia.org/r/1247559 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup) [12:26:57] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:27:04] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1247559|Enable thumb steps on private wikis too (T414805)]] [12:27:08] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [12:27:48] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:28:15] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1356.eqiad.wmnet with reason: host reimage [12:28:19] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1357.eqiad.wmnet with reason: host reimage [12:28:44] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1358.eqiad.wmnet with reason: host reimage [12:30:27] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in eqiad/ml-serve-eqiad: maintenance [12:31:12] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1247559|Enable thumb steps on private wikis too (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:31:27] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:31:28] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in eqiad/ml-serve-eqiad: maintenance [12:31:59] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:33:05] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1356.eqiad.wmnet with reason: host reimage [12:33:56] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:34:10] !log dpogorzelski@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=recommendation-api,name=eqiad [12:34:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1174.eqiad.wmnet [12:34:37] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1175.eqiad.wmnet [12:34:59] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667816 (ops-monitoring-bot) Host an-worker1175.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[12:35:30] (PS1) Slyngshede: P:idm disallow signups from select domains [puppet] - https://gerrit.wikimedia.org/r/1247584 (https://phabricator.wikimedia.org/T418201) [12:36:20] (CR) Daniel Kinzler: [C:+2] rest-gateway: no limits for wmcs for now [deployment-charts] - https://gerrit.wikimedia.org/r/1243807 (owner: Daniel Kinzler) [12:36:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T418465)', diff saved to https://phabricator.wikimedia.org/P89691 and previous config saved to /var/cache/conftool/dbconfig/20260303-123619-marostegui.json [12:36:23] (CR) Majavah: [C:+1] openstack: Remove two buster checks [puppet] - https://gerrit.wikimedia.org/r/1243788 (owner: Muehlenhoff) [12:36:23] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [12:36:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1242.eqiad.wmnet with reason: Maintenance [12:36:40] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1359.eqiad.wmnet with OS trixie [12:36:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T418465)', diff saved to https://phabricator.wikimedia.org/P89692 and previous config saved to /var/cache/conftool/dbconfig/20260303-123642-marostegui.json [12:36:50] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1357.eqiad.wmnet with reason: host reimage [12:37:56] (CR) CI reject: [V:-1] P:idm disallow signups from select domains [puppet] - https://gerrit.wikimedia.org/r/1247584 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede) [12:38:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T418465)', diff saved to https://phabricator.wikimedia.org/P89693 and 
previous config saved to /var/cache/conftool/dbconfig/20260303-123827-marostegui.json [12:39:48] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:39:52] (PS5) Daniel Kinzler: rest-gateway: no limits for wmcs for now [deployment-charts] - https://gerrit.wikimedia.org/r/1243807 [12:40:06] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247559|Enable thumb steps on private wikis too (T414805)]] (duration: 13m 01s) [12:40:09] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [12:40:13] (PS4) Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) [12:41:02] (CR) Daniel Kinzler: [C:+2] rest-gateway: no limits for wmcs for now [deployment-charts] - https://gerrit.wikimedia.org/r/1243807 (owner: Daniel Kinzler) [12:41:50] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:42:40] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [12:42:52] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [12:43:04] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1358.eqiad.wmnet with reason: host reimage [12:43:07] (PS2) Slyngshede: P:idm disallow signups from select domains [puppet] - https://gerrit.wikimedia.org/r/1247584 (https://phabricator.wikimedia.org/T418201) [12:43:08] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:43:18] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[12:43:20] (Merged) jenkins-bot: rest-gateway: no limits for wmcs for now [deployment-charts] - https://gerrit.wikimedia.org/r/1243807 (owner: Daniel Kinzler) [12:43:43] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [12:43:52] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:43:54] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:44:07] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [12:44:19] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:44:32] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:44:51] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [12:45:07] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:45:11] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:45:23] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:45:40] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[12:45:44] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:46:02] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:46:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1175.eqiad.wmnet [12:46:14] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1176.eqiad.wmnet [12:46:18] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:46:36] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:46:41] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667864 (ops-monitoring-bot) Host an-worker1176.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:47:02] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:47:11] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:47:23] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[12:47:42] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:47:57] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1247584 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede) [12:48:34] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1356.eqiad.wmnet with OS trixie [12:48:54] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1359.eqiad.wmnet with reason: host reimage [12:50:42] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:51:16] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:52:55] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1357.eqiad.wmnet with OS trixie [12:53:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P89694 and previous config saved to /var/cache/conftool/dbconfig/20260303-125335-marostegui.json [12:53:40] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1359.eqiad.wmnet with reason: host reimage [12:54:53] (CR) Esanders: "Do we need to backport? 
The messages are on TW already:" [extensions/VisualEditor] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247434 (https://phabricator.wikimedia.org/T414987) (owner: Medelius) [12:55:12] (PS3) Slyngshede: P:idm disallow signups from select domains [puppet] - https://gerrit.wikimedia.org/r/1247584 (https://phabricator.wikimedia.org/T418201) [12:55:43] (PS3) Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [12:55:54] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:56:22] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:56:55] (CR) Daniel Kinzler: [C:+2] rest-gateway: assign ratelimit class by network range [deployment-charts] - https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler) [12:58:17] (PS1) Filippo Giunchedi: openstack: do not restart rabbitmq-server on cert renewal [puppet] - https://gerrit.wikimedia.org/r/1247588 (https://phabricator.wikimedia.org/T418444) [12:58:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:59:00] (CR) Filippo Giunchedi: "The idea being to check back in a few days if certificates have been reloaded" [puppet] - https://gerrit.wikimedia.org/r/1247588 (https://phabricator.wikimedia.org/T418444) (owner: Filippo Giunchedi) [12:59:03] (Merged) jenkins-bot: rest-gateway: assign ratelimit class by network range [deployment-charts] - https://gerrit.wikimedia.org/r/1244696 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler) [12:59:38] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker1358.eqiad.wmnet with OS trixie [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1300) [13:00:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1176.eqiad.wmnet [13:00:23] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:00:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1177.eqiad.wmnet [13:00:35] (CR) CI reject: [V:-1] memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: Effie Mouzeli) [13:00:45] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667942 (ops-monitoring-bot) Host an-worker1177.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:01:06] (CR) Filippo Giunchedi: "e.g. 
currently on cloudrabbit1001 certs are set to expire on March 25th, thus renewal will happen in ~7d IIRC" [puppet] - https://gerrit.wikimedia.org/r/1247588 (https://phabricator.wikimedia.org/T418444) (owner: Filippo Giunchedi) [13:01:14] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:01:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T418465)', diff saved to https://phabricator.wikimedia.org/P89695 and previous config saved to /var/cache/conftool/dbconfig/20260303-130117-marostegui.json [13:01:21] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:02:52] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1247584 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede) [13:02:58] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1247588 (https://phabricator.wikimedia.org/T418444) (owner: Filippo Giunchedi) [13:04:17] (CR) Filippo Giunchedi: [C:+2] openstack: do not restart rabbitmq-server on cert renewal [puppet] - https://gerrit.wikimedia.org/r/1247588 (https://phabricator.wikimedia.org/T418444) (owner: Filippo Giunchedi) [13:05:25] (CR) Brouberol: [C:+1] dse-k8s airflow-analytics-test: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1247570 (https://phabricator.wikimedia.org/T415874) (owner: Aqu) [13:08:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P89696 and previous config saved to /var/cache/conftool/dbconfig/20260303-130842-marostegui.json [13:10:15] (CR) Jelto: [C:+1] "I'd not worry too much about breaking the replica for a short amount of time. 
If end-users or bots require a high uptime they should use t" [dns] - https://gerrit.wikimedia.org/r/1247530 (https://phabricator.wikimedia.org/T418108) (owner: Arnaudb) [13:10:17] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1359.eqiad.wmnet with OS trixie [13:11:43] (CR) Vgutierrez: "overall it looks good (VTCs are happy), please see inline comments" [puppet] - https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) (owner: Fabfur) [13:11:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1177.eqiad.wmnet [13:11:54] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1178.eqiad.wmnet [13:12:11] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11667959 (ops-monitoring-bot) Host an-worker1178.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[13:13:15] (PS1) Tiziano Fogli: Revert "P::thanos::store::ruler (TMP): select only blocks generated locally" [puppet] - https://gerrit.wikimedia.org/r/1247590 [13:13:27] (PS1) Tiziano Fogli: Revert "thanos/querier (TMP): filter out non local ruler from query configs" [puppet] - https://gerrit.wikimedia.org/r/1247591 [13:16:06] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in eqiad/ml-serve-eqiad: maintenance [13:16:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P89697 and previous config saved to /var/cache/conftool/dbconfig/20260303-131624-marostegui.json [13:17:02] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in eqiad/ml-serve-eqiad: maintenance [13:17:05] !log dpogorzelski@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=recommendation-api,name=eqiad [13:19:58] !log Thanos: re-enable querier<->ruler cross-site traffic T412924 [13:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:02] T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924 [13:21:23] (PS2) Tiziano Fogli: Revert "P::thanos::store::ruler (TMP): select only blocks generated locally" [puppet] - https://gerrit.wikimedia.org/r/1247590 [13:21:40] (PS2) Tiziano Fogli: Revert "thanos/querier (TMP): filter out non local ruler from query configs" [puppet] - https://gerrit.wikimedia.org/r/1247591 [13:23:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T418465)', diff saved to https://phabricator.wikimedia.org/P89698 and previous config saved to /var/cache/conftool/dbconfig/20260303-132350-marostegui.json [13:23:54] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps 
wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:23:54] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:23:54] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:24:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2236.codfw.wmnet with reason: Maintenance [13:24:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T418465)', diff saved to https://phabricator.wikimedia.org/P89699 and previous config saved to /var/cache/conftool/dbconfig/20260303-132414-marostegui.json [13:25:36] (03PS5) 10Fabfur: varnish: add trusted_req and rl_class fields to x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) [13:25:47] (03CR) 10Fabfur: varnish: add trusted_req and rl_class fields to x-analytics (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [13:26:25] (03CR) 10Tiziano Fogli: [C:03+2] Revert "P::thanos::store::ruler (TMP): select only blocks generated locally" [puppet] - 10https://gerrit.wikimedia.org/r/1247590 (owner: 10Tiziano Fogli) [13:26:35] (03CR) 10Tiziano Fogli: [C:03+2] Revert "thanos/querier (TMP): filter out non local ruler from query configs" [puppet] - 10https://gerrit.wikimedia.org/r/1247591 (owner: 10Tiziano Fogli) [13:29:10] (03PS1) 10Gergő Tisza: Enable JWT session cookie for bot passwords (all wikis) (attempt #2) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1247596 (https://phabricator.wikimedia.org/T415007) [13:29:30] (03PS3) 10CDanis: haproxy: ja3n is session-scoped [puppet] - 10https://gerrit.wikimedia.org/r/1247194 [13:29:31] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247194 (owner: 10CDanis) [13:29:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11667996 (10Jclark-ctr) @MatthewVernon, sorry about that I forgot to go back and set up the HDDs. They’re all set up now, and the drives are pending. It should be done sh... [13:30:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11667997 (10Jclark-ctr) [13:31:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P89700 and previous config saved to /var/cache/conftool/dbconfig/20260303-133131-marostegui.json [13:31:45] !log installing NSS security updates [13:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:08] PROBLEM - Host an-worker1178 is DOWN: PING CRITICAL - Packet loss = 100% [13:32:34] (03PS1) 10Joal: Update HDFS RPC queue alerts [alerts] - 10https://gerrit.wikimedia.org/r/1247600 (https://phabricator.wikimedia.org/T418152) [13:34:31] (03CR) 10CI reject: [V:04-1] Update HDFS RPC queue alerts [alerts] - 10https://gerrit.wikimedia.org/r/1247600 (https://phabricator.wikimedia.org/T418152) (owner: 10Joal) [13:42:17] (03CR) 10CDanis: [C:03+2] haproxy: ja3n is session-scoped [puppet] - 10https://gerrit.wikimedia.org/r/1247194 (owner: 10CDanis) [13:44:13] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown 
[13:45:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, I can't think of an additional domain to block ATM" [puppet] - 10https://gerrit.wikimedia.org/r/1247584 (https://phabricator.wikimedia.org/T418201) (owner: 10Slyngshede) [13:45:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T418465)', diff saved to https://phabricator.wikimedia.org/P89701 and previous config saved to /var/cache/conftool/dbconfig/20260303-134554-marostegui.json [13:45:58] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [13:46:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T418465)', diff saved to https://phabricator.wikimedia.org/P89702 and previous config saved to /var/cache/conftool/dbconfig/20260303-134639-marostegui.json [13:46:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1243.eqiad.wmnet with reason: Maintenance [13:47:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T418465)', diff saved to https://phabricator.wikimedia.org/P89703 and previous config saved to /var/cache/conftool/dbconfig/20260303-134702-marostegui.json [13:49:17] (03CR) 10Btullis: [C:03+1] dse-k8s airflow-analytics-test: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247570 (https://phabricator.wikimedia.org/T415874) (owner: 10Aqu) [13:54:56] (03PS1) 10Vgutierrez: apt::package_from_component: Split ensure and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/1247606 [13:56:47] (03CR) 10Fabfur: "varnish tests looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [13:57:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1247606 (owner: 10Vgutierrez) [14:00:05] Lucas_WMDE, Urbanecm, 
and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1400). [14:00:05] nya_1F616EMO, jakob_WMDE, and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:14] o/ [14:00:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240716 (https://phabricator.wikimedia.org/T400063) (owner: 10Esanders) [14:01:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P89704 and previous config saved to /var/cache/conftool/dbconfig/20260303-140102-marostegui.json [14:02:04] (03PS1) 10AOkoth: catalog: add wmf-navigator to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1247608 (https://phabricator.wikimedia.org/T414405) [14:02:45] jakob_WMDE: are you self-deploying? [14:02:49] (03CR) 10Silvan Heintze: [C:03+1] Enable Wikibase GraphQL on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247576 (https://phabricator.wikimedia.org/T417619) (owner: 10Jakob) [14:02:56] (03CR) 10Silvan Heintze: [C:03+1] Enable Wikibase GraphQL on production wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247577 (https://phabricator.wikimedia.org/T417619) (owner: 10Jakob) [14:02:59] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11668102 (10MatthewVernon) 05Open→03Resolved @Jclark-ctr yep, both look good now, thanks! 
[14:03:03] no, I'd like someone else to deploy if possible [14:04:01] I can start if you like - I assume the 2 patches are going together? [14:05:13] they could go together. I thought that doing test wikidata first might be safer but I also don't expect any explosions [14:05:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247576 (https://phabricator.wikimedia.org/T417619) (owner: 10Jakob) [14:05:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247577 (https://phabricator.wikimedia.org/T417619) (owner: 10Jakob) [14:06:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:06:51] (03Merged) 10jenkins-bot: Enable Wikibase GraphQL on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247576 (https://phabricator.wikimedia.org/T417619) (owner: 10Jakob) [14:06:54] (03Merged) 10jenkins-bot: Enable Wikibase GraphQL on production wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247577 (https://phabricator.wikimedia.org/T417619) (owner: 10Jakob) [14:06:58] (03PS2) 10Vgutierrez: apt::package_from_component: Split ensure and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/1247606 [14:07:24] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1247576|Enable Wikibase GraphQL on test.wikidata.org (T417619)]], [[gerrit:1247577|Enable Wikibase GraphQL on production wikidata.org (T417619)]] [14:07:27] T417619: Turn the feature flag on for GraphQL on prod and test wikidata - https://phabricator.wikimedia.org/T417619 [14:08:48] (03CR) 10Ssingh: 
[C:03+1] "Sorry for missing this earlier. Thanks and looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1243751 (owner: 10Muehlenhoff) [14:09:00] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11668151 (10tappof) The solution outlined in the diagram has been implemented. It is now possible to test the backfill process with the new conf... [14:09:27] !log esanders@deploy2002 esanders, jakob: Backport for [[gerrit:1247576|Enable Wikibase GraphQL on test.wikidata.org (T417619)]], [[gerrit:1247577|Enable Wikibase GraphQL on production wikidata.org (T417619)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:10:11] edsanders: works on test, thanks! [14:10:56] jakob_WMDE: and wikidata.org? [14:11:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:11:26] edsanders: yes, there too. yay! 
[14:11:34] \o/ [14:11:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T418465)', diff saved to https://phabricator.wikimedia.org/P89706 and previous config saved to /var/cache/conftool/dbconfig/20260303-141142-marostegui.json [14:11:45] !log esanders@deploy2002 esanders, jakob: Continuing with sync [14:11:47] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:11:55] (03PS1) 10Elukey: confluent::kafka::common: update ensure for apt configs [puppet] - 10https://gerrit.wikimedia.org/r/1247612 [14:12:05] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Request to deactivate/disable AndreiJirohOnDevsCentral LDAP dev account - https://phabricator.wikimedia.org/T418068#11668172 (10A_smart_kitten) Thank you @SLyngshede-WMF :) And apologies @jelto/all if I worded my pr... 
[14:12:19] (03Abandoned) 10Medelius: Create message strings for experimental checks [extensions/VisualEditor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247434 (https://phabricator.wikimedia.org/T414987) (owner: 10Medelius) [14:12:24] (03PS1) 10Muehlenhoff: rsyslog: Remove obsolete and misleading comments [puppet] - 10https://gerrit.wikimedia.org/r/1247613 [14:13:05] (03PS2) 10Elukey: confluent::kafka::common: update ensure for apt configs [puppet] - 10https://gerrit.wikimedia.org/r/1247612 [14:13:36] (03CR) 10Elukey: [C:03+1] apt::package_from_component: Split ensure and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/1247606 (owner: 10Vgutierrez) [14:13:38] (03CR) 10CI reject: [V:04-1] confluent::kafka::common: update ensure for apt configs [puppet] - 10https://gerrit.wikimedia.org/r/1247612 (owner: 10Elukey) [14:14:39] (03PS3) 10Elukey: confluent::kafka::common: update ensure for apt configs [puppet] - 10https://gerrit.wikimedia.org/r/1247612 [14:15:41] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247576|Enable Wikibase GraphQL on test.wikidata.org (T417619)]], [[gerrit:1247577|Enable Wikibase GraphQL on production wikidata.org (T417619)]] (duration: 08m 17s) [14:15:44] T417619: Turn the feature flag on for GraphQL on prod and test wikidata - https://phabricator.wikimedia.org/T417619 [14:16:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P89707 and previous config saved to /var/cache/conftool/dbconfig/20260303-141610-marostegui.json [14:16:57] (03CR) 10Vgutierrez: [C:03+2] apt::package_from_component: Split ensure and ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/1247606 (owner: 10Vgutierrez) [14:18:08] (03CR) 10Vgutierrez: [C:03+1] confluent::kafka::common: update ensure for apt configs [puppet] - 10https://gerrit.wikimedia.org/r/1247612 (owner: 10Elukey) [14:18:13] (03PS1) 10AOkoth: aux: add 
wmf-navigator namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247615 [14:18:36] (03CR) 10Elukey: [C:03+2] confluent::kafka::common: update ensure for apt configs [puppet] - 10https://gerrit.wikimedia.org/r/1247612 (owner: 10Elukey) [14:19:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247578 (https://phabricator.wikimedia.org/T405127) (owner: 10Esanders) [14:20:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1097.eqiad.wmnet [14:20:49] (03CR) 10Gehel: [C:04-1] "LGTM, except failing test" [alerts] - 10https://gerrit.wikimedia.org/r/1247600 (https://phabricator.wikimedia.org/T418152) (owner: 10Joal) [14:21:02] (03Merged) 10jenkins-bot: PasteCheck: Enable by default [extensions/VisualEditor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247578 (https://phabricator.wikimedia.org/T405127) (owner: 10Esanders) [14:21:35] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1247578|PasteCheck: Enable by default (T405127)]] [14:21:38] T405127: [MILESTONE] Deploy Paste Check to all Wikipedias - https://phabricator.wikimedia.org/T405127 [14:22:14] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11668230 (10Ladsgroup) Top "file formats" for the non-standard sizes with enwiki as referrer are as follows: ` spark-sql (default)> s... [14:23:35] !log esanders@deploy2002 esanders: Backport for [[gerrit:1247578|PasteCheck: Enable by default (T405127)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:25:45] !log esanders@deploy2002 esanders: Continuing with sync [14:26:44] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - 3.0 upgrade () [14:26:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P89708 and previous config saved to /var/cache/conftool/dbconfig/20260303-142649-marostegui.json [14:27:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1097.eqiad.wmnet [14:27:11] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp - 3.0 upgrade () [14:28:19] (03PS1) 10JMeybohm: Add wikikube-worker[1352-1359] [puppet] - 10https://gerrit.wikimedia.org/r/1247617 (https://phabricator.wikimedia.org/T418259) [14:29:05] (03Abandoned) 10Trueg: admin: bash cfg for trueg home dir [puppet] - 10https://gerrit.wikimedia.org/r/1237855 (owner: 10Trueg) [14:29:36] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247578|PasteCheck: Enable by default (T405127)]] (duration: 08m 01s) [14:29:46] T405127: [MILESTONE] Deploy Paste Check to all Wikipedias - https://phabricator.wikimedia.org/T405127 [14:30:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11668254 (10MatthewVernon) 05Resolved→03Open @Jclark-ctr sorry, I was wrong, the disks are now setup incorrectly - it looks like you've set them up as a set of RAID-0... 
[14:30:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240716 (https://phabricator.wikimedia.org/T400063) (owner: 10Esanders) [14:31:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T418465)', diff saved to https://phabricator.wikimedia.org/P89709 and previous config saved to /var/cache/conftool/dbconfig/20260303-143117-marostegui.json [14:31:22] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:31:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2237.codfw.wmnet with reason: Maintenance [14:31:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T418465)', diff saved to https://phabricator.wikimedia.org/P89710 and previous config saved to /var/cache/conftool/dbconfig/20260303-143141-marostegui.json [14:31:44] (03Merged) 10jenkins-bot: Remove Editing-related config for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240716 (https://phabricator.wikimedia.org/T400063) (owner: 10Esanders) [14:31:44] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1178.eqiad.wmnet [14:31:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1179.eqiad.wmnet [14:32:01] (03CR) 10JMeybohm: [C:03+2] Add wikikube-worker[1352-1359] [puppet] - 10https://gerrit.wikimedia.org/r/1247617 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [14:32:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668270 (10ops-monitoring-bot) Host an-worker1179.eqiad.wmnet rebooted by 
btullis@cumin1003 with reason: Rebo... [14:32:16] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1240716|Remove Editing-related config for special wikis (T400063)]] [14:32:20] T400063: Clean up Editing-related settings on ex-Wikipedia special wikis - https://phabricator.wikimedia.org/T400063 [14:32:55] (03PS1) 10Muehlenhoff: aptrepo: Remove buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1247618 [14:34:15] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:34:19] !log esanders@deploy2002 esanders: Backport for [[gerrit:1240716|Remove Editing-related config for special wikis (T400063)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:34:29] (03CR) 10Gkyziridis: [C:03+2] changeprop: Add revertrisk-multilingual model to changeprop staging configuration. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [14:34:59] !log esanders@deploy2002 esanders: Continuing with sync [14:35:11] (03CR) 10Jelto: [C:03+1] "thank you for the change, looks good to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247615 (owner: 10AOkoth) [14:35:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11668307 (10Jclark-ctr) @MatthewVernon I looked at ms-be1095, and it looked to be set up the same way. I’ll need to go back through my notes to see what I’m missing. I w... [14:36:22] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:36:56] (03Merged) 10jenkins-bot: changeprop: Add revertrisk-multilingual model to changeprop staging configuration. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1247025 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [14:37:47] (03CR) 10Jelto: [C:03+1] "looks reasonable to me, but let's get a review from traffic as well" [puppet] - 10https://gerrit.wikimedia.org/r/1247608 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [14:38:50] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1240716|Remove Editing-related config for special wikis (T400063)]] (duration: 06m 34s) [14:38:54] T400063: Clean up Editing-related settings on ex-Wikipedia special wikis - https://phabricator.wikimedia.org/T400063 [14:38:56] !log gkyziridis@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:39:51] (03PS1) 10Muehlenhoff: Remove support for PHP 7.4/8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1247620 [14:41:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P89711 and previous config saved to /var/cache/conftool/dbconfig/20260303-144156-marostegui.json [14:43:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1179.eqiad.wmnet [14:43:25] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1180.eqiad.wmnet [14:43:31] (03PS1) 10Muehlenhoff: memcached: Update comment on TLS support [puppet] - 10https://gerrit.wikimedia.org/r/1247621 [14:43:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668346 (10ops-monitoring-bot) Host an-worker1180.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[14:45:33] FIRING: [7x] KubernetesCalicoDown: wikikube-worker1352.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:46:02] RECOVERY - Host an-worker1178 is UP: PING OK - Packet loss = 0%, RTA = 4.22 ms [14:46:15] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1352-1359].eqiad.wmnet [14:46:16] !log jayme@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1352-1359].eqiad.wmnet [14:46:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11668376 (10MatthewVernon) Looking at 1095, the drives appear in the web-iDRAC as "NonRAID Disk 0" and the Storage Overview says 26 "Non-RAID Disks". On 1096, they inste... [14:49:32] PROBLEM - Host an-worker1178 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:34] !log installing php7.4 security updates [14:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:33] FIRING: [7x] KubernetesCalicoDown: wikikube-worker1352.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:50:50] (03CR) 10Eevans: "Gotcha, I was following https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service in sequence, and went https://wikitech.wikimedia." 
[puppet] - 10https://gerrit.wikimedia.org/r/1247175 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:52:44] (03PS2) 10Eevans: service: add linked-artifact service (k8s ingress) [puppet] - 10https://gerrit.wikimedia.org/r/1247175 (https://phabricator.wikimedia.org/T414112) [14:52:59] (03PS5) 10JMeybohm: loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) [14:52:59] (03PS1) 10JMeybohm: k8s.pool-depool-node: Don't ask for confirmation on check [cookbooks] - 10https://gerrit.wikimedia.org/r/1247624 (https://phabricator.wikimedia.org/T410537) [14:55:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1180.eqiad.wmnet [14:55:33] FIRING: [8x] KubernetesCalicoDown: wikikube-worker1352.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:55:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1181.eqiad.wmnet [14:55:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T418465)', diff saved to https://phabricator.wikimedia.org/P89712 and previous config saved to /var/cache/conftool/dbconfig/20260303-145541-marostegui.json [14:55:45] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [14:55:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668439 (10ops-monitoring-bot) Host an-worker1181.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[14:57:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T418465)', diff saved to https://phabricator.wikimedia.org/P89713 and previous config saved to /var/cache/conftool/dbconfig/20260303-145704-marostegui.json [14:57:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1244.eqiad.wmnet with reason: Maintenance [14:57:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1244 (T418465)', diff saved to https://phabricator.wikimedia.org/P89714 and previous config saved to /var/cache/conftool/dbconfig/20260303-145727-marostegui.json [14:57:45] (03CR) 10JMeybohm: "Hm..good question. You could move the ingress section down below the deployment section. But that does not make sense in all cases (since " [puppet] - 10https://gerrit.wikimedia.org/r/1247175 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [14:59:26] (03CR) 10Filippo Giunchedi: [C:03+1] "\o/ \o/ \o/ feels good" [puppet] - 10https://gerrit.wikimedia.org/r/1247618 (owner: 10Muehlenhoff) [14:59:46] (03CR) 10Filippo Giunchedi: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1247613 (owner: 10Muehlenhoff) [15:00:04] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1500) [15:00:33] FIRING: [8x] KubernetesCalicoDown: wikikube-worker1352.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:00:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:03:02] RECOVERY - Host an-worker1178 is UP: PING OK - Packet loss = 
0%, RTA = 0.49 ms [15:04:00] (03PS1) 10Vgutierrez: trafficserver: Fix backend rules for gerrit-(replica|spare) [puppet] - 10https://gerrit.wikimedia.org/r/1247625 [15:04:50] (03CR) 10Vgutierrez: [C:03+2] trafficserver: Fix backend rules for gerrit-(replica|spare) [puppet] - 10https://gerrit.wikimedia.org/r/1247625 (owner: 10Vgutierrez) [15:05:10] (03CR) 10Dzahn: [C:03+1] cache:text: add gerrit-spare to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/1247528 (https://phabricator.wikimedia.org/T418361) (owner: 10Arnaudb) [15:05:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:07:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1181.eqiad.wmnet [15:07:21] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1182.eqiad.wmnet [15:07:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668507 (10ops-monitoring-bot) Host an-worker1182.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [15:08:19] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:10:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P89715 and previous config saved to /var/cache/conftool/dbconfig/20260303-151049-marostegui.json [15:13:21] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1178.eqiad.wmnet [15:13:57] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp - 3.0 upgrade () [15:14:13] FIRING: [2x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:14:22] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp - 3.0 upgrade () [15:14:42] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp - 3.0 upgrade () [15:15:33] RESOLVED: [2x] KubernetesCalicoDown: wikikube-worker1352.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:15:34] !log fabfur@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp - 3.0 upgrade () [15:16:36] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5032.*} and A:cp - 3.0 upgrade () [15:19:05] Oopsie, I forgot the backport thing [15:19:10] was working on local community [15:19:16] Will reschedule [15:19:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1182.eqiad.wmnet [15:19:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1183.eqiad.wmnet [15:19:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE 
(2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668562 (10ops-monitoring-bot) Host an-worker1183.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [15:20:22] Sorry edsanders [15:21:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T418465)', diff saved to https://phabricator.wikimedia.org/P89716 and previous config saved to /var/cache/conftool/dbconfig/20260303-152157-marostegui.json [15:22:00] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5032.*} and A:cp - 3.0 upgrade () [15:22:01] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:22:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244373 (https://phabricator.wikimedia.org/T418089) (owner: 101F616EMO) [15:22:21] ^ done [15:23:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1178.eqiad.wmnet [15:25:46] (03PS1) 10Ssingh: wikimedia.org/wikipedia.org: bump TTL for NS and glue records [dns] - 10https://gerrit.wikimedia.org/r/1247626 [15:25:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P89717 and previous config saved to /var/cache/conftool/dbconfig/20260303-152557-marostegui.json [15:29:26] (03CR) 10BBlack: [C:03+1] "SGTM. Note many of our other zones have these at 1D. 
These are not as bad as the 1H, but probably should be shifted to 2D as well in a s" [dns] - 10https://gerrit.wikimedia.org/r/1247626 (owner: 10Ssingh) [15:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1530) [15:30:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1183.eqiad.wmnet [15:30:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1184.eqiad.wmnet [15:31:06] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:31:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668642 (10ops-monitoring-bot) Host an-worker1184.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[15:35:02] (03PS1) 10JMeybohm: k8s.pool-depool-cookbook: Handle calicoctl exiting with error [cookbooks] - 10https://gerrit.wikimedia.org/r/1247628 (https://phabricator.wikimedia.org/T418259) [15:36:02] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1352-1359].eqiad.wmnet [15:36:04] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1352-1359].eqiad.wmnet [15:36:56] jouncebot: nowandnext [15:36:57] For the next 0 hour(s) and 23 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1530) [15:36:57] In 0 hour(s) and 23 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1600) [15:37:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P89718 and previous config saved to /var/cache/conftool/dbconfig/20260303-153704-marostegui.json [15:37:24] (03CR) 10Zabe: [C:03+2] ImageListPager: Use correct name field for batch lookups [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247569 (https://phabricator.wikimedia.org/T418327) (owner: 10Zabe) [15:41:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T418465)', diff saved to https://phabricator.wikimedia.org/P89719 and previous config saved to /var/cache/conftool/dbconfig/20260303-154104-marostegui.json [15:41:09] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [15:41:09] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:41:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2239.codfw.wmnet with reason: 
Maintenance [15:41:41] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:42:25] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:42:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1184.eqiad.wmnet [15:42:47] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1185.eqiad.wmnet [15:43:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668738 (10ops-monitoring-bot) Host an-worker1185.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [15:44:17] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:45:09] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:45:17] (03CR) 10Gehel: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247513 (https://phabricator.wikimedia.org/T418665) (owner: 10Brouberol) [15:45:59] (03CR) 10Brouberol: [C:03+2] growhbook: allow WMDE engineers to self-enroll [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247513 (https://phabricator.wikimedia.org/T418665) (owner: 10Brouberol) [15:46:21] (03PS1) 10Daniel Kinzler: rest-gateway: fix time usnit used in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247629 [15:47:16] PROBLEM - Host ms-fe1013 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:38] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11668782 (10elukey) 
After the BMC reset I don't see any issue anymore with ms-fe1013: ` Enabling PXE boot on NIC NIC.Integrated.1-1-1 ` [15:49:25] (03PS8) 10Btullis: Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) [15:49:25] (03PS10) 10Btullis: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [15:49:57] (03Merged) 10jenkins-bot: ImageListPager: Use correct name field for batch lookups [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247569 (https://phabricator.wikimedia.org/T418327) (owner: 10Zabe) [15:49:59] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp - 3.0 upgrade () [15:50:21] (03PS1) 10Herron: mwlog: add trixie hosts to udp tee [puppet] - 10https://gerrit.wikimedia.org/r/1247630 (https://phabricator.wikimedia.org/T417002) [15:50:37] (03PS2) 10Herron: mwlog: add trixie hosts to udp tee [puppet] - 10https://gerrit.wikimedia.org/r/1247630 (https://phabricator.wikimedia.org/T417002) [15:50:37] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1247569|ImageListPager: Use correct name field for batch lookups (T418327)]] [15:50:40] T418327: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Specials\Pager\ImageListPager): [1054] Unk - https://phabricator.wikimedia.org/T418327 [15:50:45] (03PS2) 10Daniel Kinzler: rest-gateway: fix time unit used in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247629 [15:51:08] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: fix time unit used in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247629 (owner: 10Daniel Kinzler) [15:51:30] 
elukey@cumin1003 provision (PID 3200208) is awaiting input [15:52:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P89720 and previous config saved to /var/cache/conftool/dbconfig/20260303-155212-marostegui.json [15:52:54] (03Merged) 10jenkins-bot: rest-gateway: fix time unit used in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247629 (owner: 10Daniel Kinzler) [15:53:47] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp - 3.0 upgrade () [15:54:19] !log zabe@deploy2002 zabe: Backport for [[gerrit:1247569|ImageListPager: Use correct name field for batch lookups (T418327)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:54:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1185.eqiad.wmnet [15:54:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1186.eqiad.wmnet [15:54:39] !log zabe@deploy2002 zabe: Continuing with sync [15:55:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668825 (10ops-monitoring-bot) Host an-worker1186.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [16:00:05] jelto, arnoldokoth, mutante, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1600). 
[16:00:06] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247569|ImageListPager: Use correct name field for batch lookups (T418327)]] (duration: 09m 28s) [16:00:10] T418327: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Specials\Pager\ImageListPager): [1054] Unk - https://phabricator.wikimedia.org/T418327 [16:01:34] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [16:02:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2240.codfw.wmnet with reason: Maintenance [16:02:07] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [16:02:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T418465)', diff saved to https://phabricator.wikimedia.org/P89721 and previous config saved to /var/cache/conftool/dbconfig/20260303-160207-marostegui.json [16:02:11] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:04:48] !log brennen@deploy2002 Started deploy [phabricator/deployment@a883b6d]: deploy phab2002 for T418872 [16:04:52] T418872: Deploy Phab/Phorge 2026-03-03 - https://phabricator.wikimedia.org/T418872 [16:05:20] !log brennen@deploy2002 Finished deploy [phabricator/deployment@a883b6d]: deploy phab2002 for T418872 (duration: 00m 32s) [16:05:57] !log brennen@deploy2002 Started deploy [phabricator/deployment@a883b6d]: deploy phab1004 for T418872 [16:06:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1186.eqiad.wmnet [16:06:22] !log btullis@cumin1003 START - 
Cookbook sre.hosts.reboot-single for host an-worker1187.eqiad.wmnet [16:06:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668899 (10ops-monitoring-bot) Host an-worker1187.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [16:07:04] !log brennen@deploy2002 Finished deploy [phabricator/deployment@a883b6d]: deploy phab1004 for T418872 (duration: 01m 07s) [16:07:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T418465)', diff saved to https://phabricator.wikimedia.org/P89722 and previous config saved to /var/cache/conftool/dbconfig/20260303-160720-marostegui.json [16:07:24] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:07:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1245.eqiad.wmnet with reason: Maintenance [16:08:54] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:16] (03PS7) 10Ryan Kemper: wdqs: Separate deadlock remediation config [puppet] - 10https://gerrit.wikimedia.org/r/1244022 (https://phabricator.wikimedia.org/T242453) [16:10:17] (03PS7) 10Ryan Kemper: wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) [16:11:11] (03CR) 10CI reject: [V:04-1] wdqs: Per-instance deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1244023 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [16:12:31] !log fceratto@cumin1003 
dbctl commit (dc=all): 'Setting db1188 weight to 300 T416705', diff saved to https://phabricator.wikimedia.org/P89723 and previous config saved to /var/cache/conftool/dbconfig/20260303-161230-fceratto.json [16:12:37] T416705: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T416705 [16:13:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Setting db1169 weight to 300 T416705', diff saved to https://phabricator.wikimedia.org/P89724 and previous config saved to /var/cache/conftool/dbconfig/20260303-161323-fceratto.json [16:13:54] FIRING: [5x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:14:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1166: testing:crash [16:14:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1166: testing:crash [16:17:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1187.eqiad.wmnet [16:17:57] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1188.eqiad.wmnet [16:18:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [16:18:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11668982 (10ops-monitoring-bot) Host an-worker1188.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[16:18:42] (03CR) 10Clément Goubert: [C:03+1] k8s.pool-depool-cookbook: Handle calicoctl exiting with error (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1247628 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [16:18:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Setting db1188 weight to 100 T416705', diff saved to https://phabricator.wikimedia.org/P89726 and previous config saved to /var/cache/conftool/dbconfig/20260303-161846-fceratto.json [16:18:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [16:18:50] T416705: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T416705 [16:26:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T418465)', diff saved to https://phabricator.wikimedia.org/P89727 and previous config saved to /var/cache/conftool/dbconfig/20260303-162603-marostegui.json [16:26:05] (03CR) 10Clément Goubert: [C:03+1] k8s.pool-depool-node: Don't ask for confirmation on check [cookbooks] - 10https://gerrit.wikimedia.org/r/1247624 (https://phabricator.wikimedia.org/T410537) (owner: 10JMeybohm) [16:26:08] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:28:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1247.eqiad.wmnet with reason: Maintenance [16:28:10] (03CR) 10Vgutierrez: [C:03+1] loadbalancer.migrate-service-ipip: Allow to skip puppet on realservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1242419 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [16:28:27] (03CR) 10Matthieulec: [C:03+1] "Good catch, thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1247624 (https://phabricator.wikimedia.org/T410537) (owner: 10JMeybohm) [16:28:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Setting x1 codfw weights to 300 T416705', diff saved to https://phabricator.wikimedia.org/P89728 and previous config saved to /var/cache/conftool/dbconfig/20260303-162836-fceratto.json [16:28:40] T416705: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T416705 [16:28:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1247 (T418465)', diff saved to https://phabricator.wikimedia.org/P89729 and previous config saved to /var/cache/conftool/dbconfig/20260303-162845-marostegui.json [16:30:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1188.eqiad.wmnet [16:30:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1189.eqiad.wmnet [16:30:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669045 (10ops-monitoring-bot) Host an-worker1189.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [16:33:54] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:31] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:41:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P89730 and previous config saved to /var/cache/conftool/dbconfig/20260303-164111-marostegui.json [16:41:36] (03PS4) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [16:41:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1189.eqiad.wmnet [16:44:16] (03PS7) 10Ebernhardson: cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) [16:44:16] (03CR) 10Ebernhardson: cirrus: Add semantic search test cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [16:46:22] (03PS1) 10Jforrester: Style fixes for copy-paste feature [extensions/WikiLambda] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247635 (https://phabricator.wikimedia.org/T414072) [16:48:08] (03PS1) 10Muehlenhoff: Add a new role for routed jumbo Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1247636 (https://phabricator.wikimedia.org/T410314) [16:50:03] (03PS2) 10Muehlenhoff: Add a new role for routed jumbo Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1247636 (https://phabricator.wikimedia.org/T410314) [16:53:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T418465)', diff saved to https://phabricator.wikimedia.org/P89731 and previous config saved to /var/cache/conftool/dbconfig/20260303-165327-marostegui.json [16:53:33] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [16:55:13] (03CR) 10Btullis: "Thanks. 
I believe that I've fixed those tests now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: 10Btullis) [16:56:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P89732 and previous config saved to /var/cache/conftool/dbconfig/20260303-165618-marostegui.json [16:56:41] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1190.eqiad.wmnet [16:57:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669168 (10ops-monitoring-bot) Host an-worker1190.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [16:57:09] (03PS2) 10Joal: Update HDFS RPC queue alerts [alerts] - 10https://gerrit.wikimedia.org/r/1247600 (https://phabricator.wikimedia.org/T418152) [16:59:13] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:59:19] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11669181 (10VRiley-WMF) a:03VRiley-WMF [17:00:04] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:03:07] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11669201 (10VRiley-WMF) Hey @Dzahn Here is a list of what is decommed and offline that we could use. https://netbox.... [17:04:51] (03CR) 10Joal: "tests fixed!" [alerts] - 10https://gerrit.wikimedia.org/r/1247600 (https://phabricator.wikimedia.org/T418152) (owner: 10Joal) [17:05:10] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp2043 is OK: HTTP OK: HTTP/1.0 200 OK - 36267 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:05:10] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp2043 is OK: HTTP OK: HTTP/1.1 200 OK - 48459 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:05:10] RECOVERY - Ensure traffic_server is running for instance backend on cp2043 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:06:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1190.eqiad.wmnet [17:06:43] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1191.eqiad.wmnet [17:07:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669226 (10ops-monitoring-bot) Host an-worker1191.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[17:08:29] elukey@cumin1003 provision (PID 3200208) is awaiting input [17:08:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P89733 and previous config saved to /var/cache/conftool/dbconfig/20260303-170835-marostegui.json [17:10:56] (03CR) 10Daniel Kinzler: [C:03+1] "The intent looks correct to me, as far as I can tell by squinting at the code." [puppet] - 10https://gerrit.wikimedia.org/r/1247034 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [17:11:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T418465)', diff saved to https://phabricator.wikimedia.org/P89734 and previous config saved to /var/cache/conftool/dbconfig/20260303-171126-marostegui.json [17:11:30] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:11:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2245.codfw.wmnet with reason: Maintenance [17:11:46] (03PS1) 10Kgraessle: Enable rr-ml AutoModerator CC Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) [17:11:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2245 (T418465)', diff saved to https://phabricator.wikimedia.org/P89735 and previous config saved to /var/cache/conftool/dbconfig/20260303-171149-marostegui.json [17:13:40] (03PS2) 10Santiago Faci: TestKitchen renaming (MetricsPlatform => TestKitchen) [puppet] - 10https://gerrit.wikimedia.org/r/1247551 (https://phabricator.wikimedia.org/T416865) [17:18:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1191.eqiad.wmnet [17:18:32] !log btullis@cumin1003 START - Cookbook 
sre.hosts.reboot-single for host an-worker1192.eqiad.wmnet [17:18:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669295 (10ops-monitoring-bot) Host an-worker1192.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [17:20:21] jouncebot: nowandnext [17:20:21] For the next 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1700) [17:20:21] In 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1800) [17:21:05] jeena: Is it OK if I slip out a backport to the new train (wmf.18)? It's i18n-touching so it'll take a while, so I don't want to wait for a normal window and take over everyone else's time. [17:23:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P89736 and previous config saved to /var/cache/conftool/dbconfig/20260303-172343-marostegui.json [17:24:13] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:25:36] scap's full localization updates should be 5 minutes faster today than they were last week. The container registry has had a backend change that seems to be working to fix a data consistency bug we were working around with a 5 minute sleep. [17:27:45] James_F: can't speak for jeena but i'd pretty much assume you'd be ok at this point in the day, given that there don't seem to be any blockers getting worked on or anything. [17:27:49] Ack. 
[17:27:57] (03CR) 10Jforrester: [C:03+2] Style fixes for copy-paste feature [extensions/WikiLambda] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247635 (https://phabricator.wikimedia.org/T414072) (owner: 10Jforrester) [17:28:00] Let's go for it. [17:28:33] Yes fine with me [17:28:40] elukey@cumin1003 provision (PID 3200208) is awaiting input [17:29:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Degraded RAID on an-worker1204 - https://phabricator.wikimedia.org/T414861#11669334 (10Jclark-ctr) a:05BTullis→03Jclark-ctr @BTullis [17:30:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1192.eqiad.wmnet [17:30:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1193.eqiad.wmnet [17:30:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669341 (10ops-monitoring-bot) Host an-worker1193.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [17:30:30] (03Merged) 10jenkins-bot: Style fixes for copy-paste feature [extensions/WikiLambda] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247635 (https://phabricator.wikimedia.org/T414072) (owner: 10Jforrester) [17:31:08] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065#11669348 (10Jclark-ctr) @BTullis @RKemper have you been able to look at this? 
[17:31:55] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1247635|Style fixes for copy-paste feature (T414072)]] [17:31:58] T414072: Design: Provide a design proposal for function call block clipboard dialog - https://phabricator.wikimedia.org/T414072 [17:32:10] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11669354 (10Jclark-ctr) @BTullis @RKemper have you been able to look at this? [17:34:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11669357 (10BCornwall) `perf` shows a lot of the drops related to ats and purged: ` [ET_NET 72] 747843 [025] 71562.924664: skb:kfree_skb: skbaddr=0xff4203b3c053fa00 rx_sk=(nil)... [17:37:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T418465)', diff saved to https://phabricator.wikimedia.org/P89738 and previous config saved to /var/cache/conftool/dbconfig/20260303-173756-marostegui.json [17:38:01] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [17:38:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T418465)', diff saved to https://phabricator.wikimedia.org/P89739 and previous config saved to /var/cache/conftool/dbconfig/20260303-173850-marostegui.json [17:39:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1248.eqiad.wmnet with reason: Maintenance [17:39:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T418465)', diff saved to https://phabricator.wikimedia.org/P89740 and previous config saved to /var/cache/conftool/dbconfig/20260303-173914-marostegui.json [17:41:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
an-worker1193.eqiad.wmnet [17:41:53] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1194.eqiad.wmnet [17:42:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669388 (10ops-monitoring-bot) Host an-worker1194.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [17:43:09] (03PS38) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [17:46:10] (03CR) 10Cyndywikime: [C:03+1] Enable new HTML confirmation emails for all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247566 (https://phabricator.wikimedia.org/T416748) (owner: 10Michael Große) [17:46:35] !log ariel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:47:25] !log ariel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:48:48] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [17:49:18] (03CR) 10CDobbins: prometheus: add pooled host check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [17:51:16] (03PS1) 10BCornwall: cp2047: Disable performance tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1247642 (https://phabricator.wikimedia.org/T418527) [17:51:43] (03CR) 10Ssingh: [C:03+1] "worth a shot" [puppet] - 10https://gerrit.wikimedia.org/r/1247642 (https://phabricator.wikimedia.org/T418527) (owner: 10BCornwall) [17:51:46] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1247635|Style fixes for copy-paste feature (T414072)]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:51:49] T414072: Design: Provide a design proposal for function call block clipboard dialog - https://phabricator.wikimedia.org/T414072 [17:52:31] !log jforrester@deploy2002 jforrester: Continuing with sync [17:52:47] (03CR) 10Ssingh: "yes thanks, I will do those in another commit." [dns] - 10https://gerrit.wikimedia.org/r/1247626 (owner: 10Ssingh) [17:53:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P89741 and previous config saved to /var/cache/conftool/dbconfig/20260303-175304-marostegui.json [17:53:08] PROBLEM - MD RAID on ms-be1096 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:53:09] ACKNOWLEDGEMENT - MD RAID on ms-be1096 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T418893 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:53:14] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1096 - https://phabricator.wikimedia.org/T418893 (10ops-monitoring-bot) 03NEW [17:53:52] !log ariel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:54:02] (03CR) 10BCornwall: [C:03+2] cp2047: Disable performance tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1247642 (https://phabricator.wikimedia.org/T418527) (owner: 10BCornwall) [17:55:18] !log ariel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:59:54] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-worker1194.eqiad.wmnet [17:59:57] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single 
for host an-worker1195.eqiad.wmnet [18:00:00] (03PS1) 10Joal: Update hadoop namenode JVM memory settings [puppet] - 10https://gerrit.wikimedia.org/r/1247643 (https://phabricator.wikimedia.org/T418551) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1800) [18:00:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669482 (10ops-monitoring-bot) Host an-worker1195.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [18:01:28] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#11669487 (10Andrew) 05Open→03Resolved [18:02:17] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [18:02:18] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1096.eqiad.wmnet with OS bullseye [18:02:27] (03CR) 10Mooeypoo: [C:03+1] REST: show the beta Attribution API in the REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [18:03:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [18:03:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T418465)', diff saved to https://phabricator.wikimedia.org/P89742 and previous config saved to /var/cache/conftool/dbconfig/20260303-180352-marostegui.json [18:03:56] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from 
cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:04:25] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [18:04:49] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247635|Style fixes for copy-paste feature (T414072)]] (duration: 32m 54s) [18:04:52] T414072: Design: Provide a design proposal for function call block clipboard dialog - https://phabricator.wikimedia.org/T414072 [18:04:55] All done. [18:05:06] jeena, brennen: Thanks! [18:08:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P89743 and previous config saved to /var/cache/conftool/dbconfig/20260303-180814-marostegui.json [18:11:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1195.eqiad.wmnet [18:11:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1196.eqiad.wmnet [18:12:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669543 (10ops-monitoring-bot) Host an-worker1196.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [18:13:31] (03CR) 10C. Scott Ananian: "(This has to wait until the de/fr/pl translations roll out on the wmf.18 train.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247119 (https://phabricator.wikimedia.org/T414852) (owner: 10C. Scott Ananian) [18:14:14] (03CR) 10AOkoth: "Ack." 
[puppet] - 10https://gerrit.wikimedia.org/r/1247608 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [18:17:04] (03CR) 10BCornwall: "This can be abandoned as it was already handled in I5a6b7a9082d06e7b7cca0beaf5b73d79ad70cabc" [puppet] - 10https://gerrit.wikimedia.org/r/1247113 (owner: 10CDobbins) [18:17:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P89744 and previous config saved to /var/cache/conftool/dbconfig/20260303-181859-marostegui.json [18:19:37] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1096.eqiad.wmnet with reason: host reimage [18:21:56] (03CR) 10Aaron Schulz: [C:03+1] REST: show the beta Attribution API in the REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [18:23:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T418465)', diff saved to https://phabricator.wikimedia.org/P89745 and previous config saved to /var/cache/conftool/dbconfig/20260303-182321-marostegui.json [18:23:26] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:23:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2246.codfw.wmnet with reason: Maintenance [18:23:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1096.eqiad.wmnet with reason: host reimage [18:23:47] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2246 (T418465)', diff saved to https://phabricator.wikimedia.org/P89746 and previous config saved to /var/cache/conftool/dbconfig/20260303-182346-marostegui.json [18:24:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1196.eqiad.wmnet [18:24:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1197.eqiad.wmnet [18:24:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669612 (10ops-monitoring-bot) Host an-worker1197.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [18:24:45] (03PS1) 10C. Scott Ananian: Localisation updates from https://translatewiki.net. [extensions/ParserMigration] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247648 [18:29:38] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2047.codfw.wmnet with OS trixie [18:31:58] (03PS1) 10Jforrester: [DNM] Create Abstract Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247650 (https://phabricator.wikimedia.org/T411725) [18:32:02] (03PS1) 10Mmartorana: Enable confirmemail logstash channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247651 (https://phabricator.wikimedia.org/T415902) [18:32:07] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899 (10RobH) 03NEW [18:32:26] (03PS1) 10Aaron Schulz: Remove redundant mw-extra wgRestSandboxSpecs entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247652 [18:32:27] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11669674 (10RobH) a:03Dzahn Please update the site.pp file with the insetup role for your team (detailed on 
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the... [18:32:44] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11669681 (10RobH) [18:33:40] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#11669689 (10RobH) [18:34:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P89747 and previous config saved to /var/cache/conftool/dbconfig/20260303-183406-marostegui.json [18:34:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11669694 (10RobH) [18:34:48] (03CR) 10C. Scott Ananian: "Backport to ensure ru localization is on the train." [extensions/ParserMigration] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247648 (owner: 10C. Scott Ananian) [18:35:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11669697 (10RobH) [18:36:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1197.eqiad.wmnet [18:36:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1198.eqiad.wmnet [18:36:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669700 (10ops-monitoring-bot) Host an-worker1198.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[18:38:51] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901 (10RobH) 03NEW [18:39:21] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11669735 (10RobH) [18:39:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11669737 (10RobH) a:03MatthewVernon Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki... [18:41:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902 (10RobH) 03NEW [18:41:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11669753 (10RobH) a:03MatthewVernon Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki... 
[18:41:57] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install apus-be200[56] - https://phabricator.wikimedia.org/T418902#11669761 (10RobH) [18:43:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[34] - https://phabricator.wikimedia.org/T418903 (10RobH) 03NEW [18:44:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[34] - https://phabricator.wikimedia.org/T418903#11669782 (10RobH) [18:44:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[34] - https://phabricator.wikimedia.org/T418903#11669794 (10RobH) a:03MoritzMuehlenhoff Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-op... [18:45:00] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1096.eqiad.wmnet with OS bullseye [18:46:41] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905 (10RobH) 03NEW [18:47:02] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11669815 (10RobH) [18:47:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11669818 (10Jclark-ctr) @MatthewVernon sorry again. i think i have it right now if you want to look again when you get a chance [18:47:25] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11669819 (10RobH) a:03Dzahn Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the... 
[18:47:43] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2047.codfw.wmnet with reason: host reimage [18:48:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T418465)', diff saved to https://phabricator.wikimedia.org/P89749 and previous config saved to /var/cache/conftool/dbconfig/20260303-184815-marostegui.json [18:48:19] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [18:49:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T418465)', diff saved to https://phabricator.wikimedia.org/P89750 and previous config saved to /var/cache/conftool/dbconfig/20260303-184913-marostegui.json [18:49:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907 (10RobH) 03NEW [18:49:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1249.eqiad.wmnet with reason: Maintenance [18:49:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11669866 (10RobH) [18:49:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T418465)', diff saved to https://phabricator.wikimedia.org/P89751 and previous config saved to /var/cache/conftool/dbconfig/20260303-184937-marostegui.json [18:49:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1198.eqiad.wmnet [18:49:57] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1199.eqiad.wmnet [18:50:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install pc202[1-4] - https://phabricator.wikimedia.org/T418907#11669869 (10RobH) a:03Marostegui Please update the site.pp file with the insetup 
role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add th... [18:50:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11669875 (10ops-monitoring-bot) Host an-worker1199.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [18:51:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908 (10RobH) 03NEW [18:51:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11669888 (10RobH) a:03Marostegui Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add th... [18:52:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc102[1-4] - https://phabricator.wikimedia.org/T418908#11669897 (10RobH) [18:53:34] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2047.codfw.wmnet with reason: host reimage [18:55:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909 (10RobH) 03NEW [18:55:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:56:11] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1096 - https://phabricator.wikimedia.org/T418893#11669919 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Server was just reimaged ` jclark@ms-be1096:~$ cat 
/proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [rai... [18:56:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11669923 (10RobH) [18:56:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11669925 (10RobH) a:03Marostegui Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add... [18:56:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:59:15] PROBLEM - Host ms-be1096 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:34] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911 (10RobH) 03NEW [18:59:47] RECOVERY - Host ms-be1096 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [19:00:05] jeena and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1900). [19:00:30] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11669966 (10RobH) a:03Marostegui @Marostegui, We didn't get racking details on the parent ordering task but I didn't want to block ordering for that so I've filed t... 
[19:00:40] jeena: o/ [19:00:43] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 4 new db hosts in codfw - https://phabricator.wikimedia.org/T418911#11669975 (10RobH) [19:01:00] hi dduvall [19:01:06] I will proceed shortly [19:03:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P89752 and previous config saved to /var/cache/conftool/dbconfig/20260303-190323-marostegui.json [19:08:01] (03PS1) 10Joal: Update HDFS total FIles Heap alert [alerts] - 10https://gerrit.wikimedia.org/r/1247658 (https://phabricator.wikimedia.org/T418551) [19:08:23] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247659 (https://phabricator.wikimedia.org/T413809) [19:08:25] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247659 (https://phabricator.wikimedia.org/T413809) (owner: 10TrainBranchBot) [19:09:20] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247659 (https://phabricator.wikimedia.org/T413809) (owner: 10TrainBranchBot) [19:10:09] PROBLEM - Host an-worker1199 is DOWN: PING CRITICAL - Packet loss = 100% [19:10:24] jouncebot: nowandnext [19:10:24] For the next 1 hour(s) and 49 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T1900) [19:10:24] In 1 hour(s) and 49 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T2100) [19:11:39] (03PS1) 10Dzahn: site: add phab1006 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1247660 (https://phabricator.wikimedia.org/T418905) [19:13:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T418465)', diff saved to https://phabricator.wikimedia.org/P89753 and previous config saved to 
/var/cache/conftool/dbconfig/20260303-191312-marostegui.json [19:13:16] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:14:13] FIRING: [2x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:15:56] (03CR) 10Btullis: [C:03+2] Update HDFS total FIles Heap alert [alerts] - 10https://gerrit.wikimedia.org/r/1247658 (https://phabricator.wikimedia.org/T418551) (owner: 10Joal) [19:16:49] (03CR) 10Ssingh: [C:03+2] profile::dns::recursor: Unconditionally enable the webserver [puppet] - 10https://gerrit.wikimedia.org/r/1243751 (owner: 10Muehlenhoff) [19:17:09] (03Merged) 10jenkins-bot: Update HDFS total FIles Heap alert [alerts] - 10https://gerrit.wikimedia.org/r/1247658 (https://phabricator.wikimedia.org/T418551) (owner: 10Joal) [19:17:12] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2047.codfw.wmnet with OS trixie [19:18:19] (03CR) 10Dzahn: [C:03+2] site: add phab1006 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1247660 (https://phabricator.wikimedia.org/T418905) (owner: 10Dzahn) [19:18:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P89754 and previous config saved to /var/cache/conftool/dbconfig/20260303-191830-marostegui.json [19:19:14] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.18 refs T413809 [19:19:17] T413809: 1.46.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T413809 [19:19:57] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11670036 (10RobH) [19:20:23] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, 
13Patch-For-Review: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11670039 (10Dzahn) a:05Dzahn→03None Thank. Done! Added to site.pp with insetup role. preseed.yml already covered by a wildcard. [19:25:05] 10ops-codfw, 06SRE, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914 (10RobH) 03NEW [19:25:27] 10ops-codfw, 06SRE, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11670072 (10RobH) [19:25:57] 10ops-codfw, 06SRE, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11670073 (10RobH) a:03Scott_French Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operat... [19:26:42] (03Abandoned) 10Jforrester: mcrouter: Allow configuring secondary replicated caches [puppet] - 10https://gerrit.wikimedia.org/r/1229229 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [19:26:47] (03Abandoned) 10Jforrester: [WIP] mcrouter: Configure the Wikifunctions pool as replicated [puppet] - 10https://gerrit.wikimedia.org/r/1229230 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [19:26:55] (03Abandoned) 10Jforrester: [DNM] memcached: Drop the local-only Wikifunctions cache route [puppet] - 10https://gerrit.wikimedia.org/r/1229231 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [19:26:59] 06SRE, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: conf200[7-9] implementation tracking - https://phabricator.wikimedia.org/T418915 (10RobH) 03NEW [19:28:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P89755 and previous config saved to /var/cache/conftool/dbconfig/20260303-192820-marostegui.json [19:28:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Network drop 
errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11670094 (10Papaul) @BCornwall I did some testing today to eliminate the fact that it is the Network card that is causing the issue by putting in another 10G network card in cp20... [19:29:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916 (10RobH) 03NEW [19:30:02] (03CR) 10Aaron Schulz: [C:03+1] REST: show the beta Attribution API in the REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [19:30:10] (03CR) 10Catrope: [C:03+1] Enable confirmemail logstash channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247651 (https://phabricator.wikimedia.org/T415902) (owner: 10Mmartorana) [19:30:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11670114 (10RobH) [19:30:26] (03CR) 10BCornwall: [C:03+1] "Great work! Just a nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [19:30:30] (03PS1) 10Dzahn: site: add phab2003 with collab insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1247665 (https://phabricator.wikimedia.org/T418899) [19:31:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb101[56] - https://phabricator.wikimedia.org/T418916#11670136 (10RobH) a:03Clement_Goubert @Clement_Goubert, I have made some assumptions in racking details. The parent order is already in approval... 
[19:32:43] (03CR) 10Aaron Schulz: [C:03+1] REST: show the beta Attribution API in the REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [19:32:50] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918 (10RobH) 03NEW [19:32:55] (03CR) 10Dzahn: [C:03+2] site: add phab2003 with collab insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1247665 (https://phabricator.wikimedia.org/T418899) (owner: 10Dzahn) [19:33:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T418465)', diff saved to https://phabricator.wikimedia.org/P89756 and previous config saved to /var/cache/conftool/dbconfig/20260303-193338-marostegui.json [19:33:42] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:33:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2247.codfw.wmnet with reason: Maintenance [19:33:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2247 (T418465)', diff saved to https://phabricator.wikimedia.org/P89757 and previous config saved to /var/cache/conftool/dbconfig/20260303-193351-marostegui.json [19:34:05] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install phab2003 - https://phabricator.wikimedia.org/T418899#11670161 (10Dzahn) a:05Dzahn→03None Thank you. Done! Added to site.pp with insetup role. preseed.yml already covered by a wildcard. 
[19:36:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11670175 (10JerryWang-WMF) Approved. Thanks [19:38:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[45] - https://phabricator.wikimedia.org/T418919 (10RobH) 03NEW [19:39:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[45] - https://phabricator.wikimedia.org/T418919#11670190 (10RobH) a:03Clement_Goubert @Clement_Goubert, I've assumed the racking details, please double check them for accuracy and pro... [19:39:45] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-ctrl100[45] implementation tracking - https://phabricator.wikimedia.org/T418920 (10RobH) 03NEW [19:40:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-ctrl100[45] - https://phabricator.wikimedia.org/T418919#11670220 (10RobH) [19:42:23] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp2043.codfw.wmnet [19:42:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2043.codfw.wmnet [19:42:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11670224 (10BTullis) >>! In T416066#11669353, @Jclark-ctr wrote: > @BTullis @RKemper have you been able to look at this? Sorry for the delay. I've been looking at this today, as part of updating the Serv... 
[19:43:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P89758 and previous config saved to /var/cache/conftool/dbconfig/20260303-194327-marostegui.json [19:49:17] (03PS3) 10Aaron Schulz: REST: show the beta Attribution API in the REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [19:50:30] (03PS2) 10Aaron Schulz: Remove redundant mw-extra wgRestSandboxSpecs entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247652 [19:51:18] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp7008.magru.wmnet with OS trixie [19:53:52] (03PS1) 10Ebernhardson: semantic: Update image to opensearch 3.5.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247669 [19:57:44] (03CR) 10DLynch: "We'd still need to backport if we want to give this to people early next week. We can wait to bundle it all into one patch, though." 
[extensions/VisualEditor] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247434 (https://phabricator.wikimedia.org/T414987) (owner: 10Medelius) [19:58:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T418465)', diff saved to https://phabricator.wikimedia.org/P89759 and previous config saved to /var/cache/conftool/dbconfig/20260303-195835-marostegui.json [19:58:40] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [19:58:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1252.eqiad.wmnet with reason: Maintenance [19:59:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T418465)', diff saved to https://phabricator.wikimedia.org/P89760 and previous config saved to /var/cache/conftool/dbconfig/20260303-195900-marostegui.json [19:59:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T418465)', diff saved to https://phabricator.wikimedia.org/P89761 and previous config saved to /var/cache/conftool/dbconfig/20260303-195916-marostegui.json [20:03:38] (03CR) 10Bking: semantic: Update image to opensearch 3.5.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247669 (owner: 10Ebernhardson) [20:09:29] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922 (10RobH) 03NEW [20:09:47] (03PS2) 10Ebernhardson: semantic: Update image to opensearch 3.5.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247669 [20:09:50] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11670345 (10RobH) [20:09:59] !log btullis@cumin1003 END (FAIL) - Cookbook 
sre.hosts.reboot-single (exit_code=99) for host an-worker1199.eqiad.wmnet [20:11:28] (03CR) 10Ebernhardson: semantic: Update image to opensearch 3.5.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247669 (owner: 10Ebernhardson) [20:11:43] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11670349 (10RobH) a:03Clement_Goubert @Clement_Goubert, Please double check the racking details in the task description as I made some assumption... [20:12:13] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb201[34] implementation tracking - https://phabricator.wikimedia.org/T418924 (10RobH) 03NEW [20:12:46] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418825#11670383 (10VRiley-WMF) 05Open→03Resolved This seems like it happened due to a loose cable connection. Will monitor, but if it happens again, we may want to swap t... 
[20:14:13] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:14:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P89762 and previous config saved to /var/cache/conftool/dbconfig/20260303-201423-marostegui.json [20:17:02] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7008.magru.wmnet with reason: host reimage [20:18:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-worker23[57-74] - https://phabricator.wikimedia.org/T418925 (10RobH) 03NEW [20:19:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-worker23[57-74] - https://phabricator.wikimedia.org/T418925#11670413 (10RobH) [20:20:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install wikikube-worker23[57-74] - https://phabricator.wikimedia.org/T418925#11670437 (10RobH) a:03Clement_Goubert @Clement_Goubert, We didn't have hostnames on the ordering task racking info, so I assumed th... 
[20:21:05] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-worker23[57-74] implementation tracking - https://phabricator.wikimedia.org/T418927 (10RobH) 03NEW [20:23:04] (03CR) 10Bking: [C:03+2] semantic: Update image to opensearch 3.5.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247669 (owner: 10Ebernhardson) [20:24:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T418465)', diff saved to https://phabricator.wikimedia.org/P89763 and previous config saved to /var/cache/conftool/dbconfig/20260303-202447-marostegui.json [20:24:51] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:24:51] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7008.magru.wmnet with reason: host reimage [20:29:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P89764 and previous config saved to /var/cache/conftool/dbconfig/20260303-202931-marostegui.json [20:34:23] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:34:41] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:36:39] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928 (10RobH) 03NEW [20:37:01] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11670500 (10RobH) [20:37:39] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11670501 (10RobH) a:03Jgreen Jeff, I made assumptions on this since we didn't have racking details on the 
parent ordering task. Please doublecheck all racking details. [20:39:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P89765 and previous config saved to /var/cache/conftool/dbconfig/20260303-203954-marostegui.json [20:41:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929 (10RobH) 03NEW [20:42:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11670529 (10RobH) [20:42:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11670530 (10RobH) a:03herron Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and ad... [20:43:07] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11670534 (10Krinkle) >>! In T418745#11667612, @Ladsgroup wrote: >>>! In T418745#... 
[20:44:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T418465)', diff saved to https://phabricator.wikimedia.org/P89766 and previous config saved to /var/cache/conftool/dbconfig/20260303-204439-marostegui.json [20:44:43] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [20:44:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2248.codfw.wmnet with reason: Maintenance [20:44:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2248 (T418465)', diff saved to https://phabricator.wikimedia.org/P89767 and previous config saved to /var/cache/conftool/dbconfig/20260303-204452-marostegui.json [20:45:08] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931 (10RobH) 03NEW [20:45:34] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#11670574 (10RobH) a:03herron Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and ad... [20:45:56] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#11670582 (10RobH) [20:48:04] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11670603 (10Dzahn) Thank you @VRiley-WMF Gotcha about the filter, makes sense. I took a look and started sorting... 
[20:49:26] (03CR) 10Jforrester: REST: show the beta Attribution API in the REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [20:50:11] (03CR) 10Gehel: [C:03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/1247600 (https://phabricator.wikimedia.org/T418152) (owner: 10Joal) [20:51:46] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7008.magru.wmnet with OS trixie [20:55:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P89768 and previous config saved to /var/cache/conftool/dbconfig/20260303-205502-marostegui.json [20:55:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11670617 (10BCornwall) We reimaged cp7008 to see if the other hardware would also be affected. [[ https://grafana-rw.wikimedia.org/d/000000377/host-overview?forceLogin=true&from=... 
[20:58:36] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp2045.codfw.wmnet with reason: troubleshooting for T418527 [20:58:39] T418527: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527 [20:59:13] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:00:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247652 (owner: 10Aaron Schulz) [21:01:04] (03PS1) 10RLazarus: _mediawiki-common_: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247677 (https://phabricator.wikimedia.org/T411807) [21:01:06] (03PS1) 10RLazarus: wikifunctions and friends: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247678 (https://phabricator.wikimedia.org/T411807) [21:01:51] (03PS2) 10RLazarus: mw-debug: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245162 (https://phabricator.wikimedia.org/T411807) [21:01:53] any deployers available for the window? I have 2 config patches, and am here 👋 [21:02:16] I can deploy [21:02:29] much obliged! woot [21:02:42] I wonder what happened to the automated message! [21:02:56] ... are you an automated message? [21:03:02] maybe you are and you haven't realized. 
[21:03:04] :D [21:03:05] jeencebot [21:03:08] hahaha [21:03:15] jeenai [21:03:20] oh no [21:03:21] jouncebot: now [21:03:21] For the next 0 hour(s) and 56 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260303T2100) [21:04:01] weird indeed that it didn't speak up on the hour [21:04:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T418465)', diff saved to https://phabricator.wikimedia.org/P89769 and previous config saved to /var/cache/conftool/dbconfig/20260303-210407-marostegui.json [21:04:11] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:04:28] is the one that pings all the people on the calendar also jouncebot? I forget [21:04:48] it is, yeah [21:04:52] mooeypoo: shall I deploy both yours at the same time? [21:05:07] It's a different test for each but they're also super small so up to you? [21:05:56] okay [21:07:59] they're both in Special:RestSandbox so I'm just checking "is it still here" and "is this new thing now here" [21:08:18] mooeypoo: I noticed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1244748 has an unresolved comment on it, is it fine to go forward? 
[21:08:39] I'll start https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1247652 in the meantime [21:08:56] Yes, James is right that we should remove when the task is done, but it is good to go [21:09:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247652 (owner: 10Aaron Schulz) [21:10:01] 👍 [21:10:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T418465)', diff saved to https://phabricator.wikimedia.org/P89770 and previous config saved to /var/cache/conftool/dbconfig/20260303-211009-marostegui.json [21:10:13] +1 [21:10:14] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:10:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1260.eqiad.wmnet with reason: Maintenance [21:10:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T418465)', diff saved to https://phabricator.wikimedia.org/P89771 and previous config saved to /var/cache/conftool/dbconfig/20260303-211033-marostegui.json [21:11:00] (03Merged) 10jenkins-bot: Remove redundant mw-extra wgRestSandboxSpecs entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247652 (owner: 10Aaron Schulz) [21:11:31] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1247652|Remove redundant mw-extra wgRestSandboxSpecs entry]] [21:12:12] ok tested it on mwdebug -- still showing what it needs to show on enwiki and testwiki. Good to go [21:13:33] !log jhuneidi@deploy2002 jhuneidi, aaron: Backport for [[gerrit:1247652|Remove redundant mw-extra wgRestSandboxSpecs entry]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:14:06] yes, sorry, jumped the gun there. 
I tested again, good. [21:14:18] okay cool [21:14:31] !log jhuneidi@deploy2002 jhuneidi, aaron: Continuing with sync [21:14:54] 06SRE, 10Wikimedia-Mailing-lists: Some messages on wikitech-l seem to lack an x-spam-score header - https://phabricator.wikimedia.org/T386559#11670722 (10bd808) https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/LNEA5MZV6IUKCTZPRBJ2BP2WXYHA4TA6/ seems to be another message that i... [21:18:27] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247652|Remove redundant mw-extra wgRestSandboxSpecs entry]] (duration: 06m 56s) [21:19:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P89772 and previous config saved to /var/cache/conftool/dbconfig/20260303-211915-marostegui.json [21:21:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [21:22:50] 10ops-eqiad, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T418939 (10phaultfinder) 03NEW [21:27:44] (03Merged) 10jenkins-bot: REST: show the beta Attribution API in the REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244748 (https://phabricator.wikimedia.org/T418522) (owner: 10BPirkle) [21:28:02] (03PS2) 10Bking: dse-k8s-ingress: Enable active-active [puppet] - 10https://gerrit.wikimedia.org/r/1238832 (https://phabricator.wikimedia.org/T396478) [21:28:14] !log jhuneidi@deploy2002 Started scap sync-world: Backport for [[gerrit:1244748|REST: show the beta Attribution API in the REST Sandbox (T418522)]] [21:28:17] T418522: Include Attribution API beta module in the REST Sandbox on test wiki - https://phabricator.wikimedia.org/T418522 [21:28:54] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-magru:et-0/0/1 (Core: asw1-b3-magru:et-0/0/50 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:29:44] (03PS3) 10Bking: dse-k8s-ingress: Enable active-active [puppet] - 10https://gerrit.wikimedia.org/r/1238832 (https://phabricator.wikimedia.org/T396478) [21:30:15] !log jhuneidi@deploy2002 jhuneidi, bpirkle: Backport for [[gerrit:1244748|REST: show the beta Attribution API in the REST Sandbox (T418522)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:31:21] woot, works! [21:31:35] thanks jeena! [21:31:56] this can go on to its merry prod way [21:32:00] yay thanks! [21:32:05] !log jhuneidi@deploy2002 jhuneidi, bpirkle: Continuing with sync [21:32:43] I just realized I celebrated "woo it works" as if I'm surprised. We *knew* the patch was going to work. I was just excitedly validating.... you know. with excitement. 
[21:33:43] (03CR) 10Herron: [C:03+2] mwlog: add trixie hosts to udp tee [puppet] - 10https://gerrit.wikimedia.org/r/1247630 (https://phabricator.wikimedia.org/T417002) (owner: 10Herron) [21:33:46] LOL [21:34:07] I had confidence in you anyway [21:34:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P89773 and previous config saved to /var/cache/conftool/dbconfig/20260303-213423-marostegui.json [21:35:54] !log jhuneidi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244748|REST: show the beta Attribution API in the REST Sandbox (T418522)]] (duration: 07m 41s) [21:35:58] T418522: Include Attribution API beta module in the REST Sandbox on test wiki - https://phabricator.wikimedia.org/T418522 [21:36:57] haha good [21:37:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T418465)', diff saved to https://phabricator.wikimedia.org/P89774 and previous config saved to /var/cache/conftool/dbconfig/20260303-213739-marostegui.json [21:37:43] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:37:45] woo! okay, thank you! i'm'a move along now to notify our PM that she can play around with the newly available stuff. [21:37:56] you're welcome! [21:39:54] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [21:39:58] OK, is prod free for some SRE deploy-age by rzl? [21:40:03] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [21:42:08] 👋 yep grabbing the floor if that's okay [21:43:02] James_F: if you're still happy with it will you re-+1 https://gerrit.wikimedia.org/r/1245162 after I split it up? then I'll start it rolling [21:43:15] Sure. 
[21:43:25] (03CR) 10Jforrester: [C:03+1] mw-debug: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245162 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [21:43:34] (03CR) 10RLazarus: [C:03+2] mw-debug: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245162 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [21:44:00] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 07Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11670845 (10AlexisJazz) On 28 February at 13:16 there was another bump in 5xx haproxy errors paired with... [21:44:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic: Network drop errors with new codfw cp hosts - https://phabricator.wikimedia.org/T418527#11670848 (10BCornwall) 05Open→03Resolved a:03BCornwall @BBlack discovered the source of the issue: LLDP packets are being dropped by lldpd. See T418941 for a follow-up.... 
[21:45:45] (03Merged) 10jenkins-bot: mw-debug: Add /*/wf-wan memcache routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245162 (https://phabricator.wikimedia.org/T411807) (owner: 10RLazarus) [21:47:55] (03CR) 10D3r1ck01: [C:03+1] Enable JWT session cookie for bot passwords (all wikis) (attempt #2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247596 (https://phabricator.wikimedia.org/T415007) (owner: 10Gergő Tisza) [21:48:39] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp2045.codfw.wmnet [21:48:40] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2045.codfw.wmnet [21:48:54] FIRING: [2x] SLOMetricAbsent: wdqs-scholarly-availability codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [21:48:57] looking at the helmfile diffs now -- I see the new mcrouter routes in eqiad/mw-debug-next and eqiad/mw-debug-pinkunicorn but no changes for other targets, which is correct [21:49:19] I *don't* see the same changes for codfw/mw-debug-*, which is sketchy, looking [21:49:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T418465)', diff saved to https://phabricator.wikimedia.org/P89775 and previous config saved to /var/cache/conftool/dbconfig/20260303-214931-marostegui.json [21:49:35] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465 [21:50:09] here's what I'm looking at https://www.irccloud.com/pastebin/tqloLnGT/ [21:50:15] It's possibleHmm. [21:50:38] Is mw-debug-* defined for codfw? [21:50:56] Or rather, is it defined as empty somehow? [21:50:57] sorry rzl James_F, if you were asking me about deploying, then yes, all backports are finished [21:51:03] jeena: Thanks! 
[21:51:16] James_F: mw-debug does exist there fully-fledged, yeah [21:51:21] Hmm. Odd. [21:52:37] oh weird, but the mediawiki-{next,pinkunicorn}-mcrouter-config configmap *doesn't* exist [21:52:44] How Did This Ever Work [21:52:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P89776 and previous config saved to /var/cache/conftool/dbconfig/20260303-215247-marostegui.json [21:52:48] Ah. [21:52:50] Maybe it didn't? [21:53:04] There's an awful lot of complexity around the mw-* flavours. [21:53:48] oh man, the mw-debug pods exist in codfw but they just don't run mcrouter at all [21:54:03] okay, let's address that later -- for now we can at least check that this does the right thing in eqiad [21:54:07] Ack. [21:54:18] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1245162 T411807 [21:54:22] T411807: WF memcached service is dc-local but used for dc-global content - https://phabricator.wikimedia.org/T411807 [21:55:17] !log rzl@deploy2002 rzl: https://gerrit.wikimedia.org/r/1245162 T411807 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:55:56] (for non-wikifunctions memcache usages we depend on the separate mcrouter that runs as a daemonset, not as a sidecar container, so it isn't crazy that we didn't notice this -- but the two DCs should match each other one way or the other) [21:56:08] True. [21:56:47] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [21:56:56] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [21:56:57] OK, in manual testing on k8s-mwdebug-eqiad I can confirm that WF memcached behaviour is still working. [21:57:10] and I do see mcrouter picked up the new config there, great [21:57:10] (Which we anticipated.) 
[21:58:01] okay, scap "continue with sync" is a no-op since this is mw-debug only, I'm going to let it run in order to unblock the 22 UTC window in case it's in use today [21:58:03] !log rzl@deploy2002 rzl: Continuing with sync [21:58:09] And the caches are still split, if I switch to k8s-mwdebug-codfw it has a memcached miss at first, and a hit but with a different timestamp. [21:58:38] I doubt the Web window will be used, but good to be clean, yes. [21:58:57] yeah, and that "still split" is expected since we're still using the /local/wf route, right? [21:59:04] Yup. [21:59:07] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1245162 T411807 (duration: 12m 15s) [21:59:11] Just confirming that behaviour is unchanged. [21:59:15] yep, perfect [21:59:25] okay I'll hold until :05 just in case, but I agree it's probably ours to play with [21:59:37] Should I write the mc.php change? [21:59:46] to try with mw-experimental? yeah, sounds good to me [21:59:56] I'll see about getting a mcrouter configmap placed there too [22:01:56] okay, we have no in-pod mcrouter in mw-experimental either, just like mw-debug in eqiad (and for that matter, just like production mw deployments apart from mw-wikifunctions) [22:02:19] but we don't want to mess with the shared daemonset mcrouter config at this point, so let me see if we have an easy way to switch on a sidecar one [22:02:43] *just like mw-debug in codwf [22:02:46] **codfw jeez [22:02:48] Ack. 
[22:07:29] cool, https://gerrit.wikimedia.org/r/1093390 is why eqiad and codfw are different -- we can easily do the same in mw-experimental, so why don't I do that for now, we can experiment, and I'll just roll it back after [22:07:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P89777 and previous config saved to /var/cache/conftool/dbconfig/20260303-220754-marostegui.json [22:09:14] (03PS1) 10Jforrester: mc: Shift the Wikifunctions MC route from /local/wf/ to //wf-wan/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247687 (https://phabricator.wikimedia.org/T411807) [22:09:36] I *think* that'll configure MC for our new route. It'll also break our old one for MW things, though. [22:09:44] (03PS1) 10Bking: opensearch-test: update to latest OpenSearch 3 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247688 (https://phabricator.wikimedia.org/T418388) [22:11:04] (03PS1) 10Gergő Tisza: Do not invalidate anon sessions with non-anon JWT cookies [core] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1247689 (https://phabricator.wikimedia.org/T415007) [22:11:26] (03PS1) 10Gergő Tisza: Do not invalidate anon sessions with non-anon JWT cookies [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247690 (https://phabricator.wikimedia.org/T415007) [22:12:22] rzl: can you ping me when you are done? 
I have a late addition to the backport window
[22:12:52] tgr_: you can go ahead and play through, we'll need a few more minutes
[22:13:05] (CR) Ryan Kemper: [C:+1] opensearch-test: update to latest OpenSearch 3 image [deployment-charts] - https://gerrit.wikimedia.org/r/1247688 (https://phabricator.wikimedia.org/T418388) (owner: Bking)
[22:13:14] (CR) Bking: [C:+2] opensearch-test: update to latest OpenSearch 3 image [deployment-charts] - https://gerrit.wikimedia.org/r/1247688 (https://phabricator.wikimedia.org/T418388) (owner: Bking)
[22:14:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1247689 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:14:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247690 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:14:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247596 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:15:16] (Merged) jenkins-bot: opensearch-test: update to latest OpenSearch 3 image [deployment-charts] - https://gerrit.wikimedia.org/r/1247688 (https://phabricator.wikimedia.org/T418388) (owner: Bking)
[22:16:41] thx
[22:16:58] (testing my mw-experimental charts change now)
[22:17:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:19:05] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1247689 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:19:05] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247690 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:19:05] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247596 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:21:19] ops-eqiad, SRE, DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11670992 (VRiley-WMF) a:VRiley-WMF
[22:23:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T418465)', diff saved to https://phabricator.wikimedia.org/P89778 and previous config saved to /var/cache/conftool/dbconfig/20260303-222301-marostegui.json
[22:23:05] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[22:23:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1261.eqiad.wmnet with reason: Maintenance
[22:23:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1261 (T418465)', diff saved to https://phabricator.wikimedia.org/P89779 and previous config saved to /var/cache/conftool/dbconfig/20260303-222324-marostegui.json
[22:26:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[22:26:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[22:27:30] ops-eqiad, SRE, DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11671018 (VRiley-WMF) Hey @Clement_Goubert it looks like the interface needs to be replaced. Is there a good time to have a little bit of downtime for this unit?
[22:29:24] (PS1) RLazarus: mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1247694 (https://phabricator.wikimedia.org/T411807)
[22:30:05] (PS2) RLazarus: mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1247694 (https://phabricator.wikimedia.org/T411807)
[22:33:37] (Merged) jenkins-bot: Do not invalidate anon sessions with non-anon JWT cookies [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1247689 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:33:45] (Merged) jenkins-bot: Do not invalidate anon sessions with non-anon JWT cookies [core] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1247690 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:35:15] (CR) Jforrester: [C:+1] mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1247694 (https://phabricator.wikimedia.org/T411807) (owner: RLazarus)
[22:36:23] (CR) Gergő Tisza: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247596 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:37:04] James_F: ^ that still doesn't quite have the desired effect, in the diff I see local mcrouter still turned off, working on why
[22:37:25] Right. Note that it looks like mw-debug has most of this already set up, if that'd be easier to test?
[22:37:50] Both /eqiad/mw and /local/wf routes, etc.
[22:38:24] (CR) Gergő Tisza: [C:+2] Enable JWT session cookie for bot passwords (all wikis) (attempt #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1247596 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:39:04] SRE, SRE-Access-Requests: Requesting update of Raymond Ndibe's SSH key to Yubikey-backed key - https://phabricator.wikimedia.org/T417594#11671055 (Raymond_Ndibe) >>! In T417594#11632621, @MatthewVernon wrote: > @Raymond_Ndibe this is done - give it half an hour for puppet to run everywhere, and you s...
[22:40:02] (Merged) jenkins-bot: Enable JWT session cookie for bot passwords (all wikis) (attempt #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1247596 (https://phabricator.wikimedia.org/T415007) (owner: Gergő Tisza)
[22:40:20] (got it, one sec)
[22:40:36] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1247689|Do not invalidate anon sessions with non-anon JWT cookies (T415007)]], [[gerrit:1247690|Do not invalidate anon sessions with non-anon JWT cookies (T415007)]], [[gerrit:1247596|Enable JWT session cookie for bot passwords (all wikis) (attempt #2) (T415007)]]
[22:40:39] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007
[22:42:43] !log tgr@deploy2002 tgr: Backport for [[gerrit:1247689|Do not invalidate anon sessions with non-anon JWT cookies (T415007)]], [[gerrit:1247690|Do not invalidate anon sessions with non-anon JWT cookies (T415007)]], [[gerrit:1247596|Enable JWT session cookie for bot passwords (all wikis) (attempt #2) (T415007)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:43:38] (PS3) RLazarus: mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1247694 (https://phabricator.wikimedia.org/T411807)
[22:44:11] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[22:44:22] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[22:45:06] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11671101 (Ladsgroup) I like the idea. +2'ed the patch. Will backport all three...
[22:45:11] ^ better -- previous version was a no-op because I had the values file precedence order wrong, and all I had to do was stare at it for ten minutes, send it to a teammate for help, and immediately see the problem
[22:45:37] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[22:45:49] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[22:45:51] Ha.
[22:46:01] helmchart is "simple", AKA a nightmare.
[22:46:08] (CR) Jforrester: [C:+1] mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1247694 (https://phabricator.wikimedia.org/T411807) (owner: RLazarus)
[22:47:43] (PS1) Jasmine: install_server: use UEFI for new control plane nodes wikikube-ctrl200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861)
[22:49:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T418465)', diff saved to https://phabricator.wikimedia.org/P89780 and previous config saved to /var/cache/conftool/dbconfig/20260303-224913-marostegui.json
[22:49:18] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[22:50:15] (PS2) Jasmine: install_server: use UEFI for new control plane nodes wikikube-ctrl200[4-5] [puppet] - https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861)
[22:54:06] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size - https://phabricator.wikimedia.org/T418745#11671125 (Ladsgroup) First, fixing the ones that are already explicitly settin...
[22:56:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - asw1-b3-magru:et-0/0/50 (Core: cr2-magru:et-0/0/1 {#70130}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=asw1-b3-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[22:56:51] !log bking@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker1028.eqiad.wmnet
[22:57:17] (CR) Scott French: [C:+1] "While I can't really speak to the content of the config authoritatively, this looks like a sensible way to test the config change! :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1247694 (https://phabricator.wikimedia.org/T411807) (owner: RLazarus)
[22:57:46] tgr_: looks like you're still at testservers? no rush but lmk whenever you're done, ready to proceed here :)
[22:58:15] !log tgr@deploy2002 tgr: Continuing with sync
[22:58:20] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp7008 [reason: lldpd packet drop issues]
[22:58:26] rzl: just finished testing
[22:58:37] tgr_: Does it work? Awesome.
[23:00:25] yeah although it seemed to work last week as well, will see if it survives bots with weird cookie handling this time
[23:00:33] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp7008.magru.wmnet [reason: lldpd packet drop issues]
[23:00:42] Fingers crossed.
[23:02:23] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247689|Do not invalidate anon sessions with non-anon JWT cookies (T415007)]], [[gerrit:1247690|Do not invalidate anon sessions with non-anon JWT cookies (T415007)]], [[gerrit:1247596|Enable JWT session cookie for bot passwords (all wikis) (attempt #2) (T415007)]] (duration: 21m 47s)
[23:02:27] T415007: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007
[23:02:34] rzl: over to you
[23:03:01] thanks!
[23:03:09] (CR) RLazarus: [C:+2] mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1247694 (https://phabricator.wikimedia.org/T411807) (owner: RLazarus)
[23:04:09] !log bking@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker1028.eqiad.wmnet
[23:04:15] James_F: I'll deploy that change which will make mcrouter pods appear with the new config, and then you should be able to play with mw-config as much as you like in mw-experimental in both DCs
[23:04:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P89781 and previous config saved to /var/cache/conftool/dbconfig/20260303-230421-marostegui.json
[23:04:22] * James_F nods.
[23:04:52] just waiting for gate-and-submit to do one and then the other
[23:05:01] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[23:05:11] !log ebernhardson@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[23:05:20] (Merged) jenkins-bot: mw-experimental: Add /*/wf-wan memcache routes and use in-pod mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1247694 (https://phabricator.wikimedia.org/T411807) (owner: RLazarus)
[23:05:39] James_F: it's also getting late for you, the mw-experimental part can happen on your own schedule too!
[23:05:50] Eh, it's only 18:00.
[23:07:05] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[23:08:50] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[23:08:58] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[23:10:05] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[23:10:17] James_F: all yours
[23:10:20] Ack.
[23:13:39] OK, I've applied my proposed change to mc.php but k8s-mw-experimental-eqiad and -codfw still exhibited the split-cache.
[23:15:06] hm, interesting
[23:15:18] They still work to cache things, though.
[23:15:34] So either my change isn't working/being picked up, or this isn't enough.
[23:15:48] which means either it's still using the old routing prefix (problem applying the mw-config) or the new routing isn't working as expected (problem with the new mcrouter config)
[23:15:50] yeah
[23:16:05] Ah, is the mw-wikifunctions re-routing taking precedence? I've never tried using mw-experimental for wikifunctions.org fixes.
[23:17:34] Let me see if I can do the more circuitous check from test.wikipedia.org instead.
[23:17:48] that's a part of WF request routing I haven't dug into as deeply as I want to yet, so all I can do is confirm you're asking the right question
[23:17:54] Ack.
[23:19:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P89782 and previous config saved to /var/cache/conftool/dbconfig/20260303-231929-marostegui.json
[23:20:04] ahh tracing to the rescue though https://trace.wikimedia.org/trace/c5ee3164f7251cac3119729d4a96ce6a
[23:22:23] So… it's not clear to me from that if that's our cache touches or someone else's.
[23:22:36] all the spans here are on a mw-experimental hostname but I can't find the wikifunctions cache lookups specifically
[23:22:36] yeah
[23:22:52] The wf-cache lookups should be OTel-spanned.
[23:24:25] And the non-WF code that runs on test.wikipedia.org doesn't give me a nice way to find out if it's cached.
[23:25:56] Let's see if I can pull something from mwscript in the shell.
[23:27:37] OK, I get $wgObjectCaches['mcrouter-wikifunctions']['routingPrefix'] to be "/codfw/wf-wan/" with my patch.
[23:27:49] So it's using it, and still seems to work.
[23:28:09] but not replicated in the way you expected?
[23:28:19] If we apply the chart changes to the rest of the mw-* world it should be deployable.
[23:28:40] I can't tell if it's the mw-wikifunctions server taking over either way.
[23:29:30] I've removed my local patch to /srv/mediawiki/wmf-config/mc.php so we're no longer in a dirty set-up.
[23:30:20] got it
[23:30:22] The docs on https://wikitech.wikimedia.org/wiki/Mw-experimental say I can use mw-experimental-mediawiki-image-update.service but apparently it doesn't exist.
[23:31:35] (CR) Jforrester: [C:+1] _mediawiki-common_: Add /*/wf-wan memcache routes [deployment-charts] - https://gerrit.wikimedia.org/r/1247677 (https://phabricator.wikimedia.org/T411807) (owner: RLazarus)
[23:32:39] hm, I'd like to validate that the mcrouter config is correct before we put it everywhere
[23:32:47] Ack.
[23:32:52] thinking through what that might look like
[23:33:37] it would be nice if we could send WF traffic from mw-experimental to mw-experimental instead of to mw-wikifunctions -- do you know if we have the right config hooks for that?
[23:34:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T418465)', diff saved to https://phabricator.wikimedia.org/P89783 and previous config saved to /var/cache/conftool/dbconfig/20260303-233436-marostegui.json
[23:34:40] T418465: Drop cuc_agent & cuc_ip from cu_changes, cule_agent & cule_ip from cu_log_event, and cupe_agent & cupe_ip from cu_private_event on WMF wikis - https://phabricator.wikimedia.org/T418465
[23:34:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1262.eqiad.wmnet with reason: Maintenance
[23:34:58] Yeah, good question.
[23:35:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1262 (T418465)', diff saved to https://phabricator.wikimedia.org/P89784 and previous config saved to /var/cache/conftool/dbconfig/20260303-233500-marostegui.json
[23:36:05] (PS2) Jforrester: mc: Shift the Wikifunctions MC route from /local/wf/ to //wf-wan/ [mediawiki-config] - https://gerrit.wikimedia.org/r/1247687 (https://phabricator.wikimedia.org/T411807)
[23:37:48] I don't think I can reason enough about it, sorry.
[23:38:03] no worries, I'm digging a little to see if I can find it
[23:38:06] Thanks.
[23:53:20] aside: it also turns out mw-experimental only gets traced at a 1% sampling rate, so I'm cranking that up to 100% which will make this kind of thing easier
[23:54:53] Ha, yes, that'd really help.
[23:58:22] (PS1) RLazarus: mw-experimental: Increase tracing sampling from 1% to 100% [deployment-charts] - https://gerrit.wikimedia.org/r/1247706