[00:00:15] (CR) Zabe: [C:+2] "..." [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248125 (https://phabricator.wikimedia.org/T419062) (owner: Zabe)
[00:00:51] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1248140
[00:08:57] (Merged) jenkins-bot: NewFilesPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.18) - https://gerrit.wikimedia.org/r/1248123 (https://phabricator.wikimedia.org/T419062) (owner: Zabe)
[00:09:04] (Merged) jenkins-bot: NewFilesPager: Properly support file schema migration read new [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248125 (https://phabricator.wikimedia.org/T419062) (owner: Zabe)
[00:10:07] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248125|NewFilesPager: Properly support file schema migration read new (T419062)]], [[gerrit:1248123|NewFilesPager: Properly support file schema migration read new (T419062)]]
[00:10:10] T419062: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'img_timestamp' in 'WHERE'Function: MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Specials\Pager\NewFilesPager)Query: SELECT /*! STRAIGHT_JOIN */ file_name AS - https://phabricator.wikimedia.org/T419062
[00:12:07] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248125|NewFilesPager: Properly support file schema migration read new (T419062)]], [[gerrit:1248123|NewFilesPager: Properly support file schema migration read new (T419062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:13:01] !log zabe@deploy2002 zabe: Continuing with sync
[00:18:59] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248125|NewFilesPager: Properly support file schema migration read new (T419062)]], [[gerrit:1248123|NewFilesPager: Properly support file schema migration read new (T419062)]] (duration: 08m 52s)
[00:19:03] T419062: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'img_timestamp' in 'WHERE'Function: MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Specials\Pager\NewFilesPager)Query: SELECT /*! STRAIGHT_JOIN */ file_name AS - https://phabricator.wikimedia.org/T419062
[00:19:27] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:26:40] (CR) Dzahn: [C:+2] zuul::main: add zuul client cert to full chain of trust [puppet] - https://gerrit.wikimedia.org/r/1248137 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[00:29:11] (PS1) Eevans: services: add linked-artifacts service [deployment-charts] - https://gerrit.wikimedia.org/r/1248148 (https://phabricator.wikimedia.org/T414112)
[00:34:13] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:39:29] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1248149
[00:39:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1248149 (owner: TrainBranchBot)
[00:46:51] (CR) Zabe: [C:+2] Stop writing to il_to on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1248021 (https://phabricator.wikimedia.org/T415787) (owner: Zabe)
[00:47:46] (Merged) jenkins-bot: Stop writing to il_to on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1248021 (https://phabricator.wikimedia.org/T415787) (owner: Zabe)
[00:48:18] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248021|Stop writing to il_to on small wikis (T415787)]]
[00:48:22] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787
[00:50:17] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248021|Stop writing to il_to on small wikis (T415787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:51:08] !log zabe@deploy2002 zabe: Continuing with sync
[00:54:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1248149 (owner: TrainBranchBot)
[00:54:13] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:55:07] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248021|Stop writing to il_to on small wikis (T415787)]] (duration: 06m 49s)
[00:55:11] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787
[00:58:23] (PS1) Zabe: Revert^2 "ImageListPager: Properly support file schema migration read new" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248153
[00:58:55] (PS1) Zabe: ImageListPager: Use correct name field for batch lookups [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248154 (https://phabricator.wikimedia.org/T418327)
[00:59:10] (PS2) Zabe: ImageListPager: Use correct name field for batch lookups [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248154 (https://phabricator.wikimedia.org/T418327)
[00:59:29] (CR) Zabe: [C:+2] Revert^2 "ImageListPager: Properly support file schema migration read new" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248153 (owner: Zabe)
[00:59:32] (CR) Zabe: [C:+2] ImageListPager: Use correct name field for batch lookups [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248154 (https://phabricator.wikimedia.org/T418327) (owner: Zabe)
[01:01:26] (PS1) Dzahn: zuul: zuul scheduler needs to also have updated cert path [puppet] - https://gerrit.wikimedia.org/r/1248155 (https://phabricator.wikimedia.org/T395938)
[01:03:11] (PS12) Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - https://gerrit.wikimedia.org/r/1248047
[01:03:54] (CR) Dzahn: [C:+2] zuul: zuul scheduler needs to also have updated cert path [puppet] - https://gerrit.wikimedia.org/r/1248155 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[01:09:27] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1248156
[01:09:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1248156 (owner: TrainBranchBot)
[01:15:21] (Merged) jenkins-bot: Revert^2 "ImageListPager: Properly support file schema migration read new" [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248153 (owner: Zabe)
[01:15:27] (CR) CI reject: [V:-1] ImageListPager: Use correct name field for batch lookups [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248154 (https://phabricator.wikimedia.org/T418327) (owner: Zabe)
[01:15:37] (CR) Zabe: [C:+2] "..." [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248154 (https://phabricator.wikimedia.org/T418327) (owner: Zabe)
[01:20:49] (Merged) jenkins-bot: ImageListPager: Use correct name field for batch lookups [core] (wmf/1.46.0-wmf.17) - https://gerrit.wikimedia.org/r/1248154 (https://phabricator.wikimedia.org/T418327) (owner: Zabe)
[01:21:39] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248154|ImageListPager: Use correct name field for batch lookups (T418327)]], [[gerrit:1248153|Revert^2 "ImageListPager: Properly support file schema migration read new"]]
[01:21:43] T418327: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Specials\Pager\ImageListPager): [1054] Unk - https://phabricator.wikimedia.org/T418327
[01:23:39] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248154|ImageListPager: Use correct name field for batch lookups (T418327)]], [[gerrit:1248153|Revert^2 "ImageListPager: Properly support file schema migration read new"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:25:06] !log zabe@deploy2002 zabe: Continuing with sync
[01:29:00] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248154|ImageListPager: Use correct name field for batch lookups (T418327)]], [[gerrit:1248153|Revert^2 "ImageListPager: Properly support file schema migration read new"]] (duration: 07m 21s)
[01:29:03] T418327: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Specials\Pager\ImageListPager): [1054] Unk - https://phabricator.wikimedia.org/T418327
[01:32:53] (CR) Zabe: [C:+2] Start reading from new file tables on medium wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1246099 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[01:33:47] (Merged) jenkins-bot: Start reading from new file tables on medium wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1246099 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[01:34:21] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1246099|Start reading from new file tables on medium wikis (T416548)]]
[01:34:25] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[01:36:24] !log zabe@deploy2002 zabe: Backport for [[gerrit:1246099|Start reading from new file tables on medium wikis (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:36:44] !log zabe@deploy2002 zabe: Continuing with sync
[01:38:10] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1248156 (owner: TrainBranchBot)
[01:40:37] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1246099|Start reading from new file tables on medium wikis (T416548)]] (duration: 06m 15s)
[01:40:41] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[01:52:54] (PS1) Zabe: Stop writing to il_to on medium size wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1248163 (https://phabricator.wikimedia.org/T415787)
[01:54:04] (CR) Zabe: [C:+2] Stop writing to il_to on medium size wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1248163 (https://phabricator.wikimedia.org/T415787) (owner: Zabe)
[01:54:54] (Merged) jenkins-bot: Stop writing to il_to on medium size wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1248163 (https://phabricator.wikimedia.org/T415787) (owner: Zabe)
[01:55:40] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248163|Stop writing to il_to on medium size wikis (T415787)]]
[01:55:44] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787
[01:57:42] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248163|Stop writing to il_to on medium size wikis (T415787)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:58:00] !log zabe@deploy2002 zabe: Continuing with sync
[02:01:54] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248163|Stop writing to il_to on medium size wikis (T415787)]] (duration: 06m 14s)
[02:01:58] T415787: Stop writing to il_to by setting imagelinks migration to write new - https://phabricator.wikimedia.org/T415787
[02:02:17] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:08:54] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:12] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 07m 55s)
[02:10:40] FIRING: [2x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:33:54] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:19:27] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:34:13] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:54:13] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:10:40] FIRING: [2x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:21:42] SRE, Infrastructure-Foundations, netops: Update esams network pop diagrams - https://phabricator.wikimedia.org/T368084#11676461 (ayounsi) LGTM thanks !
[06:33:49] (PS1) Marostegui: instances.yaml: Remove es1033 [puppet] - https://gerrit.wikimedia.org/r/1248309 (https://phabricator.wikimedia.org/T408772)
[06:34:46] (CR) Marostegui: [C:+2] instances.yaml: Remove es1033 [puppet] - https://gerrit.wikimedia.org/r/1248309 (https://phabricator.wikimedia.org/T408772) (owner: Marostegui)
[06:35:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1033 T408772', diff saved to https://phabricator.wikimedia.org/P89804 and previous config saved to /var/cache/conftool/dbconfig/20260305-063548-marostegui.json
[06:35:52] T408772: decommission es1033.eqiad.wmnet - https://phabricator.wikimedia.org/T408772
[06:43:31] (PS1) Kevin Bazira: ml-services: update embeddings isvc image [deployment-charts] - https://gerrit.wikimedia.org/r/1248310 (https://phabricator.wikimedia.org/T418976)
[06:50:28] (PS1) Marostegui: installserver: Do not format db2246 and db2247 [puppet] - https://gerrit.wikimedia.org/r/1248311
[06:52:58] (CR) Marostegui: [C:+2] installserver: Do not format db2246 and db2247 [puppet] - https://gerrit.wikimedia.org/r/1248311 (owner: Marostegui)
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T0700)
[07:00:04] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T0700).
[07:16:58] (PS1) Stang: Revert "zhwiki: Add 2026 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248314
[07:17:42] (PS2) Stang: Revert "zhwiki: Add 2026 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248314
[07:18:26] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248314 (owner: Stang)
[07:37:34] (PS1) Muehlenhoff: Remove bast4005 from list of active bastions [puppet] - https://gerrit.wikimedia.org/r/1248318 (https://phabricator.wikimedia.org/T418993)
[07:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:40:30] (CR) Muehlenhoff: [C:+2] Remove bast4005 from list of active bastions [puppet] - https://gerrit.wikimedia.org/r/1248318 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[07:42:43] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast4005.wikimedia.org
[07:42:55] (CR) Ozge: [C:+2] ml-services: update embeddings isvc image [deployment-charts] - https://gerrit.wikimedia.org/r/1248310 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[07:44:55] (Merged) jenkins-bot: ml-services: update embeddings isvc image [deployment-charts] - https://gerrit.wikimedia.org/r/1248310 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[07:47:13] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[07:47:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:47:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:48:48] !log uploaded bird2 2.18-1~wmf13u2 to the main component of trixie-wikimedia T413740
[07:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:51] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740
[07:50:25] FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:52:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:53:15] jmm@cumin2002 decommission (PID 3795348) is awaiting input
[07:55:09] (PS1) Muehlenhoff: Remove support for enabling Bird 2.18 selectively [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740)
[07:55:25] FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:42] (CR) CI reject: [V:-1] Remove support for enabling Bird 2.18 selectively [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[07:55:54] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4005.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:56:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4005.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:56:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:56:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast4005.wikimedia.org
[07:56:46] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11676534 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast4005.wikimedia.org` - bast4005.wikimedia.org (**PASS**)...
[07:58:27] (CR) Muehlenhoff: [C:+1] "All systems have been migrated to Bird 2.18 fleet-wide and I've uploaded Bird 2.18 to the main component of bookworm-wikimedia and trixie-" [puppet] - https://gerrit.wikimedia.org/r/1238007 (owner: Ssingh)
[08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T0800).
[08:00:05] kipfel: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:18] (PS2) Muehlenhoff: Remove support for enabling Bird 2.18 selectively [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740)
[08:00:27] o/
[08:02:40] (PS1) KartikMistry: WIP: machinetranslation: Reduce GUNICORN_WORKERS [deployment-charts] - https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058)
[08:02:57] !log uploaded openjdk-8 8u482-ga-1~deb11u1 to component/jdk8 of bullseye-wikimedia
[08:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:05] hi there, anyone can deploy?
[08:03:35] kipfel: yes I can!
[08:03:49] sorry I was dealing with some other things
[08:04:31] thanks a lot
[08:04:35] (CR) Brouberol: [C:+2] kafka-mirrormkaer: ensure consumer group names are the same than on the puppetized kafka clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1247066 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[08:04:49] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248314 (owner: Stang)
[08:05:43] kipfel: it started :-]
[08:05:45] (Merged) jenkins-bot: Revert "zhwiki: Add 2026 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248314 (owner: Stang)
[08:06:22] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1248314|Revert "zhwiki: Add 2026 CNY celebration logos"]]
[08:07:12] (CR) Elukey: [C:+1] docker base image build: Remove support for Buster [puppet] - https://gerrit.wikimedia.org/r/1247912 (owner: Muehlenhoff)
[08:07:45] (CR) Elukey: [C:+1] Remove obsolete spec tests [puppet] - https://gerrit.wikimedia.org/r/1247923 (owner: Muehlenhoff)
[08:08:31] !log hashar@deploy2002 hashar, stang: Backport for [[gerrit:1248314|Revert "zhwiki: Add 2026 CNY celebration logos"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:08:46] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: ms-fe1013 reports a backplane error - https://phabricator.wikimedia.org/T419010#11676545 (elukey) @VRiley-WMF yes please it is depooled and basically down, green light!
[08:10:47] (CR) Muehlenhoff: [C:+2] Remove obsolete spec tests [puppet] - https://gerrit.wikimedia.org/r/1247923 (owner: Muehlenhoff)
[08:11:12] hashar, i test this patch and LGTM
[08:11:37] kipfel: nice thank you!
[08:11:40] !log hashar@deploy2002 hashar, stang: Continuing with sync
[08:11:57] (CR) Muehlenhoff: [C:+2] docker base image build: Remove support for Buster [puppet] - https://gerrit.wikimedia.org/r/1247912 (owner: Muehlenhoff)
[08:12:40] (PS2) KartikMistry: WIP: machinetranslation: Optimize model loading and memory footprints [deployment-charts] - https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058)
[08:12:59] kipfel: in the future if there is nobody for deployments at this time of the day, feel free to ping me here on IRC. I am most often present but working on other things so I do not always look at this channel
[08:13:10] I am always happy to deploy patches
[08:13:36] get it, many thanks
[08:15:42] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248314|Revert "zhwiki: Add 2026 CNY celebration logos"]] (duration: 09m 19s)
[08:17:11] (CR) Muehlenhoff: [C:+2] Add library hint for mbedtls [puppet] - https://gerrit.wikimedia.org/r/1247991 (owner: Muehlenhoff)
[08:18:05] kipfel: everything looks good. Thank you for the logo change!
[08:18:20] :)
[08:18:33] (PS2) Muehlenhoff: Add repository sync definition for nodejs 24 [puppet] - https://gerrit.wikimedia.org/r/1248001 (https://phabricator.wikimedia.org/T418440)
[08:19:28] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:21:55] SRE-SLO, Data-Platform-SRE, ServiceOps new, Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11676564 (MLechvien-WMF) Tagging Data-Platform-SRE to assess when this can be completed
[08:22:09] (CR) Federico Ceratto: "I updated the CR to see what the puppet does and tested with a handful of nodes in the CI: https://puppet-compiler.wmflabs.org/output/1247" [puppet] - https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: Federico Ceratto)
[08:23:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[08:23:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[08:25:01] (PS1) Slyngshede: data.yaml: extend atitkov [puppet] - https://gerrit.wikimedia.org/r/1248393
[08:26:00] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1248393 (owner: Slyngshede)
[08:26:19] (CR) Slyngshede: [C:+2] data.yaml: extend atitkov [puppet] - https://gerrit.wikimedia.org/r/1248393 (owner: Slyngshede)
[08:27:00] !log installing mbedtls security updates
[08:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:44] (CR) Gehel: [C:+2] wdqs: remove query-legacy-full.wikidata.org - end of life [dns] - https://gerrit.wikimedia.org/r/1247926 (https://phabricator.wikimedia.org/T415073) (owner: Gehel)
[08:28:01] !log gehel@dns1004 START - running authdns-update
[08:29:16] !log gehel@dns1004 END - running authdns-update
[08:29:31] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247990 (https://phabricator.wikimedia.org/T418580) (owner: Mszwarc)
[08:30:26] (Merged) jenkins-bot: Drop 'centralnoticeadmin' from $wgOATHRequiredForGroups [mediawiki-config] - https://gerrit.wikimedia.org/r/1247990 (https://phabricator.wikimedia.org/T418580) (owner: Mszwarc)
[08:30:55] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1247990|Drop 'centralnoticeadmin' from $wgOATHRequiredForGroups (T418580)]]
[08:30:59] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580
[08:33:02] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1247990|Drop 'centralnoticeadmin' from $wgOATHRequiredForGroups (T418580)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:34:11] !log mszwarc@deploy2002 mszwarc: Continuing with sync [08:34:13] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:35:38] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/airflow-main: apply [08:35:49] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/airflow-main: apply [08:38:03] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247990|Drop 'centralnoticeadmin' from $wgOATHRequiredForGroups (T418580)]] (duration: 07m 07s) [08:38:07] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580 [08:39:04] (03PS1) 10Gehel: admin: add suecarmol to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) [08:39:52] (03CR) 10CI reject: [V:04-1] admin: add suecarmol to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) (owner: 10Gehel) [08:40:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [08:41:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11676620 (10ops-monitoring-bot) Draining ganeti4006.ulsfo.wmnet of running VMs [08:42:55] (03PS1) 10Muehlenhoff: Prepare ganeti4005 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1248396 (https://phabricator.wikimedia.org/T418993) [08:43:34] (03PS1) 10Ayounsi: ulsfo routed Ganeti: add private v4/v6 IPs [puppet] - 
10https://gerrit.wikimedia.org/r/1248397 (https://phabricator.wikimedia.org/T402259) [08:43:42] (03PS1) 10Daniel Kinzler: rest-gateway: test: check for www-authenticate error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) [08:44:19] (03PS2) 10Gehel: admin: add suecarmol to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) [08:45:03] (03CR) 10CI reject: [V:04-1] admin: add suecarmol to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) (owner: 10Gehel) [08:50:31] (03PS1) 10Bartosz Wójtowicz: inference-services: Deploy outlink-cache-adapter service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248399 (https://phabricator.wikimedia.org/T418493) [08:51:30] (03CR) 10Joal: [C:03+1] "LGTM except for test failing" [puppet] - 10https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) (owner: 10Gehel) [08:54:13] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:54:22] (03CR) 10Ayounsi: [C:03+1] Prepare ganeti4005 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1248396 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [08:57:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1248397 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [08:57:47] (03CR) 10Muehlenhoff: [C:03+2] Prepare ganeti4005 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1248396 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [09:02:13] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:08:00] ayounsi@cumin1003 netbox (PID 702685) is awaiting input [09:11:14] (03CR) 
Btullis: [C:-1] "The user suecarmol is in the `ldap_only_users:` section, so does not get rendered as a posix user account. That is why the test is failing" [puppet] - https://gerrit.wikimedia.org/r/1248395 (https://phabricator.wikimedia.org/T418664) (owner: Gehel) [09:11:23] (PS1) Brouberol: deployment_server: provision the kafka-mirrormaker kubeconfigs in the aux clusters [puppet] - https://gerrit.wikimedia.org/r/1248401 (https://phabricator.wikimedia.org/T417407) [09:11:57] (CR) CI reject: [V:-1] deployment_server: provision the kafka-mirrormaker kubeconfigs in the aux clusters [puppet] - https://gerrit.wikimedia.org/r/1248401 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol) [09:13:54] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:15:01] (PS1) Brouberol: aux-k8s: define the kafka-mirrormaker namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1248404 (https://phabricator.wikimedia.org/T417407) [09:15:03] (PS1) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) [09:15:04] (PS2) Brouberol: deployment_server: add the kafka-mirrormaker kubeconfigs in the aux clusters [puppet] - https://gerrit.wikimedia.org/r/1248401 (https://phabricator.wikimedia.org/T417407) [09:15:26] (CR) Elukey: [C:+1] aux-k8s: define the kafka-mirrormaker namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1248404 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol) [09:16:24] (PS2) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - 
https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) [09:16:37] (PS3) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) [09:17:04] (PS4) Brouberol: aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) [09:18:54] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:19:14] (CR) Btullis: [C:+2] Remove the cpufrequtils class from the hadoop workers [puppet] - https://gerrit.wikimedia.org/r/1248070 (https://phabricator.wikimedia.org/T415002) (owner: Btullis) [09:20:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1220.eqiad.wmnet [09:20:47] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06), Patch-For-Review: Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676743 (ops-monitoring-bot) Host an-worker1220.eqiad.wmnet rebooted by btullis@cumin... [09:23:11] (CR) Marostegui: [C:+1] "Can you run that also on some db11X hosts, like db1163 and see if something changes there?" 
[puppet] - https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: Federico Ceratto) [09:23:50] (CR) JavierMonton: [V:+1] deployment_server: add the kafka-mirrormaker kubeconfigs in the aux clusters [puppet] - https://gerrit.wikimedia.org/r/1248401 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol) [09:26:15] (PS2) Ayounsi: ulsfo routed Ganeti: add private v4/v6 IPs [puppet] - https://gerrit.wikimedia.org/r/1248397 (https://phabricator.wikimedia.org/T402259) [09:26:22] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1248397 (https://phabricator.wikimedia.org/T402259) (owner: Ayounsi) [09:27:30] (CR) JavierMonton: [C:+1] "It looks good to me, but I'm not confident enough with these deployment files, it'd be good if someone else can look at it." [deployment-charts] - https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol) [09:27:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1220.eqiad.wmnet [09:28:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1221.eqiad.wmnet [09:28:16] (CR) JavierMonton: [C:+1] aux-k8s: define the kafka-mirrormaker namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1248404 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol) [09:28:30] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06), Patch-For-Review: Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676753 (ops-monitoring-bot) Host an-worker1221.eqiad.wmnet rebooted by btullis@cumin... 
[09:31:57] (CR) Elukey: [C:+1] Add repository sync definition for nodejs 24 [puppet] - https://gerrit.wikimedia.org/r/1248001 (https://phabricator.wikimedia.org/T418440) (owner: Muehlenhoff) [09:32:18] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:32:27] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:36:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1221.eqiad.wmnet [09:36:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1222.eqiad.wmnet [09:37:09] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676838 (ops-monitoring-bot) Host an-worker1222.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[09:42:13] (CR) Ayounsi: [C:+2] ulsfo routed Ganeti: add private v4/v6 IPs [puppet] - https://gerrit.wikimedia.org/r/1248397 (https://phabricator.wikimedia.org/T402259) (owner: Ayounsi) [09:43:38] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add gw-virtual.ulsfo.wmnet - ayounsi@cumin1003" [09:44:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1222.eqiad.wmnet [09:44:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1223.eqiad.wmnet [09:45:01] (CR) David Caro: [C:+2] legacy_redirector: remove some disabled tools [puppet] - https://gerrit.wikimedia.org/r/1247963 (https://phabricator.wikimedia.org/T418829) (owner: David Caro) [09:45:13] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676863 (ops-monitoring-bot) Host an-worker1223.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [09:46:46] ayounsi@cumin1003 netbox (PID 702685) is awaiting input [09:51:17] (CR) Effie Mouzeli: [C:+1] "thanks for finding this!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1247706 (owner: RLazarus) [09:51:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1223.eqiad.wmnet [09:52:02] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1224.eqiad.wmnet [09:52:03] (CR) Effie Mouzeli: [C:+1] memcached: Update comment on TLS support [puppet] - https://gerrit.wikimedia.org/r/1247621 (owner: Muehlenhoff) [09:52:24] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676899 (ops-monitoring-bot) Host an-worker1224.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [09:52:43] (CR) Effie Mouzeli: [C:+1] benthos: add chart metadata [deployment-charts] - https://gerrit.wikimedia.org/r/1248039 (https://phabricator.wikimedia.org/T412693) (owner: Kamila Součková) [09:53:19] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8209/co" [puppet] - https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: Dzahn) [09:53:36] (PS5) Tiziano Fogli: thanos/rec_rules: add prometheus_ingested_metrics rec rules group [puppet] - https://gerrit.wikimedia.org/r/1248006 (https://phabricator.wikimedia.org/T415317) [09:57:31] (PS1) Ayounsi: Add esams routed ganeti VM ranges to network/data/data.yaml [puppet] - https://gerrit.wikimedia.org/r/1248414 (https://phabricator.wikimedia.org/T418993) [09:57:48] (PS2) Ayounsi: Add ulsfo routed ganeti VM ranges to network/data/data.yaml [puppet] - https://gerrit.wikimedia.org/r/1248414 (https://phabricator.wikimedia.org/T418993) [09:59:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1224.eqiad.wmnet 
[09:59:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1225.eqiad.wmnet [09:59:53] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676924 (ops-monitoring-bot) Host an-worker1225.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:00:26] (CR) Tiziano Fogli: [C:+2] "Manually tested with promtool, I'm self-merging." [puppet] - https://gerrit.wikimedia.org/r/1248006 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli) [10:01:23] (CR) Jelto: [V:+1 C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: Dzahn) [10:03:44] (CR) Federico Ceratto: "db1150, 1163 and 1198 had no changes as expected: https://puppet-compiler.wmflabs.org/output/1247996/8210/" [puppet] - https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: Federico Ceratto) [10:04:19] (CR) Marostegui: [C:+1] "Then I think we are good to merge." [puppet] - https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: Federico Ceratto) [10:07:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1225.eqiad.wmnet [10:07:06] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1226.eqiad.wmnet [10:07:33] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676934 (ops-monitoring-bot) Host an-worker1226.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[10:08:00] (PS1) Kevin Bazira: ml-services: add performance optimization env vars to embeddings isvc [deployment-charts] - https://gerrit.wikimedia.org/r/1248416 (https://phabricator.wikimedia.org/T418976) [10:08:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add gw-virtual.ulsfo.wmnet - ayounsi@cumin1003" [10:08:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:08:54] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti4005.ulsfo.wmnet [10:09:25] (PS5) Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [10:09:28] (PS1) Ayounsi: ulsfo: add Ganeti "customer" [homer/public] - https://gerrit.wikimedia.org/r/1248417 (https://phabricator.wikimedia.org/T418993) [10:10:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti4005.ulsfo.wmnet [10:10:46] (PS4) Abban Dunne: Add WMDE Fundraising banner event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1148322 [10:11:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4005.ulsfo.wmnet with OS bookworm [10:11:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [10:11:40] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11676943 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bookworm [10:14:19] (CR) Muehlenhoff: memcached: add memcached restart/reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1211089 
(https://phabricator.wikimedia.org/T408925) (owner: Effie Mouzeli) [10:14:43] (CR) Abban Dunne: "I've rebased and updated this to reflect the changes requested in https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-second" [mediawiki-config] - https://gerrit.wikimedia.org/r/1148322 (owner: Abban Dunne) [10:14:45] (CR) CI reject: [V:-1] memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: Effie Mouzeli) [10:14:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1226.eqiad.wmnet [10:15:02] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1227.eqiad.wmnet [10:15:28] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676960 (ops-monitoring-bot) Host an-worker1227.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[10:15:45] (CR) Jelto: [V:+1 C:+2] gerrit: limit access to http/https/ssh in firewall [puppet] - https://gerrit.wikimedia.org/r/1238400 (https://phabricator.wikimedia.org/T411895) (owner: Dzahn) [10:16:08] (PS1) Marostegui: orchestrator.sql.erb: Remove grants from dborch1001 [puppet] - https://gerrit.wikimedia.org/r/1248418 (https://phabricator.wikimedia.org/T416582) [10:16:46] (PS1) Kgraessle: Enable rr-ml AutoModerator CC Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1247639 (https://phabricator.wikimedia.org/T400727) [10:16:52] (CR) Marostegui: "CC @fceratto@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1248418 (https://phabricator.wikimedia.org/T416582) (owner: Marostegui) [10:16:56] (CR) Marostegui: [C:+2] orchestrator.sql.erb: Remove grants from dborch1001 [puppet] - https://gerrit.wikimedia.org/r/1248418 (https://phabricator.wikimedia.org/T416582) (owner: Marostegui) [10:17:28] (PS2) Bartosz Wójtowicz: inference-services: Deploy cache-adapter namespace and outlink-cache-adapter service. [deployment-charts] - https://gerrit.wikimedia.org/r/1248399 (https://phabricator.wikimedia.org/T418493) [10:18:10] (PS3) Bartosz Wójtowicz: inference-services: Deploy outlink-cache-adapter service. 
[deployment-charts] - https://gerrit.wikimedia.org/r/1248399 (https://phabricator.wikimedia.org/T418493) [10:18:37] (CR) Muehlenhoff: [C:+1] "Looks good" [homer/public] - https://gerrit.wikimedia.org/r/1248417 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi) [10:22:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1227.eqiad.wmnet [10:22:25] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1228.eqiad.wmnet [10:22:47] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11676983 (ops-monitoring-bot) Host an-worker1228.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:23:18] (PS6) Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [10:24:53] !log installing Java 8 security updates [10:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:21] (CR) CI reject: [V:-1] memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: Effie Mouzeli) [10:29:46] (PS1) Tiziano Fogli: thanos/rec_rules: change metric name [puppet] - https://gerrit.wikimedia.org/r/1248421 (https://phabricator.wikimedia.org/T415317) [10:30:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1228.eqiad.wmnet [10:30:21] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1229.eqiad.wmnet [10:30:50] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - 
https://phabricator.wikimedia.org/T415002#11676991 (ops-monitoring-bot) Host an-worker1229.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:31:41] (CR) CI reject: [V:-1] thanos/rec_rules: change metric name [puppet] - https://gerrit.wikimedia.org/r/1248421 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli) [10:32:17] (PS2) Tiziano Fogli: thanos/rec_rules: change metric name [puppet] - https://gerrit.wikimedia.org/r/1248421 (https://phabricator.wikimedia.org/T415317) [10:32:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4005.ulsfo.wmnet with reason: host reimage [10:37:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1229.eqiad.wmnet [10:37:44] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1230.eqiad.wmnet [10:37:53] (PS1) Slyngshede: P:idm auto expire permission requests [puppet] - https://gerrit.wikimedia.org/r/1248424 (https://phabricator.wikimedia.org/T416152) [10:38:06] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677013 (ops-monitoring-bot) Host an-worker1230.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [10:39:06] (CR) Tiziano Fogli: [C:+2] "Manually tested with promtool, I'm self-merging." 
[puppet] - https://gerrit.wikimedia.org/r/1248421 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli) [10:39:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4005.ulsfo.wmnet with reason: host reimage [10:40:23] (PS7) Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [10:41:40] !log elukey@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [10:42:37] (PS3) Aqu: dse-k8s airflow-analytics-test: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1247570 (https://phabricator.wikimedia.org/T415874) [10:45:34] (CR) CI reject: [V:-1] memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: Effie Mouzeli) [10:46:26] (CR) Ozge: [C:+2] ml-services: add performance optimization env vars to embeddings isvc [deployment-charts] - https://gerrit.wikimedia.org/r/1248416 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira) [10:47:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1230.eqiad.wmnet [10:47:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1231.eqiad.wmnet [10:47:50] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677049 (ops-monitoring-bot) Host an-worker1231.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[10:48:22] (Merged) jenkins-bot: ml-services: add performance optimization env vars to embeddings isvc [deployment-charts] - https://gerrit.wikimedia.org/r/1248416 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira) [10:48:46] (PS8) Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [10:52:17] (CR) Brouberol: [C:+2] dse-k8s airflow-analytics-test: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1247570 (https://phabricator.wikimedia.org/T415874) (owner: Aqu) [10:54:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:55:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1231.eqiad.wmnet [10:55:17] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1232.eqiad.wmnet [10:55:43] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677071 (ops-monitoring-bot) Host an-worker1232.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[10:55:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:57:03] (PS9) Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [10:58:17] (PS4) Federico Ceratto: mariadb: fix regexp in hieradata/regex.yaml [puppet] - https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) [10:59:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4005.ulsfo.wmnet with OS bookworm [10:59:17] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11677085 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bookworm completed: - ganeti4... [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1100) [11:00:10] (PS4) Hnowlan: grafana: replace grafana read-only with IDP-authentication [puppet] - https://gerrit.wikimedia.org/r/1248419 (https://phabricator.wikimedia.org/T418671) [11:00:23] !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [11:02:04] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[11:02:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1232.eqiad.wmnet [11:02:53] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1233.eqiad.wmnet [11:03:16] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677091 (ops-monitoring-bot) Host an-worker1233.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [11:04:17] (CR) AikoChou: "Maybe we should loop in SREs on this use case, since they have more experience with how we build non-isvc deployments." [deployment-charts] - https://gerrit.wikimedia.org/r/1248399 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz) [11:05:42] (CR) Federico Ceratto: [C:+2] mariadb: fix regexp in hieradata/regex.yaml [puppet] - https://gerrit.wikimedia.org/r/1247996 (https://phabricator.wikimedia.org/T416578) (owner: Federico Ceratto) [11:08:35] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1248424 (https://phabricator.wikimedia.org/T416152) (owner: Slyngshede) [11:09:17] (CR) Muehlenhoff: [C:+2] memcached: Update comment on TLS support [puppet] - https://gerrit.wikimedia.org/r/1247621 (owner: Muehlenhoff) [11:09:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1233.eqiad.wmnet [11:09:52] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1234.eqiad.wmnet [11:10:15] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677103 (ops-monitoring-bot) Host an-worker1234.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[11:10:22] (PS2) Muehlenhoff: varnish: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1247953 [11:10:34] (PS2) Muehlenhoff: profile::java Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1247915 [11:16:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1234.eqiad.wmnet [11:16:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1247953 (owner: Muehlenhoff) [11:16:53] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1235.eqiad.wmnet [11:17:16] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677109 (ops-monitoring-bot) Host an-worker1235.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [11:17:21] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1248414 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi) [11:21:52] SRE-SLO, Abstract Wikipedia team, ServiceOps new, Essential-Work: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11677113 (elukey) >>! In T418160#11673959, @Jdforrester-WMF wrote: > This metric is on the MW<->... 
[11:22:00] (CR) JMeybohm: [C:+1] Add a ValidatingAdmissionPolicy for use with analytics workloads [deployment-charts] - https://gerrit.wikimedia.org/r/1245403 (https://phabricator.wikimedia.org/T412925) (owner: Btullis) [11:24:07] (CR) JMeybohm: [C:+1] Apply the new VAP to several namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: Btullis) [11:24:38] (PS1) Muehlenhoff: Enable ganeti4005 as routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1248429 (https://phabricator.wikimedia.org/T418993) [11:24:41] (PS1) Elukey: role::kafka::test::broker: upgrade Kafka to 3.5 [puppet] - https://gerrit.wikimedia.org/r/1248430 (https://phabricator.wikimedia.org/T417035) [11:25:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1235.eqiad.wmnet [11:25:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1236.eqiad.wmnet [11:25:12] (PS12) Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [11:25:39] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677121 (ops-monitoring-bot) Host an-worker1236.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[11:27:59] (CR) Muehlenhoff: [C:+2] Enable ganeti4005 as routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1248429 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff) [11:29:21] !log remove ganeti4006 from ganeti/ulsfo cluster T418993 [11:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:24] T418993: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993 [11:31:28] PROBLEM - ganeti-noded running on ganeti4006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:31:28] PROBLEM - ganeti-confd running on ganeti4006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:31:46] (PS1) Muehlenhoff: Enable the RAPI cert for routed ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1248432 (https://phabricator.wikimedia.org/T418993) [11:32:50] FIRING: ProbeDown: Service ganeti4006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1236.eqiad.wmnet [11:35:04] (CR) Muehlenhoff: [C:+2] Enable the RAPI cert for routed ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1248432 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff) [11:36:40] (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8215/co" [puppet] - https://gerrit.wikimedia.org/r/1248424 (https://phabricator.wikimedia.org/T416152) (owner: Slyngshede) [11:39:13] FIRING: CertAlmostExpired: Certificate for 
service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:39:59] (CR) Slyngshede: [V:+1 C:+2] P:idm auto expire permission requests [puppet] - https://gerrit.wikimedia.org/r/1248424 (https://phabricator.wikimedia.org/T416152) (owner: Slyngshede) [11:44:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [11:48:57] (PS3) KartikMistry: WIP: machinetranslation: Optimize model loading and memory footprints [deployment-charts] - https://gerrit.wikimedia.org/r/1248388 (https://phabricator.wikimedia.org/T411058) [11:51:13] SRE, Infrastructure-Foundations, netops, ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11677162 (JMeybohm) [11:53:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [11:53:28] (CR) Effie Mouzeli: memcached: add memcached restart/reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: Effie Mouzeli) [11:53:50] (PS10) Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [11:55:40] FIRING: [2x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:37] (PS1) Muehlenhoff: Prepare ganeri4006 for reimage [puppet] - https://gerrit.wikimedia.org/r/1248438 (https://phabricator.wikimedia.org/T418993) [11:57:26] (PS2) Muehlenhoff: lvs: Run spec tests on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1240841 
[11:57:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [11:58:44] (03CR) 10CI reject: [V:04-1] memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [12:01:33] (03PS11) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [12:01:44] (03CR) 10Muehlenhoff: [C:03+2] Prepare ganeri4006 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1248438 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [12:02:50] RESOLVED: ProbeDown: Service ganeti4006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:03:53] (03PS12) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [12:04:43] (03CR) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [12:06:08] (03PS1) 10Giuseppe Lavagetto: cache::upload: increase limits for non-obvious bots for media files [puppet] - 10https://gerrit.wikimedia.org/r/1248443 (https://phabricator.wikimedia.org/T418323) [12:08:09] RECOVERY - SSH on an-worker1207 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:11:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1220.eqiad.wmnet [12:11:57] 10ops-eqiad, 06SRE, 06DC-Ops, 
06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677210 (10BTullis) I have nearly finished applying the new server settings. We can see how many of the hosts... [12:12:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677212 (10ops-monitoring-bot) Host an-worker1220.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:14:50] (03PS1) 10Jelto: gitlab-runner: remove buster default image [puppet] - 10https://gerrit.wikimedia.org/r/1248450 (https://phabricator.wikimedia.org/T384595) [12:17:48] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8216/console" [puppet] - 10https://gerrit.wikimedia.org/r/1248450 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [12:19:28] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:23:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4006.ulsfo.wmnet with OS bookworm [12:23:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1220.eqiad.wmnet [12:23:45] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1221.eqiad.wmnet [12:23:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11677256 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bookworm [12:24:06] 
10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677257 (10ops-monitoring-bot) Host an-worker1221.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [12:27:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1248450 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [12:29:50] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab-runner: remove buster default image [puppet] - 10https://gerrit.wikimedia.org/r/1248450 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [12:34:13] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:36:17] (03PS13) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [12:37:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1221.eqiad.wmnet [12:37:31] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1222.eqiad.wmnet [12:37:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677322 (10ops-monitoring-bot) Host an-worker1222.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[12:38:26] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:40:25] FIRING: [3x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:41:18] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:42:50] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11677339 (10MoritzMuehlenhoff) [12:43:25] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [12:45:43] (03CR) 10Elukey: [C:03+1] profile::java Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247915 (owner: 10Muehlenhoff) [12:46:46] !log elukey@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. [12:47:22] (03CR) 10Elukey: [C:03+2] role::kafka::test::broker: upgrade Kafka to 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/1248430 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [12:49:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1222.eqiad.wmnet [12:49:33] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1223.eqiad.wmnet [12:50:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677352 (10ops-monitoring-bot) Host an-worker1223.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[12:50:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [12:51:25] (03CR) 10Muehlenhoff: [C:03+2] profile::java Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1247915 (owner: 10Muehlenhoff) [12:51:35] (03PS1) 10Kevin Bazira: ml-services: update embeddings isvc to image that includes GPU device bitcode files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248462 (https://phabricator.wikimedia.org/T418976) [12:52:33] (03CR) 10Ozge: [C:03+2] ml-services: update embeddings isvc to image that includes GPU device bitcode files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248462 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [12:53:19] (03PS1) 10Muehlenhoff: dist-upgrade: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1248463 [12:53:48] (03CR) 10Ayounsi: [C:03+2] Add ulsfo routed ganeti VM ranges to network/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1248414 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:54:13] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:54:39] (03Merged) 10jenkins-bot: ml-services: update embeddings isvc to image that includes GPU device bitcode files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248462 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [12:54:52] (03CR) 10Clément Goubert: [C:04-1] "Removes `wikikube-ctrl100[56]*`" [puppet] - 10https://gerrit.wikimedia.org/r/1247695 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [12:55:05] (03PS1) 10Muehlenhoff: phab_deploy_finalize: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1248466 [12:55:17] (03CR) 10Ayounsi: [C:03+2] ulsfo: 
add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1248417 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:55:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:34] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [12:55:59] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [12:56:28] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1162.eqiad.wmnet [12:57:05] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1162.eqiad.wmnet [12:57:13] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11677358 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-worker1162.eqiad.wmnet completed: - wiki... [12:57:38] (03Merged) 10jenkins-bot: ulsfo: add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1248417 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:57:43] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11677360 (10Clement_Goubert) It's cordoned and depooled, fire away. 
[12:58:44] !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wikikube-worker1162.eqiad.wmnet with reason: dcops intervention [12:58:50] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11677368 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43aaa18b-9dfb-4905-89b1-4ef0ba742d2a) set by cgoubert@cumin1003 for 14 days, 0:00:00 on 1 host... [12:58:53] elukey@cumin1003 change-confluent-distro-version (PID 846990) is awaiting input [12:58:54] FIRING: [6x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:59:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:59:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11677371 (10Clement_Goubert) [12:59:34] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:00:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1300) [13:00:52] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:01:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1223.eqiad.wmnet [13:01:23] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1224.eqiad.wmnet [13:01:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677375 (10ops-monitoring-bot) Host an-worker1224.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:02:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:02:54] elukey@cumin1003 change-confluent-distro-version (PID 846990) is awaiting input [13:04:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:05:05] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:06:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:06:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new VIP for routed ganeti in ulsfo - jmm@cumin2002" [13:07:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new VIP for routed ganeti in ulsfo - jmm@cumin2002" [13:07:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:07:31] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:08:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:08:42] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:08:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:09:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4006.ulsfo.wmnet with OS bookworm [13:10:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11677388 (10ayounsi) [13:10:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11677389 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bookworm completed: - ganeti4... 
[13:13:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:15:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1224.eqiad.wmnet [13:15:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1225.eqiad.wmnet [13:15:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677397 (10ops-monitoring-bot) Host an-worker1225.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:16:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:18:33] (03CR) 10Jelto: [C:03+1] "looks reasonable. 
Have you tested the Apache config from https://puppet-compiler.wmflabs.org/output/1240197/5976/gerrit2003.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [13:19:13] FIRING: JobUnavailable: Reduced availability for job thanos-store in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:21:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:21:56] (03CR) 10Nemo bis: [C:03+1] "A limit of 800 requests per 10 seconds, if I read correctly, should help with the case of a page with ~700 standard thumbnails https://pha" [puppet] - 10https://gerrit.wikimedia.org/r/1248443 (https://phabricator.wikimedia.org/T418323) (owner: 10Giuseppe Lavagetto) [13:23:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:23:58] (03PS1) 10Jelto: gerrit: remove nftables_throttling::abusers [puppet] - 10https://gerrit.wikimedia.org/r/1248474 (https://phabricator.wikimedia.org/T417263) [13:26:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:26:28] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[13:26:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11677427 (10MoritzMuehlenhoff) [13:26:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1225.eqiad.wmnet [13:26:55] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1226.eqiad.wmnet [13:27:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677429 (10ops-monitoring-bot) Host an-worker1226.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:27:42] (03CR) 10Arnaudb: "thanks for the review! I was aiming to do a progressive rollout, starting on the spare instance with puppet disabled on the 2 others. The " [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [13:28:24] (03CR) 10Arnaudb: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1248474 (https://phabricator.wikimedia.org/T417263) (owner: 10Jelto) [13:33:15] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [13:33:19] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:33:23] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:33:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[13:34:20] (03CR) 10Muehlenhoff: [C:03+2] Add repository sync definition for nodejs 24 [puppet] - 10https://gerrit.wikimedia.org/r/1248001 (https://phabricator.wikimedia.org/T418440) (owner: 10Muehlenhoff) [13:34:35] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1199 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [13:35:09] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1199.eqiad.wmnet [13:35:56] !log installing glib2.0 security updates [13:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:16] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8217/co" [puppet] - 10https://gerrit.wikimedia.org/r/1248474 (https://phabricator.wikimedia.org/T417263) (owner: 10Jelto) [13:36:24] elukey@cumin1003 change-confluent-distro-version (PID 846990) is awaiting input [13:37:07] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [13:37:11] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:37:14] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:37:17] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:37:21] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:37:24] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[13:37:27] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [13:37:29] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [13:37:49] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [13:37:51] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:37:54] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:37:58] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [13:38:01] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:38:04] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:38:07] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:38:10] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:38:13] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:38:16] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:38:19] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:38:22] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:38:40] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: remove nftables_throttling::abusers [puppet] - 10https://gerrit.wikimedia.org/r/1248474 (https://phabricator.wikimedia.org/T417263) (owner: 10Jelto) [13:40:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1226.eqiad.wmnet [13:40:40] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1227.eqiad.wmnet [13:41:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677489 (10ops-monitoring-bot) Host an-worker1227.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:42:32] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache::upload: increase limits for non-obvious bots for media files [puppet] - 10https://gerrit.wikimedia.org/r/1248443 (https://phabricator.wikimedia.org/T418323) (owner: 10Giuseppe Lavagetto) [13:42:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1199.eqiad.wmnet [13:42:48] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:43:52] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:46:43] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:47:51] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:50:01] (03PS15) 10Arnaudb: gerrit: sync httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) [13:50:06] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:50:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:00] (03CR) 10Arnaudb: gerrit: sync httpd config to ATS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [13:51:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:51:52] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [13:52:06] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:52:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1227.eqiad.wmnet [13:52:16] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [13:52:17] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1228.eqiad.wmnet [13:52:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677534 (10ops-monitoring-bot) Host an-worker1228.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [13:53:17] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:53:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:55:00] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:56:08] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:56:17] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:57:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:58:38] (03PS14) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1400). [14:00:05] manfredi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:26] Sounds good to me! 
Thanks [14:00:26] (03PS1) 10Kevin Bazira: ml-services: update embeddings isvc to image that adds missing dev headers required by AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248480 (https://phabricator.wikimedia.org/T418976) [14:01:19] !log initialised ganeti02/ulsfo cluster T418993 [14:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:22] T418993: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993 [14:03:33] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129 (10BTullis) 03NEW [14:03:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1228.eqiad.wmnet [14:03:59] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1229.eqiad.wmnet [14:04:03] (03PS1) 10AikoChou: httpbb: fix rec-api-ng test [puppet] - 10https://gerrit.wikimedia.org/r/1248481 [14:04:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677579 (10ops-monitoring-bot) Host an-worker1229.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[14:04:40] (03PS1) 10Muehlenhoff: Add ganeti4006 to the routed Ganeti cluster in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1248482 (https://phabricator.wikimedia.org/T418993) [14:05:16] (03PS1) 10Zabe: SpecialWantedFiles: Use lt_title instead of lt_to [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248483 (https://phabricator.wikimedia.org/T299953) [14:05:58] !log imported nodejs 24.14.0-1nodesource1 to thirdparty/node24 T418440 [14:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] T418440: Create nodejs 24 production images - https://phabricator.wikimedia.org/T418440 [14:08:19] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [14:08:22] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [14:08:25] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:08:29] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:08:32] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:08:35] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [14:08:38] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [14:08:40] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [14:08:44] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . 
[14:08:46] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:08:50] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:08:53] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [14:08:56] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [14:09:00] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:09:03] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:09:06] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:09:09] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:09:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:09:15] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:09:18] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[14:10:31] (03PS1) 10Jelto: gerrit: remove CDN lookups from sshkey [puppet] - 10https://gerrit.wikimedia.org/r/1248484 (https://phabricator.wikimedia.org/T411895) [14:11:14] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1178.eqiad.wmnet [14:11:19] (03CR) 10Arnaudb: [C:03+1] gerrit: remove CDN lookups from sshkey [puppet] - 10https://gerrit.wikimedia.org/r/1248484 (https://phabricator.wikimedia.org/T411895) (owner: 10Jelto) [14:13:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:15:11] (03PS1) 10Giuseppe Lavagetto: varnish: raise limit for upload also for cache misses [puppet] - 10https://gerrit.wikimedia.org/r/1248487 (https://phabricator.wikimedia.org/T418323) [14:15:17] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8218/console" [puppet] - 10https://gerrit.wikimedia.org/r/1248484 (https://phabricator.wikimedia.org/T411895) (owner: 10Jelto) [14:15:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1229.eqiad.wmnet [14:15:55] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1230.eqiad.wmnet [14:16:00] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: remove CDN lookups from sshkey [puppet] - 10https://gerrit.wikimedia.org/r/1248484 (https://phabricator.wikimedia.org/T411895) (owner: 10Jelto) [14:16:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677607 (10ops-monitoring-bot) Host an-worker1230.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[14:16:37] (03CR) 10Ssingh: [C:03+1] varnish: raise limit for upload also for cache misses [puppet] - 10https://gerrit.wikimedia.org/r/1248487 (https://phabricator.wikimedia.org/T418323) (owner: 10Giuseppe Lavagetto) [14:16:47] (03CR) 10Vgutierrez: [C:03+1] varnish: raise limit for upload also for cache misses [puppet] - 10https://gerrit.wikimedia.org/r/1248487 (https://phabricator.wikimedia.org/T418323) (owner: 10Giuseppe Lavagetto) [14:17:11] (03CR) 10Ssingh: [C:03+2] varnish: raise limit for upload also for cache misses [puppet] - 10https://gerrit.wikimedia.org/r/1248487 (https://phabricator.wikimedia.org/T418323) (owner: 10Giuseppe Lavagetto) [14:17:29] (03PS6) 10Bking: dse-k8s-ingress: Enable active-active [puppet] - 10https://gerrit.wikimedia.org/r/1238832 (https://phabricator.wikimedia.org/T396478) [14:18:47] (03PS1) 10Muehlenhoff: Add a node24 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1248488 (https://phabricator.wikimedia.org/T418440) [14:19:12] (03CR) 10Jforrester: "Thank you! Do you want me to create the image using the package, like in I4f8a0047ce013ea3c28b2cc5b0a44220dcd75b58?" [puppet] - 10https://gerrit.wikimedia.org/r/1248001 (https://phabricator.wikimedia.org/T418440) (owner: 10Muehlenhoff) [14:20:02] (03CR) 10Jforrester: [C:03+1] "Ha, you're too fast for me. Looks good!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1248488 (https://phabricator.wikimedia.org/T418440) (owner: 10Muehlenhoff) [14:21:02] (03CR) 10Muehlenhoff: "Thanks for the quick review :-)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1248488 (https://phabricator.wikimedia.org/T418440) (owner: 10Muehlenhoff) [14:21:12] (03CR) 10Ssingh: [C:04-2] "Many thanks for all the work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1238007 (owner: 10Ssingh) [14:21:19] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1238007 (owner: 10Ssingh) [14:21:42] (03PS3) 10Ssingh: bird: add return type for function (bool) [puppet] - 10https://gerrit.wikimedia.org/r/1238007 [14:23:13] (03PS16) 10Bking: dse-k8s: Enable active/active for dse-k8s clusters (1/2) [dns] - 10https://gerrit.wikimedia.org/r/1238441 (https://phabricator.wikimedia.org/T396478) [14:23:19] (03CR) 10Bking: [C:03+2] dse-k8s: Enable active/active for dse-k8s clusters (1/2) [dns] - 10https://gerrit.wikimedia.org/r/1238441 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [14:24:13] !log bking@dns1004 START - running authdns-update [14:24:44] (03PS1) 10AikoChou: httpbb: fix ores-legacy test [puppet] - 10https://gerrit.wikimedia.org/r/1248491 [14:25:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:25:17] (03PS1) 10Elukey: profile::kafka::broker: update authorizer class for kafka 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/1248492 (https://phabricator.wikimedia.org/T416670) [14:25:39] (03PS2) 10AikoChou: httpbb: fix ores-legacy test [puppet] - 10https://gerrit.wikimedia.org/r/1248491 [14:27:10] (03Abandoned) 10Elukey: profile::kafka::broker: allow to force openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/1237507 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:27:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1230.eqiad.wmnet [14:27:32] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1231.eqiad.wmnet [14:27:34] (03CR) 
10Elukey: [C:03+2] profile::kafka::broker: update authorizer class for kafka 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/1248492 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [14:27:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11677653 (10ops-monitoring-bot) Host an-worker1231.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [14:28:04] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11677654 (10Papaul) [14:28:07] (03PS1) 10Zabe: Stop writing to il_to on all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248493 (https://phabricator.wikimedia.org/T415787) [14:28:13] (03PS2) 10AikoChou: httpbb: fix rec-api-ng test [puppet] - 10https://gerrit.wikimedia.org/r/1248481 [14:28:16] !log sukhe@dns1004 START - running authdns-update [14:28:40] !log elukey@cumin1003 END (FAIL) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=99) Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. 
[14:29:11] (03CR) 10Bking: [C:03+2] dse-k8s-ingress: Enable active-active [puppet] - 10https://gerrit.wikimedia.org/r/1238832 (https://phabricator.wikimedia.org/T396478) (owner: 10Bking) [14:29:31] (03PS1) 10Ssingh: Revert "dse-k8s: Enable active/active for dse-k8s clusters (1/2)" [dns] - 10https://gerrit.wikimedia.org/r/1248494 [14:29:45] (03PS9) 10Andrew Bogott: toolforge etcdctl: update cert flag names [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) [14:29:47] (03PS13) 10Andrew Bogott: toolforge etcd: update handling of 'member list' output [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) [14:30:16] (03CR) 10Muehlenhoff: memcached: add memcached restart/reboot cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [14:30:42] (03CR) 10Ssingh: [C:03+2] Revert "dse-k8s: Enable active/active for dse-k8s clusters (1/2)" [dns] - 10https://gerrit.wikimedia.org/r/1248494 (owner: 10Ssingh) [14:30:49] !log sukhe@dns1004 START - running authdns-update [14:31:18] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11677666 (10MoritzMuehlenhoff) [14:31:21] PROBLEM - Host an-worker1178 is DOWN: PING CRITICAL - Packet loss = 100% [14:32:03] !log sukhe@dns1004 END - running authdns-update [14:32:16] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1010 [14:32:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1010 [14:36:10] (03CR) 10Majavah: [C:04-1] toolforge etcdctl: update cert flag names (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248027 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [14:37:44] (03CR) 10Muehlenhoff: [V:03+2 
C:03+2] Add a node24 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1248488 (https://phabricator.wikimedia.org/T418440) (owner: 10Muehlenhoff) [14:38:27] (03PS1) 10Bking: Revert^2 "dse-k8s: Enable active/active for dse-k8s clusters (1/2)" [dns] - 10https://gerrit.wikimedia.org/r/1248495 [14:38:36] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1010 [14:38:57] (03CR) 10Bking: [V:03+2 C:03+2] Revert^2 "dse-k8s: Enable active/active for dse-k8s clusters (1/2)" [dns] - 10https://gerrit.wikimedia.org/r/1248495 (owner: 10Bking) [14:38:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1010 [14:39:48] (03PS1) 10Elukey: confluent: update kafka.sh with kafka-leader-election [puppet] - 10https://gerrit.wikimedia.org/r/1248496 (https://phabricator.wikimedia.org/T416670) [14:40:06] (03CR) 10Majavah: "The json output contains the exact same numbers as what's encoded as hex in the non-json format, can this not use the json output anyway a" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1248047 (https://phabricator.wikimedia.org/T361237) (owner: 10Andrew Bogott) [14:42:33] PROBLEM - gdnsd checkconf #page on dns3004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:42:39] ^ ok [14:42:43] wow [14:42:46] yep [14:42:54] expected? 
[14:42:57] (03PS1) 10Bking: Revert "dse-k8s-ingress: Enable active-active" [puppet] - 10https://gerrit.wikimedia.org/r/1248499 [14:43:04] (03CR) 10Bking: [V:03+2 C:03+2] Revert "dse-k8s-ingress: Enable active-active" [puppet] - 10https://gerrit.wikimedia.org/r/1248499 (owner: 10Bking) [14:43:05] not really but yes in the sense that it failed [14:43:16] sorry for the noise [14:43:17] reverting [14:43:31] PROBLEM - gdnsd checkconf #page on dns1006 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:43:34] sukhe: need any help or assistance? [14:43:39] volans: thanks, [14:43:40] reverting [14:43:41] I'm acking the pages fwiw [14:43:41] !ack [14:43:42] 7716 (ACKED) dns1006/gdnsd checkconf (paged) [14:44:29] PROBLEM - gdnsd checkconf #page on dns2005 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:30] PROBLEM - gdnsd checkconf #page on dns7001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:31] PROBLEM - gdnsd checkconf #page on dns7002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:32] PROBLEM - gdnsd checkconf #page on dns4003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:33] PROBLEM - gdnsd checkconf #page on dns1004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:34] PROBLEM - gdnsd checkconf #page on dns1005 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:35] PROBLEM - gdnsd checkconf #page on dns4004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:35] !ack [14:44:35] PROBLEM - gdnsd checkconf #page on dns6002 is CRITICAL: CRITICAL: gdnsd -S 
checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:36] 7717 (ACKED) dns2005/gdnsd checkconf (paged) [14:44:36] 7718 (ACKED) dns7002/gdnsd checkconf (paged) [14:44:36] PROBLEM - gdnsd checkconf #page on dns2004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:37] PROBLEM - gdnsd checkconf #page on dns2006 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:38] PROBLEM - gdnsd checkconf #page on dns3003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:39] PROBLEM - gdnsd checkconf #page on dns6001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:40] PROBLEM - gdnsd checkconf #page on dns5004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:40] PROBLEM - gdnsd checkconf #page on dns5003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:44:42] sukhe: leave the pages to me [14:44:46] do the revert ;) [14:44:46] revert in progress [14:44:51] volans: yep, it's rolling out [14:45:00] <3 great [14:45:12] should be resolving soon [14:45:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11677713 (10MoritzMuehlenhoff) [14:45:31] RECOVERY - gdnsd checkconf #page on dns2005 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:32] RECOVERY - gdnsd checkconf #page on dns7002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:32] thankfully gdnsd is quite smart so nothing is broken as far as the DNS queries go but yeah [14:45:32] RECOVERY - gdnsd checkconf #page on dns7001 
is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:33] RECOVERY - gdnsd checkconf #page on dns4003 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:34] RECOVERY - gdnsd checkconf #page on dns1005 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:35] RECOVERY - gdnsd checkconf #page on dns1004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:36] RECOVERY - gdnsd checkconf #page on dns1006 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:36] RECOVERY - gdnsd checkconf #page on dns6001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:37] RECOVERY - gdnsd checkconf #page on dns4004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:38] RECOVERY - gdnsd checkconf #page on dns3003 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:39] RECOVERY - gdnsd checkconf #page on dns6002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:40] RECOVERY - gdnsd checkconf #page on dns2006 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:41] RECOVERY - gdnsd checkconf #page on dns2004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:42] RECOVERY - gdnsd checkconf #page on dns3004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:42] RECOVERY - gdnsd checkconf #page on dns5004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:43] RECOVERY - gdnsd checkconf #page on dns5003 is OK: OK: gdnsd -S checkconf 
success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [14:45:46] yeah I knew that [14:45:53] right, you wrote the CI :P [14:46:03] :D [14:46:17] I wrote the additional CI, gdnsd internally it's already super safe [14:46:29] yep, all hail Brandon [14:46:39] should all be resolved, thanks folks and sorry for the noise [14:46:52] I'm curious about the error now [14:46:57] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update embeddings isvc to image that adds missing dev headers required by AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248480 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [14:47:13] (03CR) 10Muehlenhoff: [C:03+2] Add separate puppetserver hooks for the private repo [puppet] - 10https://gerrit.wikimedia.org/r/1247976 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:47:57] free surprise check to make sure the alerting still works :D [14:48:19] PROBLEM - Host an-worker1231 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:04] (03Merged) 10jenkins-bot: ml-services: update embeddings isvc to image that adds missing dev headers required by AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248480 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [14:50:12] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1248419 (https://phabricator.wikimedia.org/T418671) (owner: 10Hnowlan) [14:50:52] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[14:51:21] (03CR) 10Ayounsi: [C:03+1] Add ganeti4006 to the routed Ganeti cluster in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1248482 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [14:51:33] sukhe: as a potential improvement, maybe traffic could look at having AM aggregating them, if *all* fail no need to page for every one :D [14:52:00] volans: yeah, this and the authdns-update not run need to be moved to AM [14:52:10] FIRING: [3x] GanetiBGPDown: BGP session down between ganeti4005 and cr3-ulsfo - group Ganeti6 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [14:52:11] at least the failure one is so rare that we haven't worried about it [14:52:20] it's on the list™ [14:52:34] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti4006 to the routed Ganeti cluster in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1248482 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [14:52:36] :D [14:52:39] (03PS13) 10Elukey: WIP: add sre.kafka.change-confluent-distro-version [cookbooks] - 10https://gerrit.wikimedia.org/r/1247942 (https://phabricator.wikimedia.org/T417035) [14:53:03] !log elukey@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. 
[14:53:50] (03PS1) 10Federico Ceratto: mariadb: fix dbproxy* in hieradata/regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1248489 (https://phabricator.wikimedia.org/T416578) [14:54:40] (03CR) 10Federico Ceratto: "Tested against all dbproxy* hosts in https://puppet-compiler.wmflabs.org/output/1248489/8220/" [puppet] - 10https://gerrit.wikimedia.org/r/1248489 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [14:55:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [14:56:02] !jouncebot next [14:56:02] a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [14:56:37] (03CR) 10SBassett: [C:03+1] grafana: replace grafana read-only with IDP-authentication [puppet] - 10https://gerrit.wikimedia.org/r/1248419 (https://phabricator.wikimedia.org/T418671) (owner: 10Hnowlan) [14:56:43] jouncebot: nowandnext [14:56:43] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1400) [14:56:44] In 0 hour(s) and 33 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1530) [14:57:01] anyone mind if i sneak a config patch into the end of backport window? 
[14:57:10] FIRING: [6x] GanetiBGPDown: BGP session down between ganeti4005 and cr4-ulsfo - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [14:57:23] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [14:58:32] (03CR) 10Federico Ceratto: "A bit more cleanup after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1247996" [puppet] - 10https://gerrit.wikimedia.org/r/1248489 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [14:58:44] (03PS1) 10Muehlenhoff: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1248503 [14:58:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [14:59:46] (03Merged) 10jenkins-bot: cirrus: Add semantic search test cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1244713 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [15:00:05] (03CR) 10Muehlenhoff: [C:03+2] Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1248503 (owner: 10Muehlenhoff) [15:00:46] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1244713|cirrus: Add semantic search test cluster (T413969)]] [15:00:49] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [15:01:20] (03PS1) 10Ayounsi: Routed Ganeti: allow setting an explicit neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/1248504 (https://phabricator.wikimedia.org/T418993) [15:01:53] (03CR) 10CI reject: [V:04-1] Routed Ganeti: allow setting an explicit neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/1248504 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [15:02:43] !log jhancock@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host contint2003 [15:02:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host contint2003 [15:02:56] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1244713|cirrus: Add semantic search test cluster (T413969)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:03:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:03:17] !log sukhe@dns1004 START - running authdns-update [15:03:23] (03CR) 10MVernon: [C:03+1] "Looks reasonable to me." [puppet] - 10https://gerrit.wikimedia.org/r/1248489 (https://phabricator.wikimedia.org/T416578) (owner: 10Federico Ceratto) [15:04:08] (03PS2) 10Ayounsi: Routed Ganeti: allow setting an explicit neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/1248504 (https://phabricator.wikimedia.org/T418993) [15:04:27] !log sukhe@dns1004 END - running authdns-update [15:06:05] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [15:06:14] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248504 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [15:07:10] FIRING: [6x] GanetiBGPDown: BGP session down between ganeti4005 and cr4-ulsfo - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [15:10:04] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244713|cirrus: Add semantic search test cluster (T413969)]] (duration: 09m 18s) [15:10:07] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [15:10:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:37] (03PS1) 10Ebernhardson: cirrus: Correct semantic builder config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248506 [15:11:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. [15:12:33] (03PS2) 10Ebernhardson: cirrus: Correct semantic builder config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248506 (https://phabricator.wikimedia.org/T413969) [15:13:00] (03CR) 10Dzahn: [C:03+1] gerrit: remove nftables_throttling::abusers [puppet] - 10https://gerrit.wikimedia.org/r/1248474 (https://phabricator.wikimedia.org/T417263) (owner: 10Jelto) [15:14:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248506 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [15:15:08] (03PS1) 10TChin: [eventgate] bump to v1.28.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248507 (https://phabricator.wikimedia.org/T409106) [15:15:30] (03Merged) 10jenkins-bot: cirrus: Correct semantic builder config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248506 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [15:15:59] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1248506|cirrus: Correct semantic builder config (T413969)]] [15:16:03] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [15:16:06] jhancock@cumin2002 provision (PID 3898017) is awaiting input [15:17:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [15:18:03] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1248506|cirrus: Correct semantic builder config (T413969)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:19:46] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [15:22:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:23:39] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248506|cirrus: Correct semantic builder config (T413969)]] (duration: 07m 39s) [15:23:42] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [15:25:03] (03PS1) 10Ebernhardson: cirrus: Align semanticsearch cluster group name with routing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248508 (https://phabricator.wikimedia.org/T413969) [15:25:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:30] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['contint2003'] [15:25:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['contint2003'] [15:25:43] (03PS1) 10Mszwarc: Disable custom JS for a moment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248509 [15:26:05] jouncebot: nowandnext [15:26:05] No deployments scheduled for the next 0 hour(s) and 3 minute(s) 
[15:26:05] In 0 hour(s) and 3 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1530) [15:26:06] (03CR) 10Dreamy Jazz: [C:03+1] Disable custom JS for a moment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248509 (owner: 10Mszwarc) [15:26:09] we're deploying a patch now [15:26:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248509 (owner: 10Mszwarc) [15:26:33] (03CR) 10CI reject: [V:04-1] Disable custom JS for a moment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248509 (owner: 10Mszwarc) [15:26:58] (03Abandoned) 10C. Scott Ananian: Localisation updates from https://translatewiki.net. [extensions/ParserMigration] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1247648 (owner: 10C. Scott Ananian) [15:27:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [15:27:15] (03PS2) 10Mszwarc: Disable custom JS for a moment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248509 [15:27:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248509 (owner: 10Mszwarc) [15:28:35] (03Merged) 10jenkins-bot: Disable custom JS for a moment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248509 (owner: 10Mszwarc) [15:29:05] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1248509|Disable custom JS for a moment]] [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1530) [15:30:21] (03CR) 10Herron: [C:03+1] grafana: replace grafana 
read-only with IDP-authentication [puppet] - 10https://gerrit.wikimedia.org/r/1248419 (https://phabricator.wikimedia.org/T418671) (owner: 10Hnowlan) [15:30:29] (03CR) 10Herron: [C:03+2] rotate large (>50G/day) logs hourly [puppet] - 10https://gerrit.wikimedia.org/r/1245514 (https://phabricator.wikimedia.org/T418612) (owner: 10Herron) [15:30:37] 06SRE, 06ServiceOps new, 10ServiceOps-Mediawiki: Migrate Service Ops Docker images running in production away from Bullseye - https://phabricator.wikimedia.org/T418200#11677955 (10Blake) p:05Triage→03Medium [15:31:09] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1248509|Disable custom JS for a moment]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:31:15] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1178.eqiad.wmnet [15:31:44] !log mszwarc@deploy2002 mszwarc: Continuing with sync [15:32:04] !log taavi@cumin1003 dbctl commit (dc=all): 'set global ro', diff saved to https://phabricator.wikimedia.org/P89808 and previous config saved to /var/cache/conftool/dbconfig/20260305-153203-taavi.json [15:35:36] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248509|Disable custom JS for a moment]] (duration: 06m 31s) [15:35:37] mszwarc@deploy2002: Failed to log message to wiki. Somebody should check the error logs. 
[15:36:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:37:08] FIRING: [2x] UdpIRCStreamThroughput: irc1003:16667 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream - https://alerts.wikimedia.org/?q=alertname%3DUdpIRCStreamThroughput [15:38:55] (03CR) 10Phuedx: [C:03+1] [eventgate] bump to v1.28.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248507 (https://phabricator.wikimedia.org/T409106) (owner: 10TChin) [15:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:39:50] kostajh: did you just have the one? I'm trying to ship one more mw-config patch if there is time/space [15:40:00] ebernhardson: not now [15:40:03] kk [15:40:27] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1002 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [15:40:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11678027 (10Jclark-ctr) a:03Jclark-ctr [15:41:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:42:48] (03CR) 10JMeybohm: [C:03+1] deployment_server: add the kafka-mirrormaker kubeconfigs in the aux clusters [puppet] - 10https://gerrit.wikimedia.org/r/1248401 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:43:07] (03CR) 10JMeybohm: [C:03+1] aux-k8s: define the kafka-mirrormaker namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248404 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:44:37] 06SRE, 06ServiceOps new, 10Wikibase GraphQL, 06Wikibase Reuse Team, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11678061 (10Blake) Hey folks, would it be possible to get some more detail about what assistance is required here? Is apply... [15:47:18] (03PS1) 10Majavah: Enforce CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248515 [15:47:33] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1231.eqiad.wmnet [15:47:34] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:47:37] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1232.eqiad.wmnet [15:47:38] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. 
[15:47:42] (03CR) 10Ladsgroup: [C:03+1] Enforce CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248515 (owner: 10Majavah) [15:47:51] (03CR) 10Kosta Harlan: [C:03+1] Enforce CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248515 (owner: 10Majavah) [15:47:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11678077 (10ops-monitoring-bot) Host an-worker1232.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [15:48:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248515 (owner: 10Majavah) [15:49:09] (03CR) 10JMeybohm: [C:04-1] aux-k8s: define the kafka-mirrormaker-jumbo-eqiad-to-test-eqiad releases (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248405 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:49:24] (03Merged) 10jenkins-bot: Enforce CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248515 (owner: 10Majavah) [15:49:56] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1248515|Enforce CSP]] [15:49:57] taavi@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [15:52:05] !log taavi@deploy2002 taavi: Backport for [[gerrit:1248515|Enforce CSP]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:52:08] taavi@deploy2002: Failed to log message to wiki. Somebody should check the error logs. 
[15:52:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [15:53:25] !log taavi@deploy2002 taavi: Continuing with sync [15:53:25] taavi@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [15:54:12] (03PS3) 10Ayounsi: Routed Ganeti: allow setting an explicit neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/1248504 (https://phabricator.wikimedia.org/T418993) [15:54:41] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [15:54:47] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [15:54:50] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [15:55:53] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248504 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [15:57:14] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248515|Enforce CSP]] (duration: 07m 18s) [15:57:15] taavi@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [15:58:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host contint2003.wikimedia.org with OS bookworm [15:58:02] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[15:58:12] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11678128 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cont... [15:59:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1248504 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [15:59:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1232.eqiad.wmnet [15:59:24] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:59:27] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1233.eqiad.wmnet [15:59:28] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [15:59:41] RESOLVED: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [15:59:47] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [15:59:50] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [15:59:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11678135 (10ops-monitoring-bot) Host an-worker1233.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[16:01:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [16:01:11] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:01:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:01:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'One sec', diff saved to https://phabricator.wikimedia.org/P89809 and previous config saved to /var/cache/conftool/dbconfig/20260305-160140-ladsgroup.json [16:01:42] ladsgroup@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [16:02:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [16:03:49] !log oblivian@cumin1003 dbctl commit (dc=all): 'read only s6', diff saved to https://phabricator.wikimedia.org/P89810 and previous config saved to /var/cache/conftool/dbconfig/20260305-160348-oblivian.json [16:03:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Work done', diff saved to https://phabricator.wikimedia.org/P89811 and previous config saved to /var/cache/conftool/dbconfig/20260305-160354-ladsgroup.json [16:03:55] ladsgroup@cumin1003: Failed to log message to wiki. Somebody should check the error logs. 
[16:05:03] (03PS1) 10Kevin Bazira: ml-services: increase memory in embeddings isvc to fix OOM issue caused AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248522 (https://phabricator.wikimedia.org/T418976) [16:07:08] FIRING: [2x] UdpIRCStreamThroughput: irc1003:16667 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream - https://alerts.wikimedia.org/?q=alertname%3DUdpIRCStreamThroughput [16:08:37] (03CR) 10Kevin Bazira: [C:03+2] ml-services: increase memory in embeddings isvc to fix OOM issue caused AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248522 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [16:08:54] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [16:09:03] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:10:07] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [16:10:09] jiji@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:10:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1233.eqiad.wmnet [16:10:17] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [16:10:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1234.eqiad.wmnet [16:10:21] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. 
[16:10:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11678176 (10ops-monitoring-bot) Host an-worker1234.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... [16:10:48] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [16:10:49] jiji@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:11:07] (03Merged) 10jenkins-bot: ml-services: increase memory in embeddings isvc to fix OOM issue caused AITER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248522 (https://phabricator.wikimedia.org/T418976) (owner: 10Kevin Bazira) [16:11:12] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [16:11:13] jiji@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:11:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:11:41] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [16:11:41] jiji@deploy2002: Failed to log message to wiki. Somebody should check the error logs. 
[16:12:08] FIRING: [2x] UdpIRCStreamThroughput: irc1003:16667 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream - https://alerts.wikimedia.org/?q=alertname%3DUdpIRCStreamThroughput [16:12:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [16:13:31] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [16:13:32] kevinbazira@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:14:50] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti4006.ulsfo.wmnet [16:14:51] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:15:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti4006.ulsfo.wmnet [16:15:22] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:16:41] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [16:16:42] jhancock@cumin2002 reimage (PID 3909438) is awaiting input [16:16:47] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... 
[16:16:50] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [16:17:15] RECOVERY - Host an-worker1178 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [16:18:17] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1010.eqiad.wmnet with OS bookworm [16:18:18] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [16:18:42] 06SRE, 06ServiceOps new, 10Wikibase GraphQL, 06Wikibase Reuse Team, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11678196 (10Silvan_WMDE) Thanks for getting back on this. Yes, we need help with applying the rule, please: does the sugges... [16:19:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1234.eqiad.wmnet [16:19:13] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [16:19:15] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1235.eqiad.wmnet [16:19:17] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [16:19:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:19:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11678201 (10ops-monitoring-bot) Host an-worker1235.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[16:19:42] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EMcFarland - https://phabricator.wikimedia.org/T419145 (10EMcFarland-WMF) 03NEW [16:19:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:20:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host contint2003.wikimedia.org with OS bookworm [16:20:02] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:20:11] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11678215 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host contint2... [16:21:18] (03PS1) 10Ladsgroup: Bump ResourceLoaderStorageVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248526 [16:21:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:22:02] (03CR) 10Krinkle: [C:03+1] Bump ResourceLoaderStorageVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248526 (owner: 10Ladsgroup) [16:22:49] (03CR) 10Kosta Harlan: [C:03+1] Bump ResourceLoaderStorageVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248526 (owner: 10Ladsgroup) [16:23:54] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EMcFarland - https://phabricator.wikimedia.org/T419145#11678223 (10DMburugu) I Approve [16:24:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART 
and with Dell SCP reboot policy FORCED [16:24:07] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:27:05] (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: allow setting an explicit neighbors_list [puppet] - 10https://gerrit.wikimedia.org/r/1248504 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [16:27:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:27:42] (03PS1) 10Muehlenhoff: netbox: Add ulsfo02 [puppet] - 10https://gerrit.wikimedia.org/r/1248528 (https://phabricator.wikimedia.org/T418993) [16:29:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248526 (owner: 10Ladsgroup) [16:30:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1235.eqiad.wmnet [16:30:22] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [16:30:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1236.eqiad.wmnet [16:30:25] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [16:30:42] (03Merged) 10jenkins-bot: Bump ResourceLoaderStorageVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248526 (owner: 10Ladsgroup) [16:30:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Unusually high disk errors on the an-worker nodes since upgrading the disks - https://phabricator.wikimedia.org/T415002#11678242 (10ops-monitoring-bot) Host an-worker1236.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebo... 
[16:31:10] !log taavi@deploy2002 Started scap sync-world: Backport for [[gerrit:1248526|Bump ResourceLoaderStorageVersion]] [16:31:11] taavi@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:32:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:32:15] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:33:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:33:14] !log taavi@deploy2002 taavi, ladsgroup: Backport for [[gerrit:1248526|Bump ResourceLoaderStorageVersion]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:33:15] taavi@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:33:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host contint2003.wikimedia.org with OS bookworm [16:33:39] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:33:48] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11678273 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cont... 
[16:33:54] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:59] !log taavi@deploy2002 taavi, ladsgroup: Continuing with sync [16:33:59] taavi@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:34:13] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:34:29] PROBLEM - Host rpki2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:19] PROBLEM - Host netflow2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:10] FIRING: [10x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [16:37:51] !log taavi@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248526|Bump ResourceLoaderStorageVersion]] (duration: 06m 40s) [16:37:52] taavi@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:38:54] FIRING: [5x] JobUnavailable: Reduced availability for job gnmic in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:40:27] RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1002 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [16:41:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1236.eqiad.wmnet [16:41:28] btullis@cumin1003: Failed to log message to wiki. Somebody should check the error logs. [16:43:14] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11678305 (10derenrich) given the recently announced increase in rate limiting i think this ticket is more urgent [16:43:50] (03PS1) 10Kevin Bazira: ml-services: bump llm limitranges to enable embeddings isvc deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248531 (https://phabricator.wikimedia.org/T418976) [16:43:54] FIRING: [7x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:44:09] RESOLVED: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [16:44:09] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [16:44:09] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [16:46:37] RECOVERY - Host ms-fe1013 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:47:37] (03CR) 10Ssingh: [C:03+1] "Looks good, thanks! We should still be careful about the rollout." 
[puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [16:48:05] FIRING: MediaWikiAccountCreationFailures: Elevated MediaWiki account creation failures: 90.19% - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=23 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiAccountCreationFailures [16:48:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: ms-fe1013 reports a backplane error - https://phabricator.wikimedia.org/T419010#11678340 (10VRiley-WMF) Hey @elukey I performed a flea power drain, and rebooted the server. It seems to have come back up cleanly. You should be able to... [16:49:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11678341 (10VRiley-WMF) 05Open→03In progress Swapping this now [16:50:00] !log installing libpng1.6 security updates [16:50:01] moritzm: Failed to log message to wiki. Somebody should check the error logs. [16:50:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:53:39] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [16:53:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... 
[16:53:39] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [16:53:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint2003.wikimedia.org with reason: host reimage [16:53:43] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [16:55:03] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11678360 (10CDanis) > The app involves fetching a large number of apps to present to the user. I'd definitely like to unblock development, but, I don't want to set us up for future trouble he... [16:56:33] PROBLEM - Host wikikube-worker1163 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:10] FIRING: [12x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [16:58:03] RECOVERY - Host wikikube-worker1163 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [16:58:39] RESOLVED: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [16:58:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [16:58:39] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [16:59:11] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... 
[16:59:17] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [16:59:20] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [16:59:24] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:59:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint2003.wikimedia.org with reason: host reimage [16:59:46] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [17:00:05] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:52] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [17:04:24] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [17:04:24] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=eqiad%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... 
[17:04:24] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [17:04:46] 06SRE, 06ServiceOps new, 10Wikibase GraphQL, 06Wikibase Reuse Team, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11678377 (10Blake) p:05Triage→03Medium [17:05:57] !log taavi@cumin1003 dbctl commit (dc=all): 'enable writes', diff saved to https://phabricator.wikimedia.org/P89812 and previous config saved to /var/cache/conftool/dbconfig/20260305-170556-taavi.json [17:07:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11678378 (10VRiley-WMF) 05In progress→03Resolved I have swapped the interface and the cable (just in case) th... [17:08:44] 06SRE, 06ServiceOps new, 10Wikibase GraphQL, 06Wikibase Reuse Team, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11678384 (10Clement_Goubert) Hi :) Depending on exactly what you want this rewrite to do, it may be that an apache rule i... [17:09:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1178.eqiad.wmnet [17:09:11] RESOLVED: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [17:09:17] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=eqiad%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... 
[17:09:20] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [17:10:42] !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for wikikube-worker1162.eqiad.wmnet [17:10:43] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker1162.eqiad.wmnet [17:12:01] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1162.eqiad.wmnet [17:12:03] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1162.eqiad.wmnet [17:12:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T418939#11678400 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 pool fo... [17:12:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: eno1 on wikikube-worker1162:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T418939#11678401 (10Clement_Goubert) Host back in the pool, thanks <3 [17:13:38] RESOLVED: UdpIRCStreamThroughput: irc2003:16667 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/eb101795-c69e-4b9c-b848-f042d604f234/ircstream - https://alerts.wikimedia.org/?q=alertname%3DUdpIRCStreamThroughput [17:13:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:ge-0/0/0 (Core: fmsw-c8-codfw) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:13:56] (03PS1) 10Krinkle: Enable wgUseSiteJs on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 [17:14:34] (03PS2) 10Krinkle: Enable wgUseSiteJs on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 [17:14:45] (03PS3) 10Krinkle: Enable wgUseSiteJs on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 (https://phabricator.wikimedia.org/T419138) [17:14:50] greg-g: ^ [17:15:38] (03PS1) 10Vgutierrez: haproxy: Avoid logging ja3n/ja4h/res_ more than once [puppet] - 10https://gerrit.wikimedia.org/r/1248537 (https://phabricator.wikimedia.org/T419149) [17:16:07] (03CR) 10Greg Grossmeier: [C:03+1] "Yes please!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 (https://phabricator.wikimedia.org/T419138) (owner: 10Krinkle) [17:16:09] (03CR) 10Mszwarc: [C:03+1] Enable wgUseSiteJs on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 (https://phabricator.wikimedia.org/T419138) (owner: 10Krinkle) [17:16:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:16:33] (03CR) 10Giuseppe Lavagetto: [C:03+1] Enable wgUseSiteJs on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 (https://phabricator.wikimedia.org/T419138) (owner: 10Krinkle) [17:16:39] (03CR) 10Volans: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 (https://phabricator.wikimedia.org/T419138) (owner: 10Krinkle) [17:16:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1178.eqiad.wmnet [17:17:06] (03CR) 10BCornwall: [C:03+1] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [17:17:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:17:46] (03PS2) 10Vgutierrez: haproxy: Avoid logging ja3n/ja4h/res_ more than once [puppet] - 10https://gerrit.wikimedia.org/r/1248537 (https://phabricator.wikimedia.org/T419149) [17:17:52] (03CR) 10Kosta Harlan: [C:03+1] Enable wgUseSiteJs on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 (https://phabricator.wikimedia.org/T419138) (owner: 10Krinkle) [17:18:41] (03PS3) 10Vgutierrez: haproxy: Avoid logging ja3n/ja4h/res_ more than once [puppet] - 10https://gerrit.wikimedia.org/r/1248537 (https://phabricator.wikimedia.org/T419149) 
[17:19:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 (https://phabricator.wikimedia.org/T419138) (owner: 10Krinkle) [17:19:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:19:33] jhancock@cumin2002 reimage (PID 3916585) is awaiting input [17:19:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:20:29] (03Merged) 10jenkins-bot: Enable wgUseSiteJs on donatewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248536 (https://phabricator.wikimedia.org/T419138) (owner: 10Krinkle) [17:21:02] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1248536|Enable wgUseSiteJs on donatewiki (T419138)]] [17:21:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:23:06] !log mszwarc@deploy2002 mszwarc, krinkle: Backport for [[gerrit:1248536|Enable wgUseSiteJs on donatewiki (T419138)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:26:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:26:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint2003.wikimedia.org with OS bookworm [17:26:34] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11678438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host contint2... [17:27:03] !log mszwarc@deploy2002 mszwarc, krinkle: Continuing with sync [17:27:04] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11678441 (10Jhancock.wm) @Dzahn this is finally done. all yours! 
[17:27:45] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:28:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:ge-0/0/0 (Core: fmsw-c8-codfw) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:29:08] (03CR) 10CDanis: [C:03+1] haproxy: Avoid logging ja3n/ja4h/res_ more than once [puppet] - 10https://gerrit.wikimedia.org/r/1248537 (https://phabricator.wikimedia.org/T419149) (owner: 10Vgutierrez) [17:30:15] RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [17:30:52] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [17:30:59] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248536|Enable wgUseSiteJs on donatewiki (T419138)]] (duration: 09m 57s) [17:31:07] (03PS1) 10JavierMonton: stream: mw-page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248539 (https://phabricator.wikimedia.org/T360794) [17:31:29] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11678447 (10derenrich) thanks for the reply. 
i think first it would be helpful to know what the current limits are. i've been given conflicting information. not being able to enforce the limit... [17:33:02] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:40:27] (03CR) 10BCornwall: [C:03+1] Revert^2 "dse-k8s: Enable active/active for dse-k8s clusters (1/2)" [dns] - 10https://gerrit.wikimedia.org/r/1248495 (owner: 10Bking) [17:42:24] (03CR) 10Vgutierrez: [C:03+2] haproxy: Avoid logging ja3n/ja4h/res_ more than once [puppet] - 10https://gerrit.wikimedia.org/r/1248537 (https://phabricator.wikimedia.org/T419149) (owner: 10Vgutierrez) [17:43:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:ge-0/0/0 (Core: fmsw-c8-codfw) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:43:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:44:43] (03PS1) 10Vgutierrez: haproxy: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1248541 (https://phabricator.wikimedia.org/T419149) [17:44:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-02-13 - 2026-03-06): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T419000#11678479 (10Jclark-ctr) Looking at the logs after Dell reviewed them, the error occurred back on December 3rd and is not 
related to the current errors sinc... [17:44:54] (03CR) 10BCornwall: [C:03+1] haproxy: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1248541 (https://phabricator.wikimedia.org/T419149) (owner: 10Vgutierrez) [17:46:44] (03CR) 10Vgutierrez: [C:03+2] haproxy: Fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/1248541 (https://phabricator.wikimedia.org/T419149) (owner: 10Vgutierrez) [17:48:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:ge-0/0/0 (Core: fmsw-c8-codfw) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:51:31] (03CR) 10BCornwall: [C:03+1] wmnet: add linked-artifacts CNAME record for k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1247172 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [17:51:35] (03PS1) 10Ssingh: varnish: add CSP policy to VCL for text [puppet] - 10https://gerrit.wikimedia.org/r/1248544 [17:52:34] (03PS2) 10Ssingh: varnish: add CSP policy to VCL for text [puppet] - 10https://gerrit.wikimedia.org/r/1248544 [17:55:56] 10ops-codfw, 06DC-Ops: Inbound errors on interface pfw1-codfw:reth1 () - https://phabricator.wikimedia.org/T419150 (10phaultfinder) 03NEW [17:56:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:58:36] (03CR) 10BCornwall: [V:03+1] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:00:05] bd808: It is that lovely time of the day again! 
You are hereby commanded to deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1800) [18:00:34] * bd808 looks for things to deploy [18:00:56] Nothing for my window this week. [18:01:00] (03CR) 10Volans: [C:04-1] varnish: add CSP policy to VCL for text (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:02:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:03:22] (03PS3) 10Ssingh: varnish: add CSP policy to VCL for text [puppet] - 10https://gerrit.wikimedia.org/r/1248544 [18:04:00] (03CR) 10Krinkle: varnish: add CSP policy to VCL for text (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:04:09] (03CR) 10Krinkle: varnish: add CSP policy to VCL for text (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:04:44] (03CR) 10Ssingh: varnish: add CSP policy to VCL for text (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:06:18] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change frack mgmt vlan interface - pt1979@cumin2002" [18:06:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change frack mgmt vlan interface - pt1979@cumin2002" [18:06:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:07:01] (03CR) 10SBassett: varnish: add CSP policy to VCL for text (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:08:26] (03PS4) 10Ssingh: varnish: add CSP policy to VCL for text [puppet] - 
10https://gerrit.wikimedia.org/r/1248544 [18:11:07] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:13:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:15:08] (03CR) 10BCornwall: [V:03+2 C:03+1] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:16:09] !log sudo cumin "A:cp" "disable-puppet 'rolling out 1248544'" [18:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:34] (03CR) 10Ssingh: [C:03+2] varnish: add CSP policy to VCL for text [puppet] - 10https://gerrit.wikimedia.org/r/1248544 (owner: 10Ssingh) [18:29:17] (03CR) 10Eevans: [C:03+2] wmnet: add linked-artifacts CNAME record for k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1247172 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [18:30:13] !log sudo cumin -b51 "A:cp" "run-puppet-agent --enable 'rolling out 1248544'" [18:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:33] !log eevans@dns1004 START - running authdns-update [18:31:52] !log eevans@dns1004 END - running authdns-update [18:34:04] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11678664 (10Dzahn) @Jhancock.wm Thank you very much! Taking over:) [18:34:22] AntiComposite can i appeal my stewards channel ban? [18:35:56] urbanecm please can i appeal? [18:37:45] Is this the main irc channel I want to appeal? 
[18:43:09] ItsLido You're looking for #wikimedia-ops [18:43:19] Deploying change 1240253 for refinery ( T414478 ), already hotfixed, should be no-op [18:43:20] T414478: Add 'first campaign' and 'first campaign status code' to CentralNotice banner_activity_minutely Turnilo cube and Druid source table - https://phabricator.wikimedia.org/T414478 [18:43:22] im banned there [18:43:50] 10ops-codfw, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: codfw: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418545#11678699 (10Dzahn) 05Open→03Resolved a:03Dzahn Can SSH to the machine. wfm:) Further puppet setup will be... [18:44:22] i am banned there too [18:45:52] (03PS1) 10Herron: mwlog: copy archives to trixie hosts [puppet] - 10https://gerrit.wikimedia.org/r/1248564 (https://phabricator.wikimedia.org/T417002) [18:46:11] Deploying change 1239200 for refinery ( T416481 ) [18:46:12] T416481: Adapt Sqoop for imagelinks schema changes - https://phabricator.wikimedia.org/T416481 [18:47:16] !log dr0ptp4kt@deploy2002 Started deploy [analytics/refinery@dd641b1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@dd641b15] [18:47:42] !log Deploying change 1239200 for refinery ( T416481 ) [18:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:06] ! 
Deploying change 1240253 for refinery ( T414478 ), already hotfixed, should be no-op [18:49:13] !log dr0ptp4kt@deploy2002 Finished deploy [analytics/refinery@dd641b1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@dd641b15] (duration: 01m 57s) [18:50:47] !log dr0ptp4kt@deploy2002 Started deploy [analytics/refinery@dd641b1]: Regular analytics weekly train [analytics/refinery@dd641b15] [18:55:05] !log dr0ptp4kt@deploy2002 Finished deploy [analytics/refinery@dd641b1]: Regular analytics weekly train [analytics/refinery@dd641b15] (duration: 04m 18s) [18:56:29] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11678748 (10CDanis) Apologies for the conflicting information, that's partially my fault. But all the limits we've been discussing so far have been the limits we apply to bot (non-human) traff... [18:56:34] !log dr0ptp4kt@deploy2002 Started deploy [analytics/refinery@dd641b1] (thin): Regular analytics weekly train THIN [analytics/refinery@dd641b15] [18:58:24] (03CR) 10Pmiazga: [C:03+1] rest-gateway: test: check for www-authenticate error [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248398 (https://phabricator.wikimedia.org/T419034) (owner: 10Daniel Kinzler) [18:58:36] !log dr0ptp4kt@deploy2002 Finished deploy [analytics/refinery@dd641b1] (thin): Regular analytics weekly train THIN [analytics/refinery@dd641b15] (duration: 02m 02s) [18:58:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:ge-0/0/0 (Core: fmsw-c8-codfw) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:59:35] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248539 
(https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [19:00:04] jeena and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T1900). [19:03:29] dduvall: jeena: if you could please hold before moving the train, that would be appreciated. there are a couple of config patches incoming that we will need to deploy. [19:03:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - fasw2-c8a-codfw:ge-0/0/0 (Core: fmsw-c8-codfw) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c8a-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:03:54] I was about to ask the same… we had a security incident and I'd like to get one patch in before the train if possible [19:03:56] !log Deployed refinery change 1240253 ( T414478 ), 1240253 (no-op) for refinery ( T414478 ) using scap, then deployed onto hdfs [19:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:01] T414478: Add 'first campaign' and 'first campaign status code' to CentralNotice banner_activity_minutely Turnilo cube and Druid source table - https://phabricator.wikimedia.org/T414478 [19:04:36] !log Deploying change 1239200 for refinery ( T416481 ) using scap, then deployed onto hdfs [19:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:39] T416481: Adapt Sqoop for imagelinks schema changes - https://phabricator.wikimedia.org/T416481 [19:06:18] (03PS1) 10SBassett: Re-enable Site JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248571 (https://phabricator.wikimedia.org/T419137) [19:06:24] swfrench-wmf: got it [19:08:17] (03CR) 10Krinkle: [C:03+1] Re-enable Site JS [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1248571 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [19:08:24] (03CR) 10Scott French: [C:03+1] Re-enable Site JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248571 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [19:09:30] (03CR) 10Ottomata: "LGTM! Good to merge once schema is also merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148322 (owner: 10Abban Dunne) [19:09:37] (03CR) 10Rsilvola: [C:03+1] Re-enable Site JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248571 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [19:10:08] (03PS1) 10Krinkle: Allow toolforge APIs in enforced CSP mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248574 (https://phabricator.wikimedia.org/T419137) [19:13:03] Hey all ^ going to deploy Site JS enablement config patch ^ unless there are any objections... [19:13:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248571 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [19:13:59] (03PS2) 10Krinkle: Allow toolforge APIs in enforced CSP mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248574 (https://phabricator.wikimedia.org/T135963) [19:14:16] (03Merged) 10jenkins-bot: Re-enable Site JS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248571 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [19:14:47] !log sbassett@deploy2002 Started scap sync-world: Backport for [[gerrit:1248571|Re-enable Site JS (T419137 T419138)]] [19:16:32] (please ignore my comment above; no train blockers on my end) [19:16:55] !log sbassett@deploy2002 sbassett: Backport for [[gerrit:1248571|Re-enable Site JS (T419137 T419138)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:17:02] okay thanks musikanimal [19:17:48] !log sbassett@deploy2002 sbassett: Continuing with sync [19:21:44] !log sbassett@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248571|Re-enable Site JS (T419137 T419138)]] (duration: 06m 57s) [19:22:10] FIRING: [12x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [19:23:33] sbassett: swfrench-wmf lmk when it's fine to deploy [19:27:10] FIRING: [12x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [19:29:15] (03CR) 10Scott French: [C:03+1] Allow toolforge APIs in enforced CSP mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248574 (https://phabricator.wikimedia.org/T135963) (owner: 10Krinkle) [19:31:33] (03CR) 10SBassett: [C:03+1] Allow toolforge APIs in enforced CSP mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248574 (https://phabricator.wikimedia.org/T135963) (owner: 10Krinkle) [19:31:50] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11678861 (10HSwan-WMF) Hey Chris, Perhaps we can talk live about this. I'm concerned about you mentioning that there will be no version of the limits that can facilitate an acceptable UX. I t... 
[19:35:05] (03CR) 10Majavah: "fwiw since I98146245206d82d1889648c3754441d36216f84a this is set in varnish config (puppet) to bypass caches, so this would at the moment " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248574 (https://phabricator.wikimedia.org/T135963) (owner: 10Krinkle) [19:37:10] FIRING: [12x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [19:38:42] (03CR) 10Thcipriani: [C:03+1] create role skeleton for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1248082 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:39:13] FIRING: CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:40:38] (03CR) 10Krinkle: "Ack, I'm using it to have it be formatted and verified via WikimediaDebug (under non-standard URLs), and then SRE is propagating it from t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248574 (https://phabricator.wikimedia.org/T135963) (owner: 10Krinkle) [19:41:15] jeena: one more config patch incoming, but I believe that will be all we have planned for the moment. I'll keep you posted. [19:41:39] 👍 thanks for the update! 
[19:45:05] (03PS1) 10BCornwall: varnish: Set CSP via vmod_var [puppet] - 10https://gerrit.wikimedia.org/r/1248581 [19:45:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:47:56] (03PS2) 10BCornwall: varnish: Set CSP via vmod_var [puppet] - 10https://gerrit.wikimedia.org/r/1248581 [19:48:50] (03CR) 10Dzahn: create role skeleton for jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248082 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:50:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:25] (03CR) 10Ssingh: "Good idea, one comment:" [puppet] - 10https://gerrit.wikimedia.org/r/1248581 (owner: 10BCornwall) [19:51:00] (03PS3) 10BCornwall: varnish: Set CSP via vmod_var [puppet] - 10https://gerrit.wikimedia.org/r/1248581 [19:51:18] (03CR) 10BCornwall: varnish: Set CSP via vmod_var (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248581 (owner: 10BCornwall) [19:53:08] (03CR) 10Ssingh: [C:03+1] "Thank you, nice and clean." 
[puppet] - 10https://gerrit.wikimedia.org/r/1248581 (owner: 10BCornwall) [19:53:21] (03CR) 10Ssingh: [C:03+1] "(I am assuming we have run the VTC tests)" [puppet] - 10https://gerrit.wikimedia.org/r/1248581 (owner: 10BCornwall) [19:54:11] !issync [19:54:46] (03CR) 10BCornwall: [V:03+2] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1248581 (owner: 10BCornwall) [19:55:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248574 (https://phabricator.wikimedia.org/T135963) (owner: 10Krinkle) [19:56:17] (03Merged) 10jenkins-bot: Allow toolforge APIs in enforced CSP mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248574 (https://phabricator.wikimedia.org/T135963) (owner: 10Krinkle) [19:56:22] !issync [19:56:23] Syncing #wikimedia-operations (requested by JJMC89) [19:56:24] No updates for #wikimedia-operations [19:56:49] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1248574|Allow toolforge APIs in enforced CSP mode (T135963 T419137 T220475)]] [19:56:55] T135963: Add support for Content-Security-Policy (CSP) headers in MediaWiki - https://phabricator.wikimedia.org/T135963 [19:56:55] T220475: XTools' ArticleInfo gadget will be blocked by CSP - https://phabricator.wikimedia.org/T220475 [19:58:18] (03CR) 10Ryan Kemper: [C:03+1] wdqs: remove query-legay-full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247947 (https://phabricator.wikimedia.org/T415073) (owner: 10Gehel) [19:58:55] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1248574|Allow toolforge APIs in enforced CSP mode (T135963 T419137 T220475)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:00:02] (03PS1) 10Krinkle: varnish: Sync CSP rule with MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/1248588 (https://phabricator.wikimedia.org/T419137) [20:00:04] !log krinkle@deploy2002 krinkle: Continuing with sync [20:00:16] (03CR) 10Ssingh: [C:03+2] varnish: Set CSP via vmod_var [puppet] - 10https://gerrit.wikimedia.org/r/1248581 (owner: 10BCornwall) [20:03:45] (03PS1) 10Ssingh: Revert "varnish: Set CSP via vmod_var" [puppet] - 10https://gerrit.wikimedia.org/r/1248591 [20:04:27] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248574|Allow toolforge APIs in enforced CSP mode (T135963 T419137 T220475)]] (duration: 07m 37s) [20:04:32] T135963: Add support for Content-Security-Policy (CSP) headers in MediaWiki - https://phabricator.wikimedia.org/T135963 [20:04:33] T220475: XTools' ArticleInfo gadget will be blocked by CSP - https://phabricator.wikimedia.org/T220475 [20:04:53] (03CR) 10Ssingh: [C:03+2] Revert "varnish: Set CSP via vmod_var" [puppet] - 10https://gerrit.wikimedia.org/r/1248591 (owner: 10Ssingh) [20:05:23] (03PS2) 10Krinkle: varnish: Sync CSP rule with MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/1248588 (https://phabricator.wikimedia.org/T419137) [20:06:28] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1248588 (https://phabricator.wikimedia.org/T419137) (owner: 10Krinkle) [20:07:06] (03CR) 10Ssingh: [C:03+2] varnish: Sync CSP rule with MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/1248588 (https://phabricator.wikimedia.org/T419137) (owner: 10Krinkle) [20:07:25] (03PS1) 10Ayounsi: ulsfo route ganeti use core routers interface IPs [puppet] - 10https://gerrit.wikimedia.org/r/1248592 (https://phabricator.wikimedia.org/T418993) [20:07:41] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248592 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [20:08:42] jeena: I believe we're done with config patches for now, so you should be good 
to move the train forward. [20:08:47] thank you [20:09:33] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248593 (https://phabricator.wikimedia.org/T413809) [20:09:35] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248593 (https://phabricator.wikimedia.org/T413809) (owner: 10TrainBranchBot) [20:10:24] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248593 (https://phabricator.wikimedia.org/T413809) (owner: 10TrainBranchBot) [20:14:36] (03CR) 10Ayounsi: [V:03+1] "ganeti4005 currently have puppet disabled with those values manually configured, and it's working fine." [puppet] - 10https://gerrit.wikimedia.org/r/1248592 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [20:15:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:14] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.18 refs T413809 [20:16:18] T413809: 1.46.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T413809 [20:20:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:48] (03PS1) 10Herron: mwlog: copy archives to trixie hosts [puppet] - 10https://gerrit.wikimedia.org/r/1248564 (https://phabricator.wikimedia.org/T417002) [20:28:05] !log apt built and imported jwt-authorizer 1.3.0-1 [20:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:13] (03PS1) 10CDanis: docker-registry: 
jwt-authorizer: fix ws issues [puppet] - 10https://gerrit.wikimedia.org/r/1248598 [20:32:22] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248598 (owner: 10CDanis) [20:34:26] (03CR) 10CDanis: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1248598/5991/registry1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1248598 (owner: 10CDanis) [20:34:35] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1248599 [20:36:03] (03PS1) 10JHathaway: wikipedia.org: add a dkim key for VRTS email [dns] - 10https://gerrit.wikimedia.org/r/1248601 (https://phabricator.wikimedia.org/T418700) [20:40:53] (03PS1) 10JHathaway: postfix: DKIM sign VRTS wikipedia.org email [puppet] - 10https://gerrit.wikimedia.org/r/1248603 (https://phabricator.wikimedia.org/T418700) [20:42:12] (03PS2) 10JHathaway: wikipedia.org: add a DKIM key for VRTS email [dns] - 10https://gerrit.wikimedia.org/r/1248601 (https://phabricator.wikimedia.org/T418700) [20:43:54] FIRING: [5x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:45:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:47:06] (03CR) 10Majavah: [C:03+1] docker-registry: jwt-authorizer: fix ws issues [puppet] - 10https://gerrit.wikimedia.org/r/1248598 (owner: 10CDanis) [20:47:29] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox [20:47:32] (03PS2) 10CDanis: docker-registry: jwt-authorizer: fix ws issues [puppet] - 10https://gerrit.wikimedia.org/r/1248598 [20:48:00] musikanimal: I dunno if there was another patch you wanted to deploy but
I've finished deploying the train [20:48:54] thanks for letting me know :) [20:49:59] (03CR) 10Ayounsi: [C:03+1] netbox: Add ulsfo02 [puppet] - 10https://gerrit.wikimedia.org/r/1248528 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [20:50:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:10] (03CR) 10CDanis: [C:03+2] docker-registry: jwt-authorizer: fix ws issues [puppet] - 10https://gerrit.wikimedia.org/r/1248598 (owner: 10CDanis) [20:52:24] (03PS1) 10Bking: WIP: Add new active-active discovery service for dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1248605 (https://phabricator.wikimedia.org/T417698) [20:52:52] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new service IPs for sophroid - jasmine@cumin2002" [20:52:58] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new service IPs for sophroid - jasmine@cumin2002" [20:52:58] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:53:04] (03CR) 10BCornwall: [C:03+1] wikipedia.org: add a DKIM key for VRTS email (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1248601 (https://phabricator.wikimedia.org/T418700) (owner: 10JHathaway) [20:57:26] (03PS3) 10JHathaway: wikipedia.org: add a DKIM key for VRTS email [dns] - 10https://gerrit.wikimedia.org/r/1248601 (https://phabricator.wikimedia.org/T418700) [20:58:03] (03CR) 10JHathaway: wikipedia.org: add a DKIM key for VRTS email (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1248601 (https://phabricator.wikimedia.org/T418700) (owner: 10JHathaway) [20:59:14] FIRING: [4x] CoreBGPDown: Core 
BGP session down between asw1-b3-magru and cr2-magru (195.200.68.146) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T2100). [21:00:05] cscott and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:57] \o [21:02:21] (03CR) 10JHathaway: [C:03+2] wikipedia.org: add a DKIM key for VRTS email [dns] - 10https://gerrit.wikimedia.org/r/1248601 (https://phabricator.wikimedia.org/T418700) (owner: 10JHathaway) [21:02:42] !log jhathaway@dns1004 START - running authdns-update [21:02:45] I am around if a deployer is needed [21:04:01] !log jhathaway@dns1004 END - running authdns-update [21:04:58] (03PS2) 10JHathaway: postfix: DKIM sign VRTS wikipedia.org email [puppet] - 10https://gerrit.wikimedia.org/r/1248603 (https://phabricator.wikimedia.org/T418700) [21:05:08] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248603 (https://phabricator.wikimedia.org/T418700) (owner: 10JHathaway) [21:05:08] i can deploy mine, was waiting to see if cscott wants to go first, i'm not in a big rush [21:05:17] 👍 [21:05:55] sadly mine depends on something that helm is reporting is in a failed state, so i can't quite test mine yet... 
[21:07:33] I can offer you my condolences lol [21:07:50] yea :) [21:11:23] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1248603 (https://phabricator.wikimedia.org/T418700) (owner: 10JHathaway) [21:15:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248508 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:16:55] (03PS1) 10Jasmine: conftool: add sophroid etcd data [puppet] - 10https://gerrit.wikimedia.org/r/1248611 (https://phabricator.wikimedia.org/T418748) [21:20:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:46] 06SRE, 10Hiddenparma: FY25/26 WE4.3.2: support JA4H - https://phabricator.wikimedia.org/T406990#11679248 (10CDanis) 05Open→03Resolved [21:24:38] (03PS1) 10Jasmine: wmnet: Add sophroid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) [21:25:24] (03CR) 10JHathaway: [C:03+2] postfix: DKIM sign VRTS wikipedia.org email [puppet] - 10https://gerrit.wikimedia.org/r/1248603 (https://phabricator.wikimedia.org/T418700) (owner: 10JHathaway) [21:31:26] 06SRE: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166 (10CDanis) 03NEW [21:31:52] 06SRE: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166#11679301 (10CDanis) [21:32:22] Hi, when will the change that re-enables personal JS be deployed? 
[21:32:36] ebernhardson [21:32:51] Neriah: i don't have any official information, but i saw someone mention it might be as long as early next week [21:33:34] but that's at best a random guess, i'm not involved in that issue [21:33:51] (03PS1) 10Krinkle: varnish: Sync CSP rule with MediaWiki (again) [puppet] - 10https://gerrit.wikimedia.org/r/1248620 (https://phabricator.wikimedia.org/T419137) [21:34:45] (03PS2) 10Ebernhardson: cirrus: Align semanticsearch cluster group name with routing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248508 (https://phabricator.wikimedia.org/T413969) [21:34:58] (03CR) 10TrainBranchBot: "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248508 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:35:47] (03Merged) 10jenkins-bot: cirrus: Align semanticsearch cluster group name with routing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248508 (https://phabricator.wikimedia.org/T413969) (owner: 10Ebernhardson) [21:35:52] (03CR) 10Krinkle: varnish: Sync CSP rule with MediaWiki (again) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248620 (https://phabricator.wikimedia.org/T419137) (owner: 10Krinkle) [21:36:05] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1248508|cirrus: Align semanticsearch cluster group name with routing (T413969)]] [21:36:08] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [21:37:59] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1248508|cirrus: Align semanticsearch cluster group name with routing (T413969)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:38:27] (03CR) 10Scott French: [C:03+1] "Confirmed this matches the value being served upstream." 
[puppet] - 10https://gerrit.wikimedia.org/r/1248620 (https://phabricator.wikimedia.org/T419137) (owner: 10Krinkle) [21:39:22] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [21:40:55] 06SRE: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166#11679363 (10CDanis) [21:43:25] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248508|cirrus: Align semanticsearch cluster group name with routing (T413969)]] (duration: 07m 20s) [21:43:28] T413969: Make semantic search accessible through Action API - https://phabricator.wikimedia.org/T413969 [21:43:43] cscott: deploy is available now, can help ship if you need [21:43:54] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:47:11] (03CR) 10SBassett: [C:03+1] "Fine to deploy now I think, we may go with something a little more generous in the very near future." 
[puppet] - 10https://gerrit.wikimedia.org/r/1248620 (https://phabricator.wikimedia.org/T419137) (owner: 10Krinkle) [21:49:55] (03PS3) 10Herron: grafana: limit api methods on public vhost [puppet] - 10https://gerrit.wikimedia.org/r/1248599 (https://phabricator.wikimedia.org/T418671) [21:53:02] (03CR) 10Scott French: [C:03+2] varnish: Sync CSP rule with MediaWiki (again) [puppet] - 10https://gerrit.wikimedia.org/r/1248620 (https://phabricator.wikimedia.org/T419137) (owner: 10Krinkle) [21:53:55] (03CR) 10Herron: [C:03+2] grafana: limit api methods on public vhost [puppet] - 10https://gerrit.wikimedia.org/r/1248599 (https://phabricator.wikimedia.org/T418671) (owner: 10Herron) [21:54:02] (03PS1) 10Bking: WIP: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) [21:55:12] (03CR) 10CI reject: [V:04-1] WIP: Add new active-active discovery records for dse-k8s [dns] - 10https://gerrit.wikimedia.org/r/1248625 (https://phabricator.wikimedia.org/T417698) (owner: 10Bking) [21:56:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T2200) [22:02:17] (03PS1) 10Cwhite: grafana: remove access to swagger endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1248626 (https://phabricator.wikimedia.org/T418671) [22:06:07] jouncebot: nowandnext [22:06:07] For the next 0 hour(s) and 53 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260305T2200) [22:06:07] In 8 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260306T0700) 
[22:06:24] (03CR) 10Herron: [C:03+1] grafana: remove access to swagger endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1248626 (https://phabricator.wikimedia.org/T418671) (owner: 10Cwhite) [22:06:46] (03CR) 10Andrea Denisse: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1248626 (https://phabricator.wikimedia.org/T418671) (owner: 10Cwhite) [22:07:11] (03CR) 10Zabe: [C:03+2] SpecialWantedFiles: Use lt_title instead of lt_to [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248483 (https://phabricator.wikimedia.org/T299953) (owner: 10Zabe) [22:07:16] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1248108 (https://phabricator.wikimedia.org/T418914) (owner: 10Scott French) [22:07:31] (03CR) 10Scott French: [C:03+2] Add new conf200[789] hosts [puppet] - 10https://gerrit.wikimedia.org/r/1248108 (https://phabricator.wikimedia.org/T418914) (owner: 10Scott French) [22:09:38] preparing to do a security deploy [22:09:46] is anyone deploying anything right now? [22:10:58] maryum: I +2'ed a backport, but it will take another ten minutes for CI to finish. So if you are fast, you can sneak in. :) [22:11:22] zabe: thank you! I'll see if I can get in there [22:12:06] (03CR) 10Cwhite: [C:03+2] grafana: remove access to swagger endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1248626 (https://phabricator.wikimedia.org/T418671) (owner: 10Cwhite) [22:13:51] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11679584 (10Scott_French) a:05Scott_French→03None New hosts have been added to site.pp and preseed.yaml in https://gerrit.wikimedia.org/r/1248108. Thanks, folks! 
[22:13:54] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:20:47] 06SRE, 07Security: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166#11679620 (10taavi) [22:21:23] (03Merged) 10jenkins-bot: SpecialWantedFiles: Use lt_title instead of lt_to [core] (wmf/1.46.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1248483 (https://phabricator.wikimedia.org/T299953) (owner: 10Zabe) [22:24:24] (03PS2) 10Zabe: Using Hadoop for MostTranscludedPages on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238028 (https://phabricator.wikimedia.org/T416927) [22:24:30] 06SRE, 07Security: Audit production for systemd parse warnings - https://phabricator.wikimedia.org/T419166#11679631 (10Volans) Sigh... it looks likes we're mostly ok, apart the ones fixed by your patch we have some ganeti host complaining about `ConditionPathExists` (cc @MoritzMuehlenhoff ) `lang=shell $ sud... 
[22:28:49] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1248483|SpecialWantedFiles: Use lt_title instead of lt_to (T299953)]] [22:28:52] T299953: Normalize imagelinks table - https://phabricator.wikimedia.org/T299953 [22:29:14] (03PS1) 10Kosta Harlan: Re-enable AllowUserJs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) [22:29:32] (03CR) 10Kosta Harlan: [C:04-2] "Wait until other mitigations are in place" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [22:29:35] (03CR) 10JHathaway: [C:03+1] dist-upgrade: Remove support for Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248463 (owner: 10Muehlenhoff) [22:30:37] !log zabe@deploy2002 zabe: Backport for [[gerrit:1248483|SpecialWantedFiles: Use lt_title instead of lt_to (T299953)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:30:53] zabe: didn't make it, I'll wait [22:31:02] !log zabe@deploy2002 zabe: Continuing with sync [22:35:01] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248483|SpecialWantedFiles: Use lt_title instead of lt_to (T299953)]] (duration: 06m 12s) [22:35:04] T299953: Normalize imagelinks table - https://phabricator.wikimedia.org/T299953 [22:35:22] (03PS4) 10Scott French: envoy: Support using envoy-drain-tool [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) [22:35:23] maryum: alright, feel free to do it now [22:35:30] zabe: thanks! [22:38:18] I want to deploy a private settings change [22:38:27] can I have 5 mins plz? [22:38:45] Did you just start scap? [22:38:50] I just committed the changes [22:38:55] yes [22:39:52] I've undone the changes temporarily. Can you ping when done? 
[22:39:55] yes will do [22:43:59] (03CR) 10Scott French: "I've applied some additional tweaks to drain-envoy.sh:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1247185 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [22:45:16] Dreamy_Jazz: scap finished [22:45:30] Thanks. Starting now [22:45:48] !log Deployed security fix for T418254 [22:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:32] (03PS1) 10SBassett: varnish: Introduce updated enforcing CSP with broad domain support [puppet] - 10https://gerrit.wikimedia.org/r/1248630 (https://phabricator.wikimedia.org/T419137) [22:47:41] (03CR) 10SBassett: [C:03+1] "(for when we're ready to deploy)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [22:48:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11679677 (10Jgreen) [22:49:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11679680 (10Jgreen) >>! In T418928#11670501, @RobH wrote: > Jeff, > > I made assumptions on this since we didn't have racking details on the parent ordering task. Please doubl... 
[22:58:06] I've finished my private settings deploy [23:06:09] dzahn@cumin2002 reimage (PID 4003139) is awaiting input [23:06:18] 06SRE, 10Wikimedia-Mailing-lists: X-spam-score header missing on obvious spam delivered to wikitech-l - https://phabricator.wikimedia.org/T386559#11679719 (10bd808) [23:06:29] (03CR) 10Zabe: Using Hadoop for MostTranscludedPages on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238028 (https://phabricator.wikimedia.org/T416927) (owner: 10Zabe) [23:07:05] (03CR) 10Zabe: [C:03+2] Using Hadoop for MostTranscludedPages on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238028 (https://phabricator.wikimedia.org/T416927) (owner: 10Zabe) [23:08:07] (03Merged) 10jenkins-bot: Using Hadoop for MostTranscludedPages on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238028 (https://phabricator.wikimedia.org/T416927) (owner: 10Zabe) [23:08:32] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11679722 (10VRiley-WMF) So, after looking for that specific server, it doesn't seem to be here and netbox may not re... [23:08:33] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1238028|Using Hadoop for MostTranscludedPages on commonswiki (T416927)]] [23:08:36] T416927: Move MostTranscludedPages computation to Hadoop for commonswiki - https://phabricator.wikimedia.org/T416927 [23:09:19] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint2003.wikimedia.org with OS trixie [23:10:25] !log zabe@deploy2002 zabe: Backport for [[gerrit:1238028|Using Hadoop for MostTranscludedPages on commonswiki (T416927)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:11:04] !log zabe@deploy2002 zabe: Continuing with sync [23:12:12] (03CR) 10Kosta Harlan: Re-enable AllowUserJs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [23:15:00] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1238028|Using Hadoop for MostTranscludedPages on commonswiki (T416927)]] (duration: 06m 27s) [23:15:03] T416927: Move MostTranscludedPages computation to Hadoop for commonswiki - https://phabricator.wikimedia.org/T416927 [23:15:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:10] (03CR) 10Dreamy Jazz: [C:03+1] Re-enable AllowUserJs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [23:16:56] We'll probably want to re-enable AllowUserJs shortly [23:17:04] (03CR) 10Kosta Harlan: varnish: Introduce updated enforcing CSP with broad domain support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248630 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [23:17:08] If someone wants to deploy, could they ping first to check? 
[23:17:34] (03PS2) 10SBassett: varnish: Introduce updated enforcing CSP with broad domain support [puppet] - 10https://gerrit.wikimedia.org/r/1248630 (https://phabricator.wikimedia.org/T419137) [23:18:40] 06SRE, 10Wikimedia-Mailing-lists: X-spam-score header missing on obvious spam delivered to wikitech-l - https://phabricator.wikimedia.org/T386559#11679762 (10bd808) [23:18:41] 06SRE, 10Wikimedia-Mailing-lists: Spam filtering rules for mediawiki-api@lists.wikimedia.org failing - https://phabricator.wikimedia.org/T418028#11679764 (10bd808) →14Duplicate dup:03T386559 [23:19:03] 06SRE, 10Wikimedia-Mailing-lists: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11679778 (10bd808) [23:19:31] (03CR) 10Kosta Harlan: varnish: Introduce updated enforcing CSP with broad domain support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248630 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [23:19:58] 10ops-eqiad, 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, and 2 others: eqiad: request for a decom'ed R440 - Config C - https://phabricator.wikimedia.org/T418544#11679780 (10Dzahn) @VRiley-WMF Thank you! Yea, that works too, provided you can bump RAM and disk. Sounds good. 
[23:20:25] FIRING: [4x] SystemdUnitFailed: user@100982.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:21:20] 06SRE, 06Infrastructure-Foundations, 10Wikimedia-Mailing-lists: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists - https://phabricator.wikimedia.org/T386559#11679792 (10bd808) [23:29:51] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint2003.wikimedia.org with reason: host reimage [23:33:17] (03PS2) 10Dzahn: create role skeleton for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1248082 (https://phabricator.wikimedia.org/T418521) [23:33:37] (03CR) 10CDanis: [C:03+1] varnish: Introduce updated enforcing CSP with broad domain support [puppet] - 10https://gerrit.wikimedia.org/r/1248630 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [23:33:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint2003.wikimedia.org with reason: host reimage [23:34:09] (03CR) 10Dzahn: create role skeleton for jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1248082 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [23:34:27] (03CR) 10Scott French: [C:03+2] varnish: Introduce updated enforcing CSP with broad domain support [puppet] - 10https://gerrit.wikimedia.org/r/1248630 (https://phabricator.wikimedia.org/T419137) (owner: 10SBassett) [23:35:05] (03CR) 10Dzahn: [C:03+2] create role skeleton for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1248082 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [23:37:25] FIRING: [8x] GanetiBGPDown: BGP session down between ganeti2033 and lsw1-b7-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [23:39:13] FIRING: 
CertAlmostExpired: Certificate for service lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-e8-eqiad.mgmt.eqiad.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:39:48] (03PS1) 10Dzahn: site: apply jenkins stub role on contint2003 [puppet] - 10https://gerrit.wikimedia.org/r/1248635 (https://phabricator.wikimedia.org/T418521) [23:41:35] (03PS1) 10Catrope: CSP: Update false positives list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248636 [23:42:32] (03CR) 10SBassett: [C:03+1] CSP: Update false positives list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248636 (owner: 10Catrope) [23:42:50] (03CR) 10CDanis: [C:03+1] CSP: Update false positives list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248636 (owner: 10Catrope) [23:43:27] (03CR) 10Dreamy Jazz: [C:03+1] CSP: Update false positives list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248636 (owner: 10Catrope) [23:43:45] (03CR) 10Scott French: [C:03+1] CSP: Update false positives list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248636 (owner: 10Catrope) [23:44:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248636 (owner: 10Catrope) [23:45:13] (03Merged) 10jenkins-bot: CSP: Update false positives list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248636 (owner: 10Catrope) [23:45:30] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1248636|CSP: Update false positives list]] [23:47:20] !log catrope@deploy2002 catrope: Backport for [[gerrit:1248636|CSP: Update false positives list]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:47:54] !log catrope@deploy2002 catrope: Continuing with sync [23:49:24] (03CR) 10CDanis: [C:03+1] Re-enable AllowUserJs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [23:50:01] (03CR) 10Scott French: [C:03+1] Re-enable AllowUserJs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [23:51:16] (03CR) 10Krinkle: [C:03+1] Re-enable AllowUserJs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [23:52:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint2003.wikimedia.org with OS trixie [23:52:05] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248636|CSP: Update false positives list]] (duration: 06m 34s) [23:55:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [23:56:35] (03Merged) 10jenkins-bot: Re-enable AllowUserJs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248628 (https://phabricator.wikimedia.org/T419137) (owner: 10Kosta Harlan) [23:56:55] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1248628|Re-enable AllowUserJs (T419137)]] [23:58:46] !log catrope@deploy2002 catrope, kharlan: Backport for [[gerrit:1248628|Re-enable AllowUserJs (T419137)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.