[00:08:09] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4049.ulsfo.wmnet [00:08:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197366 [00:08:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197366 (owner: 10TrainBranchBot) [00:33:30] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-ulsfo and not P{cp4037*} and A:cp [00:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:00:44] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11292131 (10Krinkle) [01:07:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.24 [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197370 (https://phabricator.wikimedia.org/T405680) [01:08:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.24 [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197370 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [01:09:10] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197366 (owner: 10TrainBranchBot) [01:24:30] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.24 [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197370 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [01:29:12] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:58] (03PS3) 10Krinkle: varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug [puppet] - 10https://gerrit.wikimedia.org/r/1197343 (https://phabricator.wikimedia.org/T405931) [01:34:58] (03PS2) 10Krinkle: varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) [01:34:58] (03PS1) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197372 (https://phabricator.wikimedia.org/T405931) [01:35:27] (03CR) 10RLazarus: [C:03+1] shellbox: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196771 (owner: 10Scott French) [01:35:56] (03CR) 10CI reject: [V:04-1] varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197372 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [01:36:11] (03PS3) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197341 (https://phabricator.wikimedia.org/T405931) [01:36:11] (03PS4) 10Krinkle: varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug [puppet] - 10https://gerrit.wikimedia.org/r/1197343 (https://phabricator.wikimedia.org/T405931) [01:36:11] (03PS3) 10Krinkle: varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) [01:36:26] (03PS2) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197372 [01:36:39] (03Abandoned) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197372 (owner: 10Krinkle) [01:39:15] FIRING: SwitchCoreInterfaceDown: Switch core 
interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0200) [02:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:34:05] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [02:38:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [02:39:27] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 1.772 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0300) [03:02:08] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197374 (https://phabricator.wikimedia.org/T405680) [03:02:11] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197374 
(https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [03:02:59] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197374 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [03:03:44] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.45.0-wmf.24 refs T405680 [03:03:48] T405680: 1.45.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T405680 [03:09:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:13:17] (03PS4) 10Krinkle: varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) [03:43:35] (03PS5) 10Krinkle: varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) [03:48:05] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:48:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:50:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [03:54:12] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:58:05] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:59:12] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0400) [04:01:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:10:46] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.45.0-wmf.24 refs T405680 (duration: 67m 03s) [04:10:51] T405680: 1.45.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T405680 [04:17:20] 06SRE, 06Traffic, 06MediaWiki-Platform-Team (Radar): Have CDN edge set the `X-Request-Id` header for incoming 
external requests - https://phabricator.wikimedia.org/T221976#11292290 (10tstarling) {T407826} may be related. [04:20:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:23:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:41:41] fceratto@cumin1003 clone_es (PID 1381498) is awaiting input [04:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:59:12] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:01:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:29:15] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:39:21] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:55:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P84133 and previous config saved to /var/cache/conftool/dbconfig/20251021-055543-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0600). 
[06:04:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:09:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 8.613 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:10:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P84134 and previous config saved to /var/cache/conftool/dbconfig/20251021-061049-root.json [06:13:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:14:29] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.520 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:16:01] (03PS1) 10Marostegui: db1232: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197386 (https://phabricator.wikimedia.org/T407463) [06:16:43] (03CR) 10Marostegui: [C:03+2] db1232: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197386 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [06:17:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1232.eqiad.wmnet with reason: Maintenance [06:17:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1232 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84135 and previous config saved to /var/cache/conftool/dbconfig/20251021-061748-marostegui.json [06:19:14] (03PS1) 10Marostegui: instances.yaml: Add sretest2003 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197387 
(https://phabricator.wikimedia.org/T407352) [06:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:20:06] (03PS2) 10Marostegui: instances.yaml: Add sretest2003 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197387 (https://phabricator.wikimedia.org/T407352) [06:21:06] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add sretest2003 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197387 (https://phabricator.wikimedia.org/T407352) (owner: 10Marostegui) [06:29:15] (03PS1) 10Marostegui: es1028: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197389 (https://phabricator.wikimedia.org/T407720) [06:29:56] (03CR) 10Marostegui: [C:03+2] es1028: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197389 (https://phabricator.wikimedia.org/T407720) (owner: 10Marostegui) [06:31:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1028 from dbctl T407720', diff saved to https://phabricator.wikimedia.org/P84136 and previous config saved to /var/cache/conftool/dbconfig/20251021-063134-marostegui.json [06:31:39] T407720: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720 [06:31:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1232 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84137 and previous config saved to /var/cache/conftool/dbconfig/20251021-063142-root.json [06:31:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 7%: Repooling', diff saved to https://phabricator.wikimedia.org/P84138 and previous config saved to /var/cache/conftool/dbconfig/20251021-063143-root.json [06:32:38] !log Add sretest2003 to dbctl depooled T407352 [06:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:41] T407352: Test config H 1P in 
external store - https://phabricator.wikimedia.org/T407352 [06:33:50] (03PS1) 10Marostegui: mariadb: Remove es1028 [puppet] - 10https://gerrit.wikimedia.org/r/1197390 (https://phabricator.wikimedia.org/T407720) [06:34:08] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1028.eqiad.wmnet [06:34:47] (03CR) 10Marostegui: [C:03+2] mariadb: Remove es1028 [puppet] - 10https://gerrit.wikimedia.org/r/1197390 (https://phabricator.wikimedia.org/T407720) (owner: 10Marostegui) [06:39:50] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [06:44:03] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [06:44:04] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts es1028.eqiad.wmnet [06:44:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1028.eqiad.wmnet [06:44:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [06:46:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [06:46:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1232 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84139 and previous config saved to /var/cache/conftool/dbconfig/20251021-064648-root.json [06:46:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P84140 and previous config saved to /var/cache/conftool/dbconfig/20251021-064649-root.json 
[06:48:58] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [06:50:22] 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#11292475 (10Joe) 05Open→03Resolved [06:52:34] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1028.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:53:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1028.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:53:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:53:52] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts es1028.eqiad.wmnet [06:54:19] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720#11292482 (10Marostegui) a:05Marostegui→03None [06:55:39] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720#11292489 (10Marostegui) This is ready for DC-Ops. The first failure was due to some connection glitches I had so I wasn't able to reply to the question three times and hence t... [07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0700) [07:00:05] edsanders and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:12] o/ [07:00:33] o/ [07:01:00] I can deploy [07:01:00] I can self deploy [07:01:23] edsanders: sure, please go ahead :) [07:01:50] thanks [07:01:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1232 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84141 and previous config saved to /var/cache/conftool/dbconfig/20251021-070154-root.json [07:01:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P84142 and previous config saved to /var/cache/conftool/dbconfig/20251021-070155-root.json [07:02:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [07:07:37] (03PS1) 10Marostegui: db2246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197470 (https://phabricator.wikimedia.org/T406551) [07:08:09] (03CR) 10Marostegui: [C:03+2] db2246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197470 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [07:10:05] (03Merged) 10jenkins-bot: Follow-up Iedb6361: Set insert-ignore on all insertSelect queries [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [07:10:20] (03PS1) 10Marostegui: instances.yaml: Add db2246 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197530 (https://phabricator.wikimedia.org/T406551) [07:10:54] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1197284|Follow-up Iedb6361: Set insert-ignore on all insertSelect queries (T407357)]] [07:10:59] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [07:11:02] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2246 to dbctl [puppet] 
- 10https://gerrit.wikimedia.org/r/1197530 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [07:13:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:14:12] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:15:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2246 depooled T406551', diff saved to https://phabricator.wikimedia.org/P84143 and previous config saved to /var/cache/conftool/dbconfig/20251021-071503-marostegui.json [07:15:09] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [07:15:50] !log esanders@deploy2002 esanders: Backport for [[gerrit:1197284|Follow-up Iedb6361: Set insert-ignore on all insertSelect queries (T407357)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:16:20] !log esanders@deploy2002 esanders: Continuing with sync [07:16:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 1%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84144 and previous config saved to /var/cache/conftool/dbconfig/20251021-071632-root.json [07:17:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1232 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84145 and previous config saved to /var/cache/conftool/dbconfig/20251021-071700-root.json [07:17:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P84146 and previous config saved to /var/cache/conftool/dbconfig/20251021-071701-root.json [07:22:39] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197284|Follow-up Iedb6361: Set insert-ignore on all insertSelect queries (T407357)]] (duration: 11m 45s) [07:22:43] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [07:23:59] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-operator: watch the growthbook namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197272 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:24:01] (03CR) 10Brouberol: [C:03+2] Deploy a postgresql-growthbook cluster in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197273 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:24:51] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2033 gradually with 4 steps - Pool es2033.codfw.wmnet in after cloning [07:25:47] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:25:56] dcausse: all done [07:26:07] edsanders: thanks, shipping mine [07:27:52] 07sre-alert-triage, 06Infrastructure-Foundations, 
10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407833 (10LSobanski) 03NEW [07:30:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:30:24] (03PS2) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [07:30:40] (03CR) 10Mszwarc: [C:03+1] Define CheckUser Suggested Investigations event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177) (owner: 10Dreamy Jazz) [07:32:03] (03Merged) 10jenkins-bot: cirrus: prepare completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:32:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P84148 and previous config saved to /var/cache/conftool/dbconfig/20251021-073207-root.json [07:32:38] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1196064|cirrus: prepare completion search with defaultsort A/B test (T404858)]] [07:32:42] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:34:00] (03PS1) 10Joely Rooke WMDE: Revert^2 "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197552 [07:37:13] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1196064|cirrus: prepare completion search with defaultsort A/B test (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:38:08] (03Merged) 10jenkins-bot: cloudnative-pg-operator: watch the growthbook namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197272 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:38:30] !log dcausse@deploy2002 dcausse: Continuing with sync [07:38:49] (03Merged) 10jenkins-bot: Deploy a postgresql-growthbook cluster in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197273 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:39:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1052 to es1 master and depool es1029 T407832', diff saved to https://phabricator.wikimedia.org/P84149 and previous config saved to /var/cache/conftool/dbconfig/20251021-073904-marostegui.json [07:39:09] T407832: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832 [07:39:57] (03PS1) 10Marostegui: es1029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197553 (https://phabricator.wikimedia.org/T407832) [07:41:39] (03CR) 10Marostegui: [C:03+2] es1029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197553 (https://phabricator.wikimedia.org/T407832) (owner: 10Marostegui) [07:41:55] (03CR) 10Brouberol: [C:03+2] deployment_server: create kubeconfigs to deploy postgresql-growthbook [puppet] - 10https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:42:36] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196064|cirrus: prepare completion search with defaultsort A/B test (T404858)]] (duration: 09m 58s) [07:42:41] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:43:51] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [07:46:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 5%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84151 and previous config 
saved to /var/cache/conftool/dbconfig/20251021-074604-root.json [07:47:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P84152 and previous config saved to /var/cache/conftool/dbconfig/20251021-074713-root.json [07:47:57] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 from sretest2003 - marostegui@cumin1003" [07:48:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 from sretest2003 - marostegui@cumin1003" [07:48:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:49:58] (03CR) 10Brouberol: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1196734 (https://phabricator.wikimedia.org/T309738) (owner: 10Scott French) [07:53:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [07:53:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [07:54:12] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:56:15] (03CR) 10Cathal Mooney: [C:03+2] homer-diff-checker: move execution from cumin1002 to cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1197321 (https://phabricator.wikimedia.org/T389380) (owner: 10Cathal Mooney) [07:56:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11292629 (10elukey) >>! In T406656#11289937, @bking wrote: > {F66767261} Please note that the examples that you posted above are not r... [07:57:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Pool sretest2003 with minimal weight T407352', diff saved to https://phabricator.wikimedia.org/P84154 and previous config saved to /var/cache/conftool/dbconfig/20251021-075741-marostegui.json [07:57:47] T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352 [07:58:35] (03PS3) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [07:59:41] (03PS2) 10Slyngshede: CAS version 7.2.6. 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 [08:01:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 7%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84155 and previous config saved to /var/cache/conftool/dbconfig/20251021-080110-root.json [08:02:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P84156 and previous config saved to /var/cache/conftool/dbconfig/20251021-080219-root.json [08:03:01] (03PS3) 10Slyngshede: CAS version 7.2.7 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) [08:04:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Increase sretest2003 weight in es1 T407352', diff saved to https://phabricator.wikimedia.org/P84157 and previous config saved to /var/cache/conftool/dbconfig/20251021-080412-marostegui.json [08:04:18] T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352 [08:04:58] (03PS4) 10Slyngshede: CAS version 7.2.7 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) [08:06:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:06:55] 06SRE, 06Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11292650 (10cmooney) >>! In T407726#11290651, @jhathaway wrote: > Since we seem to be able to handle the load okay, I think we should bump the max conntrack setting. Ok.... 
[08:07:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Increase sretest2003 weight in es1 T407352', diff saved to https://phabricator.wikimedia.org/P84158 and previous config saved to /var/cache/conftool/dbconfig/20251021-080733-marostegui.json [08:07:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:07:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11292651 (10elukey) >>! In T406656#11290513, @Dzahn wrote: > I just wanted to add that I still just see a logical conflict between 2 st... [08:09:12] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:09:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [08:09:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [08:10:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2033 gradually with 4 steps - Pool es2033.codfw.wmnet in after cloning [08:10:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2033.codfw.wmnet onto es2056.codfw.wmnet [08:13:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:16:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 10%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84160 and previous config saved to /var/cache/conftool/dbconfig/20251021-081616-root.json 
[08:16:35] 06SRE, 06Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11292678 (10cmooney) Hmm so the plot thickens, seems someone already tried this: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/m... [08:16:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Increase sretest2003 weight in es1 T407352', diff saved to https://phabricator.wikimedia.org/P84161 and previous config saved to /var/cache/conftool/dbconfig/20251021-081644-marostegui.json [08:16:49] T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352 [08:17:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P84162 and previous config saved to /var/cache/conftool/dbconfig/20251021-081725-root.json [08:19:20] (03PS1) 10David Caro: p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) [08:21:22] (03CR) 10CI reject: [V:04-1] p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) (owner: 10David Caro) [08:21:33] (03PS1) 10Brouberol: growthbook: remove all traces of mongoDB from the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197589 (https://phabricator.wikimedia.org/T406579) [08:23:27] (03CR) 10Majavah: thanos-rule: add support for multiple instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [08:24:16] (03PS2) 10David Caro: p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) [08:27:33] (03PS1) 10Majavah: thanos::rule: Cleanup firewall handling [puppet] - 10https://gerrit.wikimedia.org/r/1197590 
(https://phabricator.wikimedia.org/T407837) [08:27:35] (03PS1) 10Majavah: P:wmcs::metricsinfra: Fix thanos::rule usage [puppet] - 10https://gerrit.wikimedia.org/r/1197591 (https://phabricator.wikimedia.org/T407837) [08:28:36] (03CR) 10Majavah: [C:03+1] p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) (owner: 10David Caro) [08:29:12] (03CR) 10Cathal Mooney: [C:03+2] Add new Nokia switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [08:30:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7334/co" [puppet] - 10https://gerrit.wikimedia.org/r/1197590 (https://phabricator.wikimedia.org/T407837) (owner: 10Majavah) [08:30:47] (03CR) 10Elukey: [C:03+2] multirootca: add the client auth usage to the dse_k8s discovery issuer profile [puppet] - 10https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) (owner: 10Brouberol) [08:31:02] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7335/console" [puppet] - 10https://gerrit.wikimedia.org/r/1197591 (https://phabricator.wikimedia.org/T407837) (owner: 10Majavah) [08:31:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 20%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84163 and previous config saved to /var/cache/conftool/dbconfig/20251021-083122-root.json [08:32:01] (03CR) 10David Caro: [C:03+2] p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) (owner: 10David Caro) [08:32:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P84164 and previous config saved to /var/cache/conftool/dbconfig/20251021-083231-root.json [08:39:25] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11292758 (10jijiki) confirmed oob [08:39:36] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11292759 (10jijiki) confirmed oob [08:39:36] !log restart cfssl-multirootca on pki nodes to pick up new discovery settings (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196920) [08:39:37] (03PS1) 10Federico Ceratto: instances.yaml, es2056.yaml: prepare es2056 [puppet] - 10https://gerrit.wikimedia.org/r/1197594 (https://phabricator.wikimedia.org/T402859) [08:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:09] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [08:44:47] (03CR) 10Cathal Mooney: [C:03+2] sudoers: allow members of datacenter-ops group run homer [puppet] - 10https://gerrit.wikimedia.org/r/1196090 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:44:51] (03PS4) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [08:44:53] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:45:46] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "run sync to add new nokia switches - cmooney@cumin1003 - T405558" [08:45:46] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [08:45:50] T405558: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558 [08:46:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "run sync to add new nokia switches - 
cmooney@cumin1003 - T405558" [08:46:21] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [08:46:23] (03CR) 10CI reject: [V:04-1] Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [08:46:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 25%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84166 and previous config saved to /var/cache/conftool/dbconfig/20251021-084628-root.json [08:46:54] (03CR) 10Marostegui: [C:03+1] instances.yaml, es2056.yaml: prepare es2056 [puppet] - 10https://gerrit.wikimedia.org/r/1197594 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [08:48:07] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 60 hosts with reason: downtime new nokia devices in case they alert during tests [08:48:20] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11292788 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d70417af-8325-49e7-a880-7a0cd37bd2d2) set by cmo... 
[08:51:50] (03CR) 10Federico Ceratto: [C:03+2] clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [08:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:54:36] (03PS1) 10Marostegui: db2245: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197597 (https://phabricator.wikimedia.org/T406551) [08:54:45] FIRING: Emergency syslog message: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [08:55:30] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml, es2056.yaml: prepare es2056 [puppet] - 10https://gerrit.wikimedia.org/r/1197594 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [08:57:54] (03PS5) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [08:59:29] (03CR) 10Marostegui: [C:03+2] db2245: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197597 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [08:59:45] RESOLVED: Emergency syslog message: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:01:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 30%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84167 and previous config saved to /var/cache/conftool/dbconfig/20251021-090134-root.json [09:02:46] (03CR) 10CI reject: [V:04-1] Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [09:02:49] (03PS1) 
10Marostegui: db-test*: Change section [puppet] - 10https://gerrit.wikimedia.org/r/1197599 (https://phabricator.wikimedia.org/T400056) [09:02:56] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2056.codfw.wmnet [09:02:57] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2056.codfw.wmnet [09:03:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2056 slowly with 10 steps - Pooling in new host [09:03:47] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#11292847 (10Marostegui) What is pending here? Is this now a duplicate of https://phabricator.wikimedia.org/T400056? [09:04:19] (03CR) 10Federico Ceratto: [C:03+1] db-test*: Change section [puppet] - 10https://gerrit.wikimedia.org/r/1197599 (https://phabricator.wikimedia.org/T400056) (owner: 10Marostegui) [09:04:45] (03CR) 10Marostegui: [C:03+2] db-test*: Change section [puppet] - 10https://gerrit.wikimedia.org/r/1197599 (https://phabricator.wikimedia.org/T400056) (owner: 10Marostegui) [09:04:59] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#11292849 (10Ladsgroup) >>! In T389089#11292847, @Marostegui wrote: > What is pending here? Is this now a duplicate of https://phabricator.wikimedia.org/T400056? This is not a duplicate. This VM is for... 
[09:05:32] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#11292851 (10Marostegui) Ah cool, thanks [09:07:54] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db-test1003.eqiad.wmnet with OS trixie [09:11:00] (03PS1) 10Marostegui: instances.yaml: Add db2245 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197600 (https://phabricator.wikimedia.org/T406551) [09:12:20] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2245 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197600 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [09:14:13] (03PS6) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [09:14:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2245 depooled T406551', diff saved to https://phabricator.wikimedia.org/P84168 and previous config saved to /var/cache/conftool/dbconfig/20251021-091418-marostegui.json [09:14:24] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [09:14:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:14:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:15:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:15:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:16:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 50%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84169 and previous config saved to /var/cache/conftool/dbconfig/20251021-091640-root.json [09:16:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:16:52] !log 
brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:17:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:17:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1003.eqiad.wmnet with reason: host reimage [09:18:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:18:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84170 and previous config saved to /var/cache/conftool/dbconfig/20251021-091817-root.json [09:18:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:19:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:19:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:20:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:20:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:22:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:23:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:23:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:23:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:23:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1003.eqiad.wmnet with reason: host 
reimage [09:23:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:24:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:24:09] (03PS1) 10Elukey: profile::amd_gpu: upgrade trixie hosts to ROCm 7.0.2 repos [puppet] - 10https://gerrit.wikimedia.org/r/1197602 (https://phabricator.wikimedia.org/T403697) [09:24:56] (03PS1) 10Ladsgroup: api: Fix incorrect templatelinks query in ApiQueryInfo [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197603 (https://phabricator.wikimedia.org/T407842) [09:25:25] (03PS2) 10Elukey: profile::amd_gpu: upgrade trixie hosts to ROCm 7.0.2 repos [puppet] - 10https://gerrit.wikimedia.org/r/1197602 (https://phabricator.wikimedia.org/T403697) [09:27:13] (03PS1) 10Marostegui: db1234: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197604 (https://phabricator.wikimedia.org/T407463) [09:28:00] (03CR) 10Marostegui: [C:03+2] db1234: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197604 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [09:28:38] (03PS1) 10Elukey: role::maps::master_bookworm: fix EG stream name in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) [09:29:07] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:29:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1234.eqiad.wmnet with reason: Maintenance [09:29:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1234 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84171 and previous config saved to /var/cache/conftool/dbconfig/20251021-092911-marostegui.json [09:29:20] FIRING: [2x] SystemdUnitFailed: 
amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:29:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:29:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:29:49] FIRING: HelmReleaseBadStatus: Helm release growthbook/ferretdb-growthbook on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=growthbook - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:31:00] (03PS1) 10Slyngshede: data.yaml: record LDAP access for dpogorzelski [puppet] - 10https://gerrit.wikimedia.org/r/1197606 [09:31:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 60%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84172 and previous config saved to /var/cache/conftool/dbconfig/20251021-093146-root.json [09:32:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:32:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:32:47] (03PS2) 10Elukey: role::maps::master_bookworm: fix EG stream name in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) [09:32:55] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:33:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 5%: Pooling for the first time', diff saved to 
https://phabricator.wikimedia.org/P84173 and previous config saved to /var/cache/conftool/dbconfig/20251021-093323-root.json [09:34:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:34:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:35:32] (03CR) 10Elukey: [C:03+2] role::maps::master_bookworm: fix EG stream name in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:36:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1234 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84175 and previous config saved to /var/cache/conftool/dbconfig/20251021-093652-root.json [09:40:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:40:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:44:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:44:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:44:49] RESOLVED: HelmReleaseBadStatus: Helm release growthbook/ferretdb-growthbook on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=growthbook - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:44:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:45:01] (03CR) 10Tim Starling: [C:03+1] "Approved for deployment" [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197603 (https://phabricator.wikimedia.org/T407842) 
(owner: 10Ladsgroup) [09:46:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [09:46:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [09:46:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 75%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84176 and previous config saved to /var/cache/conftool/dbconfig/20251021-094652-root.json [09:48:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 7%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84177 and previous config saved to /var/cache/conftool/dbconfig/20251021-094829-root.json [09:48:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [09:48:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [09:49:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [09:49:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [09:50:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [09:50:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [09:51:44] (03PS1) 10Tiziano Fogli: sre/zookeeper: trigger a page for lost quorum on main cluster [alerts] - 10https://gerrit.wikimedia.org/r/1197607 (https://phabricator.wikimedia.org/T309012) [09:51:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1234 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84179 and previous config saved to /var/cache/conftool/dbconfig/20251021-095157-root.json [09:52:22] (03CR) 10Lucas Werkmeister (WMDE): "I don’t think we should revert anything on wmf.23 at 
this point." [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197552 (owner: 10Joely Rooke WMDE) [09:52:27] jouncebot: nowandnext [09:52:27] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [09:52:27] In 0 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1000) [09:52:51] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197608 (https://phabricator.wikimedia.org/T401290) [09:53:02] SREs: may I deploy ^ ASAP? [09:53:32] I deployed this revert to wmf.23 yesterday, but then forgot to +2 it on the master branch, so now wmf.24 has broken code again and I really ought to fix that before the train rolls out [09:55:14] (03PS4) 10Elukey: Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) [09:56:08] (03PS1) 10Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) [09:57:17] (03CR) 10Tiziano Fogli: "I noticed on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192855 that the "Zookeeper server" checks on nodes conf.* were triggeri" [alerts] - 10https://gerrit.wikimedia.org/r/1197607 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:58:25] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1197607 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:58:34] (03CR) 10Tiziano Fogli: [C:03+2] sre/zookeeper: trigger a page for lost quorum on main cluster [alerts] - 10https://gerrit.wikimedia.org/r/1197607 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:58:42] (03CR) 
10Tiziano Fogli: [C:03+2] zookeeper: remove check_prometheus, disable nrpe [puppet] - 10https://gerrit.wikimedia.org/r/1192855 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:59:45] (03CR) 10Marostegui: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1000) [10:00:33] reiterating my question from above, to whoever is responsible for this window [10:00:41] can I deploy a MediaWiki revert? [10:01:50] if it's an emergency I'd say go ahead [10:01:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 100%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84180 and previous config saved to /var/cache/conftool/dbconfig/20251021-100158-root.json [10:02:11] mention in _security also [10:02:53] Lucas_WMDE: go [10:03:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84181 and previous config saved to /var/cache/conftool/dbconfig/20251021-100335-root.json [10:03:42] thanks [10:04:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197608 (https://phabricator.wikimedia.org/T401290) (owner: 10Lucas Werkmeister (WMDE)) [10:04:56] (03CR) 10Elukey: [C:03+2] services: move tegola and kartotherian's eqiad configs to the new stack [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196803 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:05:02] +2ed, I’ll let gate-and-submit run its course (not urgent enough for a force-merge imho) [10:05:10] and see if I can reproduce it on 
testwiki in the meantime [10:06:09] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [10:07:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1234 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84183 and previous config saved to /var/cache/conftool/dbconfig/20251021-100703-root.json [10:07:07] (03PS2) 10Tiziano Fogli: dbbackups: enable nrpe2nodexp wrapper on mariadb_${type}_... checks [puppet] - 10https://gerrit.wikimedia.org/r/1196939 (https://phabricator.wikimedia.org/T315866) [10:09:03] (03PS13) 10Brouberol: opensearch-cluster: enable external ingress with TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196700 (https://phabricator.wikimedia.org/T406876) [10:09:12] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:09:33] once you're done, please ping me. I have a train blocker to deploy :D [10:12:36] Amir1: how urgent is it? we could swap :P [10:12:50] I think my revert isn’t super urgent, it should just happen before the train proceeds to group0 (beyond test wikis) [10:13:05] it's not that urgent but also gonna take a while to merge [10:13:17] ok [10:13:21] (03CR) 10Ladsgroup: [C:03+2] api: Fix incorrect templatelinks query in ApiQueryInfo [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197603 (https://phabricator.wikimedia.org/T407842) (owner: 10Ladsgroup) [10:13:34] (03CR) 10Hnowlan: [C:03+2] Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [10:14:05] I've +2'ed it, it's quite straightforward, if it gets merged together or sooner than your patch, would you mind bundling it? 
[10:14:31] I mean, it makes the scap output more confusing [10:14:40] I could cancel the current spiderpig and start another one with both patches [10:14:47] shouldn’t affect the ongoing gate-and-submit builds [10:14:54] does that sound okay? [10:16:15] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [10:16:26] (03PS1) 10Elukey: kubernetes: add the maps bookworm eqiad external service config [puppet] - 10https://gerrit.wikimedia.org/r/1197611 (https://phabricator.wikimedia.org/T381565) [10:16:36] (03Merged) 10jenkins-bot: Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197608 (https://phabricator.wikimedia.org/T401290) (owner: 10Lucas Werkmeister (WMDE)) [10:16:46] too late [10:16:55] ok, so this scap is just for the Wikibase revert then [10:16:59] (03PS2) 10Elukey: kubernetes: add the maps bookworm eqiad external service config [puppet] - 10https://gerrit.wikimedia.org/r/1197611 (https://phabricator.wikimedia.org/T381565) [10:17:13] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197608|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] [10:17:21] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [10:17:21] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684 [10:17:21] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [10:17:28] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197611 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:18:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 
20%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84184 and previous config saved to /var/cache/conftool/dbconfig/20251021-101841-root.json [10:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:21:28] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1197608|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:21:49] testing [10:22:06] yay, works [10:22:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1234 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84186 and previous config saved to /var/cache/conftool/dbconfig/20251021-102209-root.json [10:22:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [10:23:56] (03PS3) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196930 (https://phabricator.wikimedia.org/T407630) [10:25:57] (03Merged) 10jenkins-bot: api: Fix incorrect templatelinks query in ApiQueryInfo [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197603 (https://phabricator.wikimedia.org/T407842) (owner: 10Ladsgroup) [10:26:06] (03CR) 10Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:26:28] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197608|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 
T407744)]] (duration: 09m 15s) [10:26:36] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [10:26:36] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684 [10:26:36] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [10:26:49] Amir1: over to you [10:26:53] thanks! [10:27:36] !log oblivian@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [10:27:43] !log oblivian@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [10:28:10] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1197603|api: Fix incorrect templatelinks query in ApiQueryInfo (T407842)]] [10:28:14] T407842: PHP Warning: Undefined property: stdClass::$tl_namespace - https://phabricator.wikimedia.org/T407842 [10:30:30] (03PS1) 10Giuseppe Lavagetto: jaeger: fix CIDR for idp1005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197612 [10:30:59] (03CR) 10Elukey: [C:03+1] jaeger: fix CIDR for idp1005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197612 (owner: 10Giuseppe Lavagetto) [10:31:35] (03CR) 10Jcrespo: [C:03+1] dbbackups: enable nrpe2nodexp wrapper on mariadb_${type}_... checks [puppet] - 10https://gerrit.wikimedia.org/r/1196939 (https://phabricator.wikimedia.org/T315866) (owner: 10Tiziano Fogli) [10:32:07] (03PS1) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) [10:32:21] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1197603|api: Fix incorrect templatelinks query in ApiQueryInfo (T407842)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:32:50] (03CR) 10Arthur taylor: [C:04-1] "Still need to update the parser cache options to have a user-specific wbui2025-sensitive flag." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [10:33:22] (03PS1) 10Superpes15: [hsbwiktionary] Enable importing from enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197614 (https://phabricator.wikimedia.org/T407713) [10:33:42] (03PS2) 10Giuseppe Lavagetto: jaeger: fix CIDR for idp1005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197612 [10:33:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84187 and previous config saved to /var/cache/conftool/dbconfig/20251021-103347-root.json [10:34:05] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:37:39] (03CR) 10Giuseppe Lavagetto: [C:03+2] jaeger: fix CIDR for idp1005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197612 (owner: 10Giuseppe Lavagetto) [10:38:01] (03PS12) 10CDanis: haproxy: add JA4H support [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) [10:38:01] (03PS3) 10CDanis: haproxy: enable ja4h on cp7008 [puppet] - 10https://gerrit.wikimedia.org/r/1195234 (https://phabricator.wikimedia.org/T406990) [10:38:17] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197603|api: Fix incorrect templatelinks query in ApiQueryInfo (T407842)]] (duration: 10m 07s) [10:38:21] T407842: PHP Warning: Undefined property: stdClass::$tl_namespace - https://phabricator.wikimedia.org/T407842 [10:38:30] !log oblivian@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [10:38:36] !log oblivian@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [10:38:45] !log oblivian@deploy2002 helmfile [aux-k8s-eqiad] START 
helmfile.d/aux-k8s-services/jaeger: apply [10:39:01] !log oblivian@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply [10:43:50] (03PS1) 10Superpes15: [dawikisource] Enable RC Patrol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197615 (https://phabricator.wikimedia.org/T407790) [10:44:56] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195234 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [10:46:36] (03CR) 10Elukey: [C:03+2] kubernetes: add the maps bookworm eqiad external service config [puppet] - 10https://gerrit.wikimedia.org/r/1197611 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:46:50] (03CR) 10Marostegui: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:47:17] (03CR) 10Vgutierrez: [C:03+1] haproxy: enable ja4h on cp7008 [puppet] - 10https://gerrit.wikimedia.org/r/1195234 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [10:47:37] (03CR) 10CDanis: [C:03+2] haproxy: enable ja4h on cp7008 [puppet] - 10https://gerrit.wikimedia.org/r/1195234 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [10:47:41] (03CR) 10CDanis: [C:03+2] haproxy: add JA4H support [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [10:48:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 30%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84189 and previous config saved to /var/cache/conftool/dbconfig/20251021-104853-root.json [10:52:13] (03PS5) 10Btullis: Migrate data_check refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) [10:52:13] (03PS5) 10Btullis: Migrate the hdfs_cleaner refinery jobs to an-launcher1003 
[puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) [10:52:13] (03PS5) 10Btullis: Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) [10:52:13] (03PS5) 10Btullis: Migrate the project_namespace_map refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) [10:52:14] (03PS6) 10Btullis: Migrate the data_purge jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) [10:52:15] (03PS6) 10Btullis: Migrate the refine_netflow job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) [10:52:19] (03PS6) 10Btullis: Migrate sqoop jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) [10:52:23] (03PS6) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) [10:55:49] (03PS6) 10Btullis: Remove the data_check refinery job from both an-launcher hosts [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) [10:55:49] (03PS6) 10Btullis: Migrate the hdfs_cleaner refinery jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) [10:55:49] (03PS6) 10Btullis: Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) [10:55:49] (03PS6) 10Btullis: Migrate the project_namespace_map refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) [10:55:50] (03PS7) 10Btullis: Migrate the data_purge jobs to an-launcher1003 [puppet] - 
10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) [10:55:52] (03PS7) 10Btullis: Migrate the refine_netflow job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) [10:55:56] (03PS7) 10Btullis: Migrate sqoop jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) [10:56:00] (03PS7) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) [10:57:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:59:20] (03PS1) 10CDanis: haproxy: ja4h: actually load the Lua file [puppet] - 10https://gerrit.wikimedia.org/r/1197616 (https://phabricator.wikimedia.org/T406990) [10:59:35] (03PS2) 10CDanis: haproxy: ja4h: actually load the Lua file [puppet] - 10https://gerrit.wikimedia.org/r/1197616 (https://phabricator.wikimedia.org/T406990) [10:59:42] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. [10:59:43] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197616 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [10:59:57] (03PS1) 10Slyngshede: C:openldap extend wikimediaPerson schema for Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) [11:00:00] (03CR) 10Btullis: [C:03+2] Remove the data_check refinery job from both an-launcher hosts [puppet] - 10https://gerrit.wikimedia.org/r/1195780 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [11:00:17] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[11:00:35] (03PS7) 10Btullis: Migrate the hdfs_cleaner refinery jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) [11:01:09] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [11:03:14] (03CR) 10Btullis: [C:03+2] Migrate the hdfs_cleaner refinery jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195781 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [11:03:34] (03CR) 10Btullis: [C:03+2] Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [11:03:43] (03PS7) 10Btullis: Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) [11:04:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84191 and previous config saved to /var/cache/conftool/dbconfig/20251021-110359-root.json [11:04:26] (03CR) 10Btullis: [C:03+2] Migrate the import_*_dumps systemd jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195782 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [11:04:43] (03PS7) 10Btullis: Migrate the project_namespace_map refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) [11:06:16] (03PS1) 10Clément Goubert: mesh.configuration: Update to 1.14.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197619 (https://phabricator.wikimedia.org/T407826) [11:06:37] (03CR) 10CDanis: [C:03+2] haproxy: ja4h: actually load the Lua file [puppet] - 10https://gerrit.wikimedia.org/r/1197616 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [11:07:02] btullis: okay if I puppet-merge yours? 
[11:07:10] (03CR) 10Btullis: [C:03+2] Migrate the project_namespace_map refinery job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195783 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [11:07:20] (03PS1) 10Elukey: services: move tegola eqiad to the 'tegola' username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197620 (https://phabricator.wikimedia.org/T381565) [11:07:35] (03PS8) 10Btullis: Migrate the data_purge jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) [11:08:03] (03PS2) 10Elukey: services: move tegola eqiad to the 'tegola' username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197620 (https://phabricator.wikimedia.org/T381565) [11:08:17] (03CR) 10Btullis: [C:03+2] Migrate the data_purge jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195785 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [11:08:54] (03CR) 10CI reject: [V:04-1] mesh.configuration: Update to 1.14.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197619 (https://phabricator.wikimedia.org/T407826) (owner: 10Clément Goubert) [11:10:06] (03PS2) 10Clément Goubert: mesh.configuration: Update to 1.14.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197619 (https://phabricator.wikimedia.org/T407826) [11:11:15] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [11:11:55] Yeah works better if I actually add the files [11:13:16] claime: permission to !bash that? 
:> [11:13:31] Lucas_WMDE: for sure [11:13:41] !bash Yeah works better if I actually add the files [11:13:41] Lucas_WMDE: Stored quip at https://bash.toolforge.org/quip/8dF5BpoBvg159pQrjyCf [11:17:41] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:19:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 60%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84193 and previous config saved to /var/cache/conftool/dbconfig/20251021-111905-root.json [11:19:44] (03CR) 10Elukey: [C:03+2] services: move tegola eqiad to the 'tegola' username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197620 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [11:22:05] (03CR) 10Alexandros Kosiaris: [C:03+1] "Went through the CI output, it's version bumps and comments removals for the most part, so noop. There's an addition of 2 configmaps for s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [11:22:23] (03CR) 10Hnowlan: [C:03+1] "lgtm, thanks! looks like CI is stalled." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1197619 (https://phabricator.wikimedia.org/T407826) (owner: 10Clément Goubert) [11:23:25] hnowlan: CI is still going [11:23:34] Aaaan it's done [11:24:49] ah, last timestamp in the UI I saw was like 10 minutes in the past [11:25:12] yeah it got bogged down a bit [11:25:30] (03CR) 10Clément Goubert: [C:03+2] mesh.configuration: Update to 1.14.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197619 (https://phabricator.wikimedia.org/T407826) (owner: 10Clément Goubert) [11:26:27] (03PS1) 10Marostegui: installserver: Remove db2245 from preseed [puppet] - 10https://gerrit.wikimedia.org/r/1197621 [11:28:31] (03CR) 10Marostegui: [C:03+2] installserver: Remove db2245 from preseed [puppet] - 10https://gerrit.wikimedia.org/r/1197621 (owner: 10Marostegui) [11:29:01] (03PS8) 10Btullis: Migrate sqoop jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) [11:29:01] (03PS8) 10Btullis: Migrate the refine_netflow job to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) [11:29:01] (03PS8) 10Btullis: Migrate refine_sanitize jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195786 (https://phabricator.wikimedia.org/T402943) [11:29:57] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:30:35] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:30:57] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:31:17] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:31:21] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:32:03] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[11:33:29] (03CR) 10Btullis: Migrate the refine_netflow job to an-launcher1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195779 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [11:33:49] (03CR) 10Effie Mouzeli: [C:03+2] "> tested against the 4 realservers using `curl --connect-to ::$(dig +short hcaptcha1001.wikimedia.org):4260 https://hcaptcha.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1197230 (https://phabricator.wikimedia.org/T407615) (owner: 10Effie Mouzeli) [11:33:54] (03CR) 10Btullis: [C:03+2] Migrate sqoop jobs to an-launcher1003 [puppet] - 10https://gerrit.wikimedia.org/r/1195784 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [11:34:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84195 and previous config saved to /var/cache/conftool/dbconfig/20251021-113411-root.json [11:36:00] (03Merged) 10jenkins-bot: mesh.configuration: Update to 1.14.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197619 (https://phabricator.wikimedia.org/T407826) (owner: 10Clément Goubert) [11:39:38] (03CR) 10CDanis: [C:03+1] mesh.configuration: Update to 1.14.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197619 (https://phabricator.wikimedia.org/T407826) (owner: 10Clément Goubert) [11:40:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Increase sretest2003 weight in es1 T407352', diff saved to https://phabricator.wikimedia.org/P84197 and previous config saved to /var/cache/conftool/dbconfig/20251021-114005-marostegui.json [11:40:10] T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352 [11:41:48] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:42:32] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:44:42] PROBLEM - PyBal backends 
health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:47:09] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:47:11] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:47:25] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:49:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84198 and previous config saved to /var/cache/conftool/dbconfig/20251021-114917-root.json [11:49:46] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2056 slowly with 10 steps - Pooling in new host [11:51:15] (03PS1) 10Clément Goubert: mesh.configuration: Fix request_id extension config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197623 [11:51:27] (03CR) 10CI reject: [V:04-1] mesh.configuration: Fix request_id extension config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197623 (owner: 10Clément Goubert) [11:52:57] (03PS2) 10Clément Goubert: mesh.configuration: Fix request_id extension config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197623 [11:53:14] (03CR) 10CI reject: [V:04-1] mesh.configuration: Fix request_id extension config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197623 (owner: 10Clément Goubert) [11:53:18] (03PS3) 10Clément Goubert: mesh.configuration: Fix request_id extension config [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1197623 [11:53:35] (03CR) 10Clément Goubert: [V:03+1 C:03+1] mesh.configuration: Fix request_id extension config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197623 (owner: 10Clément Goubert) [11:53:56] (03CR) 10Clément Goubert: [V:03+2 C:03+1] mesh.configuration: Fix request_id extension config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197623 (owner: 10Clément Goubert) [11:53:58] (03CR) 10Clément Goubert: [V:03+2 C:03+2] mesh.configuration: Fix request_id extension config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197623 (owner: 10Clément Goubert) [11:54:15] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:59:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1200) [12:01:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [12:02:36] brouberol: If you're deploying something that uses mesh.configuration and tracing, it's currently broken, patch is in CI [12:06:17] (03Merged) 10jenkins-bot: mesh.configuration: Fix request_id extension config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197623 (owner: 10Clément Goubert) [12:06:42] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:07:45] 07sre-alert-triage, 06serviceops, 13Patch-For-Review: Alert in need of triage: ProbeDown (instance proxoid:4260) - https://phabricator.wikimedia.org/T407615#11293385 (10jijiki) 
05Open→03Resolved a:03jijiki [12:09:12] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:17] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [12:09:37] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:09:42] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:10:22] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:12:35] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:13:22] (03Abandoned) 10Joely Rooke WMDE: Revert^2 "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197552 (owner: 10Joely Rooke WMDE) [12:16:58] (03PS1) 10CDanis: benthos: webrequest_live: add ja4h field [puppet] - 10https://gerrit.wikimedia.org/r/1197627 (https://phabricator.wikimedia.org/T406990) [12:17:56] (03PS1) 10CDanis: turnilo: webrequest: add ja4h field [puppet] - 10https://gerrit.wikimedia.org/r/1197628 (https://phabricator.wikimedia.org/T406990) [12:19:49] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:20:28] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720#11293425 (10Jclark-ctr) [12:20:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720#11293426 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:21:07] 
!log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:21:16] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:22:03] (03PS1) 10Marostegui: instances.yaml: Remove es1029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197631 (https://phabricator.wikimedia.org/T407832) [12:22:16] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:22:25] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:22:29] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:23:07] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [12:23:10] (03PS2) 10Federico Ceratto: aptrepo: enable wmfmariadbpy for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1197629 (https://phabricator.wikimedia.org/T407472) [12:23:10] (03CR) 10Federico Ceratto: "Similar to past changes for T397305" [puppet] - 10https://gerrit.wikimedia.org/r/1197629 (https://phabricator.wikimedia.org/T407472) (owner: 10Federico Ceratto) [12:23:24] brouberol: should be good now [12:23:48] (03PS3) 10Federico Ceratto: aptrepo: enable wmfmariadbpy for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1197629 (https://phabricator.wikimedia.org/T407472) [12:27:09] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [12:27:19] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [12:28:15] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [12:29:00] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [12:29:24] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [12:29:35] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:29:40] !log cgoubert@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/mw-page-content-change-enrich: apply [12:29:46] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:30:08] jouncebot: nowandnext [12:30:08] For the next 0 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1200) [12:30:08] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1300) [12:30:26] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:30:44] Anyone mind if I deploy via scap? [12:31:21] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad racks for variance from Netbox - https://phabricator.wikimedia.org/T407851 (10Jclark-ctr) 03NEW [12:31:27] (03CR) 10Awight: "Go ahead and merge at any time, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1197356 (owner: 10Awight) [12:31:34] (03CR) 10Awight: [C:03+1] Temporarily revoke ssh access for awight [puppet] - 10https://gerrit.wikimedia.org/r/1197356 (owner: 10Awight) [12:31:43] Dreamy_Jazz: gimme a minute please [12:31:48] Sure [12:31:54] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp40[50-52].ulsfo.wmnet} and A:cp [12:31:57] I'm finishing up deploying an envoy config change [12:31:59] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [12:32:05] (03PS2) 10Awight: Temporarily revoke ssh access for awight [puppet] - 10https://gerrit.wikimedia.org/r/1197356 [12:32:07] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [12:33:03] (03CR) 10Marostegui: [C:03+1] aptrepo: enable wmfmariadbpy for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1197629 (https://phabricator.wikimedia.org/T407472) (owner: 10Federico Ceratto) [12:34:00] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:34:23] (03CR) 10Tiziano Fogli: [C:03+2] dbbackups: 
enable nrpe2nodexp wrapper on mariadb_${type}_... checks [puppet] - 10https://gerrit.wikimedia.org/r/1196939 (https://phabricator.wikimedia.org/T315866) (owner: 10Tiziano Fogli) [12:35:24] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:35:29] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197631 (https://phabricator.wikimedia.org/T407832) (owner: 10Marostegui) [12:35:30] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [12:36:03] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [12:36:11] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqiad and A:cp [12:36:29] Dreamy_Jazz: ok I've deployed it for eqiad, but it takes quite a long time... So you can go ahead and do your scap deployment, it will bring my change in as well, I've confirmed it works so it won't interfere [12:36:29] !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqiad and A:cp [12:36:46] Thanks. 
I'll start now [12:36:59] (03CR) 10Dreamy Jazz: [C:03+2] Define CheckUser Suggested Investigations event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177) (owner: 10Dreamy Jazz) [12:37:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1029 from dbctl T407832', diff saved to https://phabricator.wikimedia.org/P84200 and previous config saved to /var/cache/conftool/dbconfig/20251021-123706-marostegui.json [12:37:07] (03CR) 10Dreamy Jazz: [C:03+2] CheckUser UserInfoCard: Enable XTools menu link on SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196914 (https://phabricator.wikimedia.org/T406012) (owner: 10Dreamy Jazz) [12:37:10] T407832: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832 [12:37:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177) (owner: 10Dreamy Jazz) [12:37:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196914 (https://phabricator.wikimedia.org/T406012) (owner: 10Dreamy Jazz) [12:38:00] (03Merged) 10jenkins-bot: Define CheckUser Suggested Investigations event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177) (owner: 10Dreamy Jazz) [12:38:02] (03Merged) 10jenkins-bot: CheckUser UserInfoCard: Enable XTools menu link on SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196914 (https://phabricator.wikimedia.org/T406012) (owner: 10Dreamy Jazz) [12:38:15] (03PS3) 10Awight: Temporarily revoke ssh access for awight [puppet] - 10https://gerrit.wikimedia.org/r/1197356 [12:38:18] (03CR) 10Ladsgroup: [C:03+2] Temporarily revoke ssh access for awight [puppet] - 10https://gerrit.wikimedia.org/r/1197356 (owner: 10Awight) 
[12:38:19] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Temporarily revoke ssh access for awight [puppet] - 10https://gerrit.wikimedia.org/r/1197356 (owner: 10Awight) [12:38:34] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1197278|Define CheckUser Suggested Investigations event stream (T404177)]], [[gerrit:1196914|CheckUser UserInfoCard: Enable XTools menu link on SUL wikis (T406012)]] [12:38:41] T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177 [12:38:41] T406012: UserInfoCard: Add link to view XTools - https://phabricator.wikimedia.org/T406012 [12:38:51] 10SRE-swift-storage, 10Observability-Alerting: Remove load_average check for ms-be/thanos-be - https://phabricator.wikimedia.org/T370526#11293475 (10tappof) a:03tappof [12:42:18] !log stopping netbox service on netbox-dev2003 to update db from live netbox [12:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:34] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4050.ulsfo.wmnet [12:42:57] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1197278|Define CheckUser Suggested Investigations event stream (T404177)]], [[gerrit:1196914|CheckUser UserInfoCard: Enable XTools menu link on SUL wikis (T406012)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:43:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11293483 (10Jclark-ctr) a:03VRiley-WMF [12:44:55] (03PS1) 10DCausse: cirrus: enable completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197642 (https://phabricator.wikimedia.org/T404858) [12:45:46] (03CR) 10CI reject: [V:04-1] cirrus: enable completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197642 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [12:46:07] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad racks for variance from Netbox - https://phabricator.wikimedia.org/T407851#11293496 (10Jclark-ctr) T380564 ganeti1014 found in Rack B3 U15 [12:46:10] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [12:46:30] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:47:06] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1100.eqiad.wmnet [12:47:14] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1101.eqiad.wmnet [12:50:17] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197278|Define CheckUser Suggested Investigations event stream (T404177)]], [[gerrit:1196914|CheckUser UserInfoCard: Enable XTools menu link on SUL wikis (T406012)]] (duration: 11m 42s) [12:50:23] T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177 [12:50:23] T406012: UserInfoCard: Add link to view XTools - https://phabricator.wikimedia.org/T406012 [12:50:34] Deploy went fine from what I can see [12:51:11] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad racks for variance from Netbox - https://phabricator.wikimedia.org/T407851#11293517 (10Jclark-ctr) es1057 T400198 Racked in u14 netbox list u15 [12:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:54:59] (03PS2) 10DCausse: cirrus: enable completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197642 (https://phabricator.wikimedia.org/T404858) [12:55:34] lvs2013 pybal [12:56:02] hm seems like it was missed in the other rounds [12:56:17] effie: ^ we should also restart on lvs2013 [12:56:52] I did so [12:57:12] oh no [12:57:15] sorry, I missed it [12:57:17] sigh [12:57:19] it seems like we skipped 2013 [12:57:23] all good! [12:57:49] ok sorted [12:59:12] <3 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1300). [13:00:05] seanleong-wmde and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:31] o/ [13:00:43] I’m pretty busy right now, hopefully someone else can run the window… [13:04:40] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [13:05:11] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [13:06:30] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:06:39] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [13:07:32] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [13:14:28] (03PS1) 10Cathal Mooney: Update provision and interface validator to support Nokia TOR [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1197647 (https://phabricator.wikimedia.org/T405637) [13:15:51] (03PS2) 10Cathal Mooney: Update provision and interface validator to support Nokia TOR [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1197647 (https://phabricator.wikimedia.org/T405637) [13:17:00] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad racks for variance from Netbox - https://phabricator.wikimedia.org/T407851#11293602 (10Jclark-ctr) dbprov1007 was found to have the model description listed as 1U, but it is actually a 2U server. I’ve updated the 760sx custom entry in Netbox to reflect 2U. T400412 T... [13:17:18] I've patches scheduled and postponed twice yesterday due to the absence of a deployer... 
I hope someone will be here soon :/ [13:22:33] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad racks for variance from Netbox - https://phabricator.wikimedia.org/T407851#11293616 (10Jclark-ctr) finished auditing all racks excluding new machine learning racks and new fundraising racks e 9- e 16 [13:23:39] (03PS1) 10D3r1ck01: user: Log user ID and name when Setup isn't fully initialized [core] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197649 (https://phabricator.wikimedia.org/T406433) [13:24:02] (03PS1) 10D3r1ck01: user: Log user ID and name when Setup isn't fully initialized [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197650 (https://phabricator.wikimedia.org/T406433) [13:24:42] (03PS3) 10Cathal Mooney: Update provision and interface validator to support Nokia TOR [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1197647 (https://phabricator.wikimedia.org/T405637) [13:25:27] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4051.ulsfo.wmnet [13:27:59] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1102.eqiad.wmnet [13:28:06] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1103.eqiad.wmnet [13:29:15] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:31:40] alright, I guess I can deploy anyway [13:31:52] no sign of seanlong-wmde yet so let’s do Superpes [13:32:42] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] [igwiki] Create 'autopatrolled' and 'rollbacker' usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196921 (https://phabricator.wikimedia.org/T407439) (owner: 10Superpes15) [13:32:49] Thanks :) These are all minor changes! Feel free to choose how you want to merge them! 
[13:32:56] yeah I’m looking at them now [13:33:01] can probably be squashed at least partially [13:34:05] I notice that throttling exception has an IP that already occurs in the file o_O [13:34:21] I guess https://phabricator.wikimedia.org/T406655 and https://phabricator.wikimedia.org/T407630 are related ^^ [13:34:26] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196930 (https://phabricator.wikimedia.org/T407630) (owner: 10Superpes15) [13:34:57] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] [specieswiki] Enable USERLANGUAGE magic word [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197004 (https://phabricator.wikimedia.org/T406583) (owner: 10Superpes15) [13:35:01] 10SRE-tools, 10Observability-Alerting, 10Spicerack, 06Traffic: Alertmanager triggers an alert on IRC and email after the alert has resolved - https://phabricator.wikimedia.org/T407787#11293662 (10ssingh) > It looks like spicerack should check that alerts for the downtimed host have been resolved (not in fi... 
[13:35:20] Yep They create different tasks for every edit-a-thon [13:35:53] See https://phabricator.wikimedia.org/people/tasks/authored/40475/ [13:36:03] neat :) [13:36:22] google translate claims https://hsb.wiktionary.org/wiki/Wikis%C5%82ownik:Portal#Import wants to enable imports from *Wikibooks* [13:36:35] but I can see that the project namespace is Wikisłownik [13:36:39] so that’s just google being wrong [13:36:45] and it must mean wiktionary after all [13:36:59] My translator says "Wiktionary" lol [13:37:03] (it’s translating from Slovak and I don’t know how related that even is to Upper Sorbian) [13:37:18] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] [hsbwiktionary] Enable importing from enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197614 (https://phabricator.wikimedia.org/T407713) (owner: 10Superpes15) [13:37:39] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] [dawikisource] Enable RC Patrol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197615 (https://phabricator.wikimedia.org/T407790) (owner: 10Superpes15) [13:37:48] okay, I think we can do all of those together in one deploy [13:37:49] does that sound okay? 
[13:38:03] Using "detect language" it is considered either Czech or Polish and says wiktionary Lol [13:38:04] (03PS1) 10Tiziano Fogli: nrpe2nodexp: add randomized delay to timers [puppet] - 10https://gerrit.wikimedia.org/r/1197655 (https://phabricator.wikimedia.org/T395446) [13:38:07] Yep absolutely [13:38:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196921 (https://phabricator.wikimedia.org/T407439) (owner: 10Superpes15) [13:38:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196930 (https://phabricator.wikimedia.org/T407630) (owner: 10Superpes15) [13:38:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197004 (https://phabricator.wikimedia.org/T406583) (owner: 10Superpes15) [13:38:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197614 (https://phabricator.wikimedia.org/T407713) (owner: 10Superpes15) [13:38:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197615 (https://phabricator.wikimedia.org/T407790) (owner: 10Superpes15) [13:40:21] (03Merged) 10jenkins-bot: [igwiki] Create 'autopatrolled' and 'rollbacker' usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196921 (https://phabricator.wikimedia.org/T407439) (owner: 10Superpes15) [13:40:23] (03Merged) 10jenkins-bot: Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196930 (https://phabricator.wikimedia.org/T407630) (owner: 10Superpes15) [13:40:25] (03Merged) 
10jenkins-bot: [specieswiki] Enable USERLANGUAGE magic word [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197004 (https://phabricator.wikimedia.org/T406583) (owner: 10Superpes15) [13:40:28] (03Merged) 10jenkins-bot: [hsbwiktionary] Enable importing from enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197614 (https://phabricator.wikimedia.org/T407713) (owner: 10Superpes15) [13:40:30] (03Merged) 10jenkins-bot: [dawikisource] Enable RC Patrol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197615 (https://phabricator.wikimedia.org/T407790) (owner: 10Superpes15) [13:41:06] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1196921|[igwiki] Create 'autopatrolled' and 'rollbacker' usergroups (T407439)]], [[gerrit:1196930|Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 (T407630)]], [[gerrit:1197004|[specieswiki] Enable USERLANGUAGE magic word (T406583)]], [[gerrit:1197614|[hsbwiktionary] Enable importing from enwiktionary (T407713 [13:41:06] )]], [[gerrit:1197615|[dawikisource] Enable RC Patrol (T407790)]] [13:41:16] T407439: Add missing user groups on Igbo Wikipedia - https://phabricator.wikimedia.org/T407439 [13:41:17] T407630: Lift IP cap on these dates 2025-11-06 and 2025-11-07 for edit-a-thon for eswiki and commons - https://phabricator.wikimedia.org/T407630 [13:41:17] T406583: Enable USERLANGUAGE magic word for Wikispecies - https://phabricator.wikimedia.org/T406583 [13:41:17] T407713: Change import sources for hsb.wiktionary.org - https://phabricator.wikimedia.org/T407713 [13:41:18] T407790: Enable RC Patrol on Danish Wikisource - https://phabricator.wikimedia.org/T407790 [13:43:33] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [13:44:26] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [13:45:28] !log lucaswerkmeister-wmde@deploy2002 superpes, lucaswerkmeister-wmde: Backport for 
[[gerrit:1196921|[igwiki] Create 'autopatrolled' and 'rollbacker' usergroups (T407439)]], [[gerrit:1196930|Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 (T407630)]], [[gerrit:1197004|[specieswiki] Enable USERLANGUAGE magic word (T406583)]], [[gerrit:1197614|[hsbwiktionary] Enable importing from enwiktionary [13:45:28] (T407713)]], [[gerrit:1197615|[dawikisource] Enable RC Patrol (T407790)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:45:43] Testing! Just give me a minute to check everything lol [13:45:53] !log (cont.) (T407713)]], [[gerrit:1197615|[dawikisource] Enable RC Patrol (T407790)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:03] good luck :D [13:46:44] (03PS1) 10Jelto: conftool-data: add etcd data for gerrit* services [puppet] - 10https://gerrit.wikimedia.org/r/1197657 (https://phabricator.wikimedia.org/T365259) [13:47:10] ^ interesting [13:48:19] Lucas_WMDE Everything is fine :D [13:48:22] (03PS1) 10Dreamy Jazz: mediawiki: Run sendVerifyEmailReminderNotification.php monthly [puppet] - 10https://gerrit.wikimedia.org/r/1197658 (https://phabricator.wikimedia.org/T58074) [13:48:31] (03PS4) 10Effie Mouzeli: kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176209 [13:48:31] !log lucaswerkmeister-wmde@deploy2002 superpes, lucaswerkmeister-wmde: Continuing with sync [13:48:32] yay! 
[13:48:47] (03CR) 10Herron: [C:03+1] nrpe2nodexp: add randomized delay to timers [puppet] - 10https://gerrit.wikimedia.org/r/1197655 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [13:50:08] (03CR) 10Herron: [C:03+1] k8s/client_cert: adjust Prometheus certificate renewal timing [puppet] - 10https://gerrit.wikimedia.org/r/1197303 (https://phabricator.wikimedia.org/T407484) (owner: 10Tiziano Fogli) [13:50:20] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11293761 (10elukey) @MoritzMuehlenhoff to keep archives happy, I had to do T381565#11225146 on maps1011 as well: ` elukey@maps2011:~$ sudo -u postgres psql -f /usr/local/bin/maps-gra... [13:51:36] (03PS1) 10Phuedx: EventStreamConfig: Remove mediawiki.reference_previews stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197659 (https://phabricator.wikimedia.org/T242127) [13:51:40] Don't know if it was my connection but the mwdebug servers were slow to load [13:52:07] Not excessively but more than usual [13:53:00] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196921|[igwiki] Create 'autopatrolled' and 'rollbacker' usergroups (T407439)]], [[gerrit:1196930|Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 (T407630)]], [[gerrit:1197004|[specieswiki] Enable USERLANGUAGE magic word (T406583)]], [[gerrit:1197614|[hsbwiktionary] Enable importing from enwiktionary (T40771 [13:53:00] 3)]], [[gerrit:1197615|[dawikisource] Enable RC Patrol (T407790)]] (duration: 11m 54s) [13:53:09] T407439: Add missing user groups on Igbo Wikipedia - https://phabricator.wikimedia.org/T407439 [13:53:09] T407630: Lift IP cap on these dates 2025-11-06 and 2025-11-07 for edit-a-thon for eswiki and commons - https://phabricator.wikimedia.org/T407630 [13:53:10] T406583: Enable USERLANGUAGE magic word for Wikispecies - https://phabricator.wikimedia.org/T406583 [13:53:10] 
T40771: New Pages Feed - Filters flyout - https://phabricator.wikimedia.org/T40771 [13:53:10] T407790: Enable RC Patrol on Danish Wikisource - https://phabricator.wikimedia.org/T407790 [13:54:24] I’m having some connection issues too but I thought they were on my end [13:54:28] (03CR) 10Jelto: "In I96a32f57a3dc2fff3bdf8f0510898392537e9ee8 (abandoned) I tried to create dedicated services for ssh and https. I'm not sure if that's st" [puppet] - 10https://gerrit.wikimedia.org/r/1197657 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [13:54:30] (IPv6 seems slow, IPv4 working better) [13:55:18] “Damn and blast British Telecom,” shouted Dirk, the words coming easily from force of habit. [13:55:32] (03CR) 10Tiziano Fogli: [C:03+2] nrpe2nodexp: add randomized delay to timers [puppet] - 10https://gerrit.wikimedia.org/r/1197655 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [13:55:58] (03PS15) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) [13:56:02] Wonderful! 
Many thanks for your assistance :3 [13:56:10] I don’t see Sean around and we don’t really have time for another deploy anyway [13:56:12] so I’ll just close the window [13:56:16] !log UTC afternoon backport+config window done [13:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:28] 5/6 changes deployed is a decent quota, even if it was just one deployment :P [13:56:49] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197658 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [13:57:18] (03PS2) 10Dreamy Jazz: mediawiki: Run sendVerifyEmailReminderNotification.php monthly [puppet] - 10https://gerrit.wikimedia.org/r/1197658 (https://phabricator.wikimedia.org/T58074) [13:57:22] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1197629 (https://phabricator.wikimedia.org/T407472) (owner: 10Federico Ceratto) [13:58:34] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197658 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [13:59:04] (03CR) 10Mooeypoo: mediawiki-engineering: Add REST API alerts with thresholds (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse) [14:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1400) [14:05:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11293878 (10elukey) There seems something not working in the eqiad replication, I don't see the expected/new perms on 1012: ` elukey@maps1012:~$ sudo -u postgres psql -d gis psql (15... 
[14:06:39] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack, 06Traffic: Alertmanager triggers an alert on IRC and email after the alert has resolved - https://phabricator.wikimedia.org/T407787#11293886 (10Volans) For some related historical context on the lack of parity between the I... [14:06:47] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4052.ulsfo.wmnet [14:06:47] !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp40[50-52].ulsfo.wmnet} and A:cp [14:08:44] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1105.eqiad.wmnet [14:08:44] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1104.eqiad.wmnet [14:13:10] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 46997 [14:14:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197649 (https://phabricator.wikimedia.org/T406433) (owner: 10D3r1ck01) [14:14:29] 07Puppet, 10MW-on-K8s, 10Observability-Alerting: Clean up "git repo needs merge" checks - https://phabricator.wikimedia.org/T370530#11293938 (10tappof) a:03tappof [14:14:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197650 (https://phabricator.wikimedia.org/T406433) (owner: 10D3r1ck01) [14:14:51] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 46997 [14:17:15] jouncebot: nowandnext [14:17:15] For the next 0 hour(s) and 12 minute(s): Metrics Platform Experimentation Lab Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1400) [14:17:15] In 0 hour(s) and 12 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1430) [14:17:55] Anyone mind if I backport? [14:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:20:06] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11293962 (10elukey) Of course I am stupid, I re-executed the grants on maps2011 that was already ok. This was the correct action: ` elukey@maps1011:~$ sudo -u postgres psql -f /usr/l... [14:23:18] (03CR) 10Effie Mouzeli: [C:03+2] kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176209 (owner: 10Effie Mouzeli) [14:24:28] Dreamy_Jazz: I think it's fine yeah [14:24:36] Thanks [14:24:46] I will deploy shortly [14:25:17] (03PS1) 10Dreamy Jazz: Update sendVerifyEmailReminderNotification to use relative timestamp [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197663 (https://phabricator.wikimedia.org/T58074) [14:25:30] (03PS1) 10Dreamy Jazz: Update sendVerifyEmailReminderNotification to use relative timestamp [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197665 (https://phabricator.wikimedia.org/T58074) [14:26:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197665 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [14:26:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" 
[extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197663 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [14:27:19] (03CR) 10Ericmill: Update sendVerifyEmailReminderNotification to use relative timestamp (031 comment) [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197665 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [14:28:11] (03PS2) 10Dreamy Jazz: Update sendVerifyEmailReminderNotification to use relative timestamp [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197665 (https://phabricator.wikimedia.org/T58074) [14:28:20] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197665 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [14:28:41] (03PS2) 10Dreamy Jazz: Update sendVerifyEmailReminderNotification to use relative timestamp [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197663 (https://phabricator.wikimedia.org/T58074) [14:28:43] (03CR) 10Cathal Mooney: [C:03+2] netops: add new BGP group names to CoreBGPDwon alert [alerts] - 10https://gerrit.wikimedia.org/r/1196900 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [14:28:46] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197663 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [14:29:31] (03PS3) 10Dreamy Jazz: Update sendVerifyEmailReminderNotification to use relative timestamp [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197665 (https://phabricator.wikimedia.org/T58074) [14:29:35] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy2002 using scap backport" 
[extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197665 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1430) [14:30:19] (03Merged) 10jenkins-bot: netops: add new BGP group names to CoreBGPDwon alert [alerts] - 10https://gerrit.wikimedia.org/r/1196900 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [14:30:43] (03Merged) 10jenkins-bot: kube-state-metrics: collect metrics for metadata.labels.username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176209 (owner: 10Effie Mouzeli) [14:31:27] jouncebot: refresh [14:31:28] I refreshed my knowledge about deployments. [14:31:45] jouncebot: nowandnext [14:31:45] For the next 0 hour(s) and 28 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1430) [14:31:45] In 0 hour(s) and 28 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1500) [14:37:57] (03Merged) 10jenkins-bot: Update sendVerifyEmailReminderNotification to use relative timestamp [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197663 (https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [14:42:05] !log dancy@deploy2002 Pruned MediaWiki: 1.45.0-wmf.21 (duration: 02m 24s) [14:42:06] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [14:42:22] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [14:42:29] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:42:45] (03Merged) 10jenkins-bot: Update sendVerifyEmailReminderNotification to use relative timestamp [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197665 
(https://phabricator.wikimedia.org/T58074) (owner: 10Dreamy Jazz) [14:42:49] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:43:21] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1197665|Update sendVerifyEmailReminderNotification to use relative timestamp (T58074)]], [[gerrit:1197663|Update sendVerifyEmailReminderNotification to use relative timestamp (T58074)]] [14:43:25] T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074 [14:43:26] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11294068 (10elukey) Started the cache warm-up for eqiad, using tegola-swift-codfw-v003 as reference. To check progress: * Events being sent to Kafka: [[ https://grafana.wikimedia.or... [14:47:02] (03CR) 10Volans: [C:03+1] "LGTM, thx!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [14:47:14] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:47:22] (03CR) 10Elukey: [C:03+2] Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [14:47:29] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1197665|Update sendVerifyEmailReminderNotification to use relative timestamp (T58074)]], [[gerrit:1197663|Update sendVerifyEmailReminderNotification to use relative timestamp (T58074)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:47:43] Nothing to verify for this change, so proceeding [14:47:44] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:47:51] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:48:03] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:48:23] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:49:12] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [14:49:12] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:49:24] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1106.eqiad.wmnet [14:49:25] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1107.eqiad.wmnet [14:50:49] (03PS2) 10Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) [14:51:10] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1197669 [14:52:13] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197665|Update sendVerifyEmailReminderNotification to use relative timestamp (T58074)]], [[gerrit:1197663|Update sendVerifyEmailReminderNotification to use relative timestamp (T58074)]] (duration: 08m 52s) [14:52:18] T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074 [14:52:27] (03CR) 10Federico Ceratto: [C:03+2] aptrepo: enable wmfmariadbpy for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1197629 (https://phabricator.wikimedia.org/T407472) (owner: 10Federico Ceratto) [14:52:54] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [14:53:06] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [14:53:13] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [14:53:25] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [14:53:39] !log cgoubert@deploy2002 
helmfile [staging] START helmfile.d/services/chart-renderer: apply [14:53:52] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sre1005 add dns entries - cmooney@cumin1003" [14:53:57] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sre1005 add dns entries - cmooney@cumin1003" [14:53:57] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:10] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [14:54:53] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:55:01] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:55:10] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [14:55:28] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [14:55:34] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [14:55:45] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [14:55:51] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [14:56:06] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:56:16] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:56:18] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [14:56:56] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [14:57:10] FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:57:12] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [14:57:17] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [14:57:31] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [14:57:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:57:43] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [14:58:06] topranks: expected? 
[14:58:11] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [14:58:21] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [14:58:28] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [14:58:55] claime: just checking the maintenance calendar, it's our Telxius transport back to eqiad from drmrs [14:59:39] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [14:59:49] topranks: ack [14:59:53] nothing in the calendar or email that I see [14:59:53] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:00:02] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [15:00:05] jelto, arnoldokoth, and mutante: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1500). 
[15:00:12] traffic has re-routed over GTT so ok for now, will keep an eye on it and raise a fault with them if it doesn't come back soon [15:00:20] thanks for the heads up <3 [15:00:22] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [15:00:24] yw <3 [15:00:30] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:00:38] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:00:44] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:00:52] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:00:57] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [15:01:04] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [15:01:10] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [15:01:19] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [15:01:32] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [15:02:06] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [15:02:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:02:18] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [15:02:39] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [15:02:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:02:43] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [15:03:01] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [15:03:11] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [15:03:16] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1004.eqiad.wmnet with reason: reboot for kernel [15:03:28] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [15:04:08] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [15:04:15] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [15:04:20] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [15:04:27] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [15:04:33] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{sessionstore2*} and P{P:Cassandra} [15:05:17] (03PS1) 10Dpogorzelski: feat: add dpogorzelski user [puppet] - 10https://gerrit.wikimedia.org/r/1197672 [15:06:00] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [15:06:11] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [15:06:26] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [15:06:43] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [15:06:55] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [15:07:41] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [15:07:49] !log cgoubert@deploy2002 helmfile [staging] START 
helmfile.d/services/push-notifications: apply [15:08:00] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [15:08:10] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:08:11] (03PS16) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) [15:08:17] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:08:38] !log rebooting phabricator prod server [15:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:19] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [15:09:29] (03PS17) 10Herron: thanos-rule: add pilot instance [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) [15:09:37] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [15:09:42] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:09:51] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:09:55] (03CR) 10Federico Ceratto: "Updated" [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:10:24] is phabricator not responding for anyone else? 
[15:10:31] taavi: yes [15:10:33] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [15:10:39] taavi: having the same issue [15:10:39] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [15:10:43] scraped again? [15:10:49] no [15:10:52] apparently planned [15:10:57] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [15:11:08] * taavi sees no relevant mails or IRC messages [15:11:09] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:11:12] (03PS3) 10Aaron Schulz: Set wgRestSandboxSpecs['wmf-restbase'] to use the static specs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190743 (https://phabricator.wikimedia.org/T396805) [15:11:20] mu.tante just !logged a reboot [15:11:37] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [15:11:51] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [15:11:54] taavi: someone referenced a Slack message I haven't yet found [15:11:56] (03PS1) 10Scott French: Use request variables for internal headers in known-client DSL [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1197673 [15:12:22] (03CR) 10Aaron Schulz: "Thanks for the work on https://phabricator.wikimedia.org/T397203 . Once this is out, I can merge the sandbox config changes."
[puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [15:12:37] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [15:12:49] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [15:13:00] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:13:11] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:13:36] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [15:13:48] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [15:13:56] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [15:13:58] phabricator is back for me [15:14:03] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [15:14:41] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [15:14:52] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [15:14:58] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [15:15:05] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [15:15:09] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:15:19] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:15:23] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [15:15:36] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [15:15:42] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [15:16:00] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: 
apply [15:16:06] (03CR) 10Dzahn: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1192636/7340/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [15:16:09] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:16:29] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:16:45] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:17:29] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:17:35] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [15:17:55] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [15:19:46] !log Fix for envoy x-request-id and tracing deployed in all envs in staging and mw-on-k8s prod - T407826 [15:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:50] T407826: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826 [15:20:36] (03PS2) 10Dzahn: phabricator: drop cluster_search config [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) [15:21:27] (03PS2) 10Dpogorzelski: feat: add dpogorzelski user [puppet] - 10https://gerrit.wikimedia.org/r/1197672 [15:22:15] (03PS2) 10Scott French: Improvements to known-client DSL and entity deletion [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1197673 [15:22:25] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1192636/7342/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [15:22:27] (03CR) 10Brennen Bearnes: [C:04-1] "Just realized there's an issue here when running `scap deploy`, will fix and report back." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [15:23:59] (03CR) 10Scott French: [V:03+2] "Tested locally." [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1197673 (owner: 10Scott French) [15:25:44] (03CR) 10Scott French: [V:03+2 C:03+2] Improvements to known-client DSL and entity deletion [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1197673 (owner: 10Scott French) [15:27:50] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: Improvements to known-client DSL and entity deletion - swfrench@cumin2002" [15:27:52] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Improvements to known-client DSL and entity deletion - swfrench@cumin2002 [15:28:41] (03CR) 10Elukey: sre.hardware.upgrade-firmware: improve matching for SSD checks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [15:28:45] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Improvements to known-client DSL and entity deletion - swfrench@cumin2002 [15:28:47] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: Improvements to known-client DSL and entity deletion - swfrench@cumin2002" [15:30:13] 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#11294247 (10BCornwall) Should the broken tests as mentioned in T398161#11227347 be brought up in a new ticket? 
[15:30:17] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1108.eqiad.wmnet [15:30:23] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp1109.eqiad.wmnet [15:30:42] (03CR) 10BCornwall: [V:03+1] "Thanks, I totally derped on that." [puppet] - 10https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688) (owner: 10BCornwall) [15:32:35] (03CR) 10Elukey: [C:03+1] Update provision and interface validator to support Nokia TOR (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1197647 (https://phabricator.wikimedia.org/T405637) (owner: 10Cathal Mooney) [15:35:32] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on P{sessionstore2*} and P{P:Cassandra} [15:36:56] (03PS1) 10Jcrespo: [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) [15:37:52] (03CR) 10CI reject: [V:04-1] [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [15:38:00] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2078.codfw.wmnet [15:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:28] elukey@cumin2002 upgrade-firmware (PID 3720618) is awaiting input [15:42:17] 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11294344 (10SLyngshede-WMF) 05Open→03Resolved [15:42:55] 07Puppet, 10MobileFrontend (Tracking): Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop 
view") - https://phabricator.wikimedia.org/T60425#11294353 (10Jdlrobson-WMF) @Jdforrester-WMF I believe this can now be resolved. Visit en.m.wikipedia.org on my de... [15:44:14] (03PS1) 10Cathal Mooney: preseed.yaml: expand regex for sretest100x to include 1005/1006 [puppet] - 10https://gerrit.wikimedia.org/r/1197678 (https://phabricator.wikimedia.org/T405560) [15:44:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:45:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964#11294373 (10elukey) @Jhancock.wm if you have time I'd ask you some help to try both: * Clear the disk config: we can do it since the host is depooled, I... [15:45:37] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1197308 (https://phabricator.wikimedia.org/T284558) (owner: 10Majavah) [15:45:44] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{sessionstore1*} and P{P:Cassandra} [15:46:53] (03PS2) 10Jcrespo: [WIP] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1197676 (https://phabricator.wikimedia.org/T393692) [15:47:10] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:47:48] (03CR) 10Cathal Mooney: "Thanks for the review!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1197647 (https://phabricator.wikimedia.org/T405637) (owner: 10Cathal Mooney) [15:48:38] (03CR) 10Cathal Mooney: [C:03+2] Update provision and interface validator to support Nokia TOR [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1197647 (https://phabricator.wikimedia.org/T405637) (owner: 10Cathal Mooney) [15:48:55] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host zuul2001.codfw.wmnet with OS trixie [15:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:49:25] (03CR) 10D3r1ck01: Add virtual domain mapping for OAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: 10D3r1ck01) [15:49:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:50:38] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [15:50:43] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [15:50:48] (03Merged) 10jenkins-bot: Update provision and interface validator to support Nokia TOR [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1197647 (https://phabricator.wikimedia.org/T405637) (owner: 10Cathal Mooney) [15:51:00] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mathoid: apply [15:51:04] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mathoid: apply [15:51:13] !log 
rzl@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [15:51:29] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [15:51:45] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [15:52:10] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:52:10] (03PS6) 10Herron: profile::thanos::query::store_config: add define [puppet] - 10https://gerrit.wikimedia.org/r/1197669 (https://phabricator.wikimedia.org/T406054) [15:52:35] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [15:52:42] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:52:53] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:52:58] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [15:53:13] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:53:34] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [15:53:38] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:53:40] (03CR) 10Herron: "true! 
thanks I've rebased this on top of a change to abstract the thanos query store config a bit to set targets from each of the instance" [puppet] - 10https://gerrit.wikimedia.org/r/1192209 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [15:54:15] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:55:05] (03CR) 10Vgutierrez: [C:03+1] benthos: webrequest_live: add ja4h field [puppet] - 10https://gerrit.wikimedia.org/r/1197627 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis)