[00:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:08:21] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1189377
[00:08:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1189377 (owner: TrainBranchBot)
[00:13:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:31:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:32:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1189377 (owner: TrainBranchBot)
[00:36:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:54:50] (CR) Btullis: [C:+1] opensearch-operator: fix pod security settings [deployment-charts] - https://gerrit.wikimedia.org/r/1189320
(https://phabricator.wikimedia.org/T362978) (owner: Bking)
[00:56:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:00:44] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:12:30] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 46s)
[01:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[02:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:36:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:55:48] (PS1) KartikMistry: Update Recommendation API to 2025-09-15-194552-production [deployment-charts] - https://gerrit.wikimedia.org/r/1189380 (https://phabricator.wikimedia.org/T404223)
[03:03:18] (PS1) KartikMistry: Update cxserver to 2025-09-16-161231-production [deployment-charts] - https://gerrit.wikimedia.org/r/1189381 (https://phabricator.wikimedia.org/T394008)
[03:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:41:21] SRE: Update Wikitech "Search Console Data" doc to align with current ITS-first request process - https://phabricator.wikimedia.org/T404927#11192414 (nshahquinn-wmf) Open→Resolved a:nshahquinn-wmf I was actually coincidentally working on search console documentation, so I've gone ahead and mad...
[03:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:56:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:03:17] FIRING: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:08:17] RESOLVED: ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:09:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:37] RECOVERY - mysqld processes on es2027 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:16:43] RECOVERY - MariaDB read only es3 on es2027 is OK: Version 10.11.13-MariaDB-log, Uptime 7s, read_only: True, event_scheduler: True, 4.15 QPS, connection latency: 0.028914s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[05:19:30] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2027 gradually with 4 steps - Pool es2027.codfw.wmnet in after cloning
[05:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:32:49] (PS4) Arnaudb: gerrit: toggle mod_qos log_only off [puppet] - https://gerrit.wikimedia.org/r/1189386 (https://phabricator.wikimedia.org/T402611)
[05:32:49] (CR) Arnaudb: "I'll send a notice on IRC and slack before merging this" [puppet] - https://gerrit.wikimedia.org/r/1189386 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb)
[05:34:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:06] FIRING: [2x]
SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[05:54:58] Deploying cxserver..
[05:56:16] (CR) KartikMistry: [C:+2] Update cxserver to 2025-09-16-161231-production [deployment-charts] - https://gerrit.wikimedia.org/r/1189381 (https://phabricator.wikimedia.org/T394008) (owner: KartikMistry)
[05:57:56] (Merged) jenkins-bot: Update cxserver to 2025-09-16-161231-production [deployment-charts] - https://gerrit.wikimedia.org/r/1189381 (https://phabricator.wikimedia.org/T394008) (owner: KartikMistry)
[05:59:56] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T0600)
[06:00:05] marostegui, Amir1, and federico3: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T0600).
[06:00:21] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:04:36] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:05:10] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:05:20] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2027 gradually with 4 steps - Pool es2027.codfw.wmnet in after cloning
[06:05:21] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2027.codfw.wmnet onto es2050.codfw.wmnet
[06:05:29] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:06:02] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:08:23] !log Updated cxserver to 2025-09-16-161231-production (T394008, T404567, T404298, T404181)
[06:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:32] T394008: CXServer doesn't support section suggestions for "be-tarask" language code - https://phabricator.wikimedia.org/T394008
[06:08:33] T404567: Post-creation work for tokwiki - https://phabricator.wikimedia.org/T404567
[06:08:35] T404298: Can't translate en:Tokyo in Gujarati - https://phabricator.wikimedia.org/T404298
[06:08:35] T404181: When templatedata is missing cxserver fails to extract template params from template source code - https://phabricator.wikimedia.org/T404181
[06:12:23] (CR) Brouberol: opensearch-operator: fix pod security settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) (owner: Bking)
[06:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:34:37] !log jynus@cumin1003 dbctl commit (dc=all):
'Depool es2027 T404940', diff saved to https://phabricator.wikimedia.org/P83420 and previous config saved to /var/cache/conftool/dbconfig/20250918-063436-jynus.json
[06:34:42] T404940: es2027 database unhealthy - https://phabricator.wikimedia.org/T404940
[06:36:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:16] (CR) Muehlenhoff: [C:+2] Apply installserver role to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189169 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[06:38:54] (CR) Majavah: [C:+2] backy2: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188825 (owner: Majavah)
[06:45:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[06:46:38] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.912 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:46:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:51:33] (PS1) Muehlenhoff: homer: Update the DHCP server in eqiad [homer/public] - https://gerrit.wikimedia.org/r/1189389 (https://phabricator.wikimedia.org/T396487)
[06:56:22] (PS1) Slyngshede: Release version 0.1.13 [software/bitu] - https://gerrit.wikimedia.org/r/1189390 (https://phabricator.wikimedia.org/T403691)
[07:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning
backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T0700)
[07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] o/
[07:00:13] I can deploy
[07:04:21] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[07:05:10] (Merged) jenkins-bot: cirrus: Reduce galleries weight in search on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: Ebernhardson)
[07:06:00] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]]
[07:06:04] (PS2) Majavah: P:toolforge::prometheus: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188829
[07:06:05] T401590: Adjust CirrusSearchNamespaceWeights for Commons - https://phabricator.wikimedia.org/T401590
[07:06:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:06:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:27] (CR) Giuseppe Lavagetto: [C:+2] Add deprecations to varnish [puppet] - https://gerrit.wikimedia.org/r/1180712 (https://phabricator.wikimedia.org/T398161) (owner: Giuseppe Lavagetto)
[07:12:10] !log dcausse@deploy1003 dcausse, ebernhardson: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:12:15] T401590: Adjust CirrusSearchNamespaceWeights for Commons - https://phabricator.wikimedia.org/T401590
[07:16:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:16:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:17:06] !log dcausse@deploy1003 dcausse, ebernhardson: Continuing with sync
[07:18:55] (CR) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[07:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:20:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:21:46] (PS4) Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576)
[07:22:21] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182186|cirrus: Reduce galleries weight in search on commons (T401590)]] (duration: 16m 20s)
[07:22:25] T401590: Adjust CirrusSearchNamespaceWeights for Commons - https://phabricator.wikimedia.org/T401590
[07:26:09] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1189390 (https://phabricator.wikimedia.org/T403691) (owner: Slyngshede)
[07:27:37] (CR) Muehlenhoff:
[C:+2] Update DHCP server in eqiad to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189170 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:28:53] (CR) Filippo Giunchedi: [C:+1] "\o/" [puppet] - https://gerrit.wikimedia.org/r/1188829 (owner: Majavah)
[07:29:00] (CR) Majavah: [C:+2] P:toolforge::prometheus: Drop buster support [puppet] - https://gerrit.wikimedia.org/r/1188829 (owner: Majavah)
[07:32:15] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6980/co" [puppet] - https://gerrit.wikimedia.org/r/1188827 (owner: Majavah)
[07:32:45] (CR) Majavah: [V:+1 C:+2] P:toolforge::checker: Remove absent checks [puppet] - https://gerrit.wikimedia.org/r/1188827 (owner: Majavah)
[07:34:33] (CR) Majavah: [V:+2 C:+2] "ignoring typos false positive" [puppet] - https://gerrit.wikimedia.org/r/1188828 (owner: Majavah)
[07:36:12] (PS1) Majavah: openstack: Drop obsolete linuxbridge config files [puppet] - https://gerrit.wikimedia.org/r/1189393
[07:36:12] (PS1) Majavah: P:openstack: nova: Drop obsolete settings [puppet] - https://gerrit.wikimedia.org/r/1189394
[07:38:03] (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6981/console" [puppet] - https://gerrit.wikimedia.org/r/1189394 (owner: Majavah)
[07:40:34] (PS1) Majavah: O:aptly::server: Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1189395 (https://phabricator.wikimedia.org/T399076)
[07:40:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:41:26] FIRING: [2x] SystemdUnitFailed:
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:05] (CR) Majavah: [C:+2] O:aptly::server: Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1189395 (https://phabricator.wikimedia.org/T399076) (owner: Majavah)
[07:42:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:45:02] (CR) Slyngshede: "Right now the nda group isn't sync'ed because it's not listed as one of the groups Netbox needs. We did talk about it at a previous Infras" [puppet] - https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) (owner: Slyngshede)
[07:46:41] (CR) Slyngshede: [C:+2] Release version 0.1.13 [software/bitu] - https://gerrit.wikimedia.org/r/1189390 (https://phabricator.wikimedia.org/T403691) (owner: Slyngshede)
[07:48:36] (CR) Muehlenhoff: [C:+2] Point webproxy in eqiad to install1005 [dns] - https://gerrit.wikimedia.org/r/1189173 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:48:39] (CR) Jelto: [C:+1] "lgtm but we should closely monitor metrics and user reports. I recall cloning repos over https opens several connections.
So we should mak" [puppet] - https://gerrit.wikimedia.org/r/1189386 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb)
[07:48:50] !log jmm@dns1004 START - running authdns-update
[07:49:20] (Merged) jenkins-bot: Release version 0.1.13 [software/bitu] - https://gerrit.wikimedia.org/r/1189390 (https://phabricator.wikimedia.org/T403691) (owner: Slyngshede)
[07:50:02] !log jmm@dns1004 END - running authdns-update
[07:50:41] (CR) Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[07:55:53] (CR) Arnaudb: [C:+2] "100%! I'll progressively rollout from spare to primary with puppet-agent disabled" [puppet] - https://gerrit.wikimedia.org/r/1189386 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb)
[07:57:07] (CR) Muehlenhoff: [C:+2] Update the proxies used by cloudcumin to install1005 [puppet] - https://gerrit.wikimedia.org/r/1189171 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:08:37] (CR) Elukey: [C:+1] homer: Update the DHCP server in eqiad [homer/public] - https://gerrit.wikimedia.org/r/1189389 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:09:21] (CR) Muehlenhoff: [C:+2] homer: Update the DHCP server in eqiad [homer/public] - https://gerrit.wikimedia.org/r/1189389 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:12:56] (PS1) Slyngshede: IDM: Failover for 0.1.13 upgrade [dns] - https://gerrit.wikimedia.org/r/1189434
[08:19:25] sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: SwitchCoreInterfaceDown (instance ssw1-f1-codfw:9804) - https://phabricator.wikimedia.org/T404946 (LSobanski) NEW
[08:21:22] (CR) Novem Linguae: "As a recently added volunteer NDA, I find these IDP-protected tools a lot like a second Wikitech.
There's great info in some of them, and " [puppet] - https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) (owner: Slyngshede)
[08:25:51] (CR) Elukey: "I am super sorry to review this only now, thanks a lot for the patch :) I totally understand that this is a poc and it needs more refineme" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1173425 (https://phabricator.wikimedia.org/T397696) (owner: CDanis)
[08:31:00] (PS1) Gmodena: admin: add sk-ssh-ed25519 key for gmodena [puppet] - https://gerrit.wikimedia.org/r/1189435
[08:32:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:33:14] (CR) Slyngshede: [C:+2] IDM: Failover for 0.1.13 upgrade [dns] - https://gerrit.wikimedia.org/r/1189434 (owner: Slyngshede)
[08:33:35] !log slyngshede@dns1004 START - running authdns-update
[08:34:53] !log slyngshede@dns1004 END - running authdns-update
[08:36:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:40:30] (CR) Btullis: [C:+2] Fix the webhook TLS configuration for the spark-operator [deployment-charts] - https://gerrit.wikimedia.org/r/1189279 (https://phabricator.wikimedia.org/T318712) (owner: Btullis)
[08:41:12] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6982/co" [puppet] - https://gerrit.wikimedia.org/r/1189221 (https://phabricator.wikimedia.org/T404871) (owner: Joal)
[08:46:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:46:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:48:26] (Merged) jenkins-bot: Fix the webhook TLS configuration for the spark-operator [deployment-charts] - https://gerrit.wikimedia.org/r/1189279 (https://phabricator.wikimedia.org/T318712) (owner: Btullis)
[08:55:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:55:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:56:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:11:00] (CR) Muehlenhoff: [C:+2] Remove installserver role from install1004 for decom [puppet] - https://gerrit.wikimedia.org/r/1189176 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[09:16:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:16:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:17:40] PROBLEM - TFTP service on install1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[09:18:20] PROBLEM - Squid on install1004 is CRITICAL: connect to address 208.80.154.74 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy
[09:18:22] PROBLEM - HTTP on install1004 is
CRITICAL: connect to address 208.80.154.74 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers
[09:19:00] FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:21:06] SRE-SLO, EditCheck, Lift-Wing, Machine-Learning-Team, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11192908 (isarantopoulos) Open→Resolved I'm resolving this task as the work to define the SLO and implemen...
[09:21:34] (PS1) David Caro: prometheus: add memorymax parameter [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199)
[09:23:42] (CR) David Caro: "Tested in toolsbeta:" [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: David Caro)
[09:24:00] RESOLVED: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:24:18] (CR) Filippo Giunchedi: prometheus: add memorymax parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: David Caro)
[09:24:24] (CR) David Caro: "Hmm...
I probably should make that memorymax setting variable, depending on the available memory in the node, as toolsbeta vm is way small" [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: David Caro)
[09:25:45] (PS1) Daniel Kinzler: api-gateway: Remove .tpl extension [deployment-charts] - https://gerrit.wikimedia.org/r/1189440
[09:27:11] (PS2) David Caro: prometheus: add memorymax parameter [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199)
[09:27:20] (CR) David Caro: prometheus: add memorymax parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: David Caro)
[09:27:38] (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: David Caro)
[09:29:23] (CR) CI reject: [V:-1] api-gateway: Remove .tpl extension [deployment-charts] - https://gerrit.wikimedia.org/r/1189440 (owner: Daniel Kinzler)
[09:30:57] (CR) CI reject: [V:-1] prometheus: add memorymax parameter [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: David Caro)
[09:31:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:31:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:32:04] (PS3) David Caro: prometheus: add memorymax parameter [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199)
[09:32:38] (PS4) David Caro: prometheus: add memorymax parameter [puppet] - https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199)
[09:32:40] (CR) David Caro: "check experimental" [puppet] -
10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: 10David Caro) [09:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:45:04] (03PS5) 10David Caro: prometheus: add memorymax parameter [puppet] - 10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) [09:45:07] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: 10David Caro) [09:46:39] train blocker https://phabricator.wikimedia.org/T404902 is *mostly* handled; I'm waiting for one of my USTZ teammates to wake up and have a look at it, but I'm reasonably confident it'll be done before group2 planned time [09:48:56] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188437 (https://phabricator.wikimedia.org/T404639) (owner: 10Bearloga) [09:51:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.915 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:51:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.926 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:52:24] (03CR) 10David Caro: "The experimental failure is because alert1002 does not seem to have the facts in the puppet7 pcc setup:" [puppet] - 10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: 10David Caro) [09:57:09] (03PS6) 10David Caro: prometheus: add memorymax parameter [puppet] - 
10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) [09:57:30] (03PS7) 10David Caro: prometheus: add memorymax parameter [puppet] - 10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) [09:57:33] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: 10David Caro) [09:57:35] (03PS1) 10Jelto: gitlab: enable object storage for packages [puppet] - 10https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922) [09:58:05] (03PS4) 10Federico Ceratto: clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 [09:59:34] (03CR) 10Arnaudb: [C:03+1] gitlab: enable object storage for packages [puppet] - 10https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1000) [10:00:15] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6983/co" [puppet] - 10https://gerrit.wikimedia.org/r/1189444 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:02:33] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: add memorymax parameter [puppet] - 10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: 10David Caro) [10:03:01] (03CR) 10David Caro: [C:03+2] prometheus: add memorymax parameter [puppet] - 10https://gerrit.wikimedia.org/r/1189439 (https://phabricator.wikimedia.org/T404199) (owner: 10David Caro) [10:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:16:00] (03PS1) 10JMeybohm: Add known-client and ipblock-source entities [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1189446 [10:16:18] (03CR) 10JMeybohm: [V:03+2 C:03+2] Add known-client and ipblock-source entities [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1189446 (owner: 10JMeybohm) [10:17:55] (03CR) 10Btullis: [C:03+1] opensearch-operator: fix pod security settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) (owner: 10Bking) [10:18:49] !log jayme@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002" [10:18:50] !log jayme@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002 [10:19:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - jayme@cumin1002 [10:19:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - jayme@cumin1002" [10:26:23] !log btullis@deploy1003 Started deploy [analytics/refinery@5feb53f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@5feb53f9] [10:27:13] !log btullis@deploy1003 Finished deploy [analytics/refinery@5feb53f] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@5feb53f9] (duration: 00m 50s) [10:27:26] (03CR) 10Btullis: [V:03+1 C:03+1] Add resource preemption to Hadoop Yarn scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1189221 (https://phabricator.wikimedia.org/T404871) (owner: 10Joal) [10:29:11] FIRING: SystemdUnitFailed: 
prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:44] (03CR) 10Btullis: [V:03+1 C:03+1] "We're going to test this on the analytics-test-hadoop cluster first, by disabling puppet on an-master100[3-4] temporarily." [puppet] - 10https://gerrit.wikimedia.org/r/1189221 (https://phabricator.wikimedia.org/T404871) (owner: 10Joal) [10:32:15] !log btullis@deploy1003 Started deploy [analytics/refinery@5feb53f]: Regular analytics weekly train [analytics/refinery@5feb53f9] [10:36:51] (03CR) 10Btullis: [V:03+1 C:03+2] Add resource preemption to Hadoop Yarn scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1189221 (https://phabricator.wikimedia.org/T404871) (owner: 10Joal) [10:37:23] !log btullis@deploy1003 Finished deploy [analytics/refinery@5feb53f]: Regular analytics weekly train [analytics/refinery@5feb53f9] (duration: 05m 08s) [10:39:06] !log btullis@deploy1003 Started deploy [analytics/refinery@5feb53f] (thin): Regular analytics weekly train THIN [analytics/refinery@5feb53f9] [10:39:17] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on install1004.wikimedia.org with reason: being shut down [10:40:02] !log btullis@deploy1003 Finished deploy [analytics/refinery@5feb53f] (thin): Regular analytics weekly train THIN [analytics/refinery@5feb53f9] (duration: 00m 55s) [10:40:46] 06SRE, 10Hiddenparma, 06Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11193102 (10Joe) Coming to @SLyngshede-WMF's concern, I think some of them are valid, like having disjoint configuration going besides the actual content of a file, including t... 
[10:45:37] !log drain ssw1-f1-eqiad of traffic to perform reboot T400783 [10:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:41] T400783: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783 [10:54:33] (03PS2) 10Daniel Kinzler: api-gateway: Remove .tpl extension [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 [10:55:27] PROBLEM - OpenSearch health check for shards on 9400 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [10:55:41] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1009.eqiad.wmnet are marked down but pooled: wdqs-heavy-queries_8888: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: inference_30443: Servers ml-serve1008.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1020.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but p [10:55:41] dqs_80: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: k8s-ingress-ml-serve_31443: Servers ml-serve1010.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:55:41] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1009.eqiad.wmnet are marked down but pooled: wdqs-heavy-queries_8888: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: inference_30443: Servers ml-serve1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but p [10:55:41] dqs_80: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: k8s-ingress-ml-serve_31443: Servers 
ml-serve1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:56:11] PROBLEM - MariaDB Replica IO: analytics-meta-replica on an-mariadb1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@10.64.138.8:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on 10.64.138.8 (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 1.728% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:56:16] PROBLEM - MariaDB Replica IO: s8 #page on db1192 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1193.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1193.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:16] (03CR) 10CI reject: [V:04-1] api-gateway: Remove .tpl extension [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 (owner: 10Daniel Kinzler) [10:56:20] PROBLEM - MariaDB Replica IO: s8 #page on db1203 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1193.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1193.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:20] PROBLEM - MariaDB Replica IO: es6 #page on es1037 is CRITICAL: CRITICAL slave_io_state 
Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1038.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1038.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:21] PROBLEM - MariaDB Replica IO: es7 #page on es1039 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1035.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1035.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:22] PROBLEM - MariaDB Replica IO: es6 #page on es1036 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1038.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1038.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:22] PROBLEM - MariaDB Replica IO: s8 on dbstore1009 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1193.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1193.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:24] PROBLEM - MariaDB Replica IO: es7 #page on es1040 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1035.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1035.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:36] PROBLEM - MariaDB Replica IO: backup1-eqiad on db1205 
is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1204.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1204.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:44] what happened? [10:57:00] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:05] s8, backup-1, es7 ? [10:57:16] looking [10:57:20] es6 too [10:57:32] why so many different [10:57:35] Same rack perhaps? [10:57:36] Amir1? [10:57:48] <_joe_> here if needed [10:57:49] ah [10:57:52] here now [10:58:00] thanks for the ping [10:58:02] hey [10:58:07] I drained traffic from ssw1-f1-eqiad [10:58:20] I don't believe it should be related.... [10:58:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [10:58:29] <_joe_> topranks: can you rollback? 
just to be sure [10:58:35] no probs [10:58:37] <_joe_> yeah we're down [10:58:50] mw latency shoot up [10:58:51] ok done [10:59:00] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:59:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1192', diff saved to https://phabricator.wikimedia.org/P83425 and previous config saved to /var/cache/conftool/dbconfig/20250918-105905-ladsgroup.json [10:59:08] I'm looking at the lacks [10:59:12] *racks [10:59:20] RECOVERY - OpenSearch health check for shards on 9400 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: green, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 823, active_shards: 1648, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_t [10:59:20] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:59:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1203', diff saved to https://phabricator.wikimedia.org/P83426 and previous config saved to /var/cache/conftool/dbconfig/20250918-105922-ladsgroup.json [10:59:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:59:55] !incidents [10:59:55] 6752 (ACKED) db1192 (paged)/MariaDB Replica IO: s8 (paged) [10:59:56] 6753 
(ACKED) db1203 (paged)/MariaDB Replica IO: s8 (paged) [10:59:56] 6754 (ACKED) es1037 (paged)/MariaDB Replica IO: es6 (paged) [10:59:56] 6755 (ACKED) es1036 (paged)/MariaDB Replica IO: es6 (paged) [10:59:56] 6756 (ACKED) es1039 (paged)/MariaDB Replica IO: es7 (paged) [10:59:56] 6757 (ACKED) es1040 (paged)/MariaDB Replica IO: es7 (paged) [11:00:10] RECOVERY - MariaDB Replica IO: analytics-meta-replica on an-mariadb1002 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:17] RECOVERY - MariaDB Replica IO: s8 #page on db1192 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:21] RECOVERY - MariaDB Replica IO: s8 #page on db1203 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:22] RECOVERY - MariaDB Replica IO: es7 #page on es1039 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:22] RECOVERY - MariaDB Replica IO: es6 #page on es1036 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:23] RECOVERY - MariaDB Replica IO: es6 #page on es1037 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:23] RECOVERY - MariaDB Replica IO: s8 on dbstore1009 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:23] getting IC [11:00:25] <_joe_> ook [11:00:25] Not a single rack but so far same row [11:00:26] RECOVERY - MariaDB Replica IO: es7 #page on es1040 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:28] I 
spot-checked the hosts and it seems all eqiad row E and F [11:00:33] Yup [11:00:37] RECOVERY - MariaDB Replica IO: backup1-eqiad on db1205 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:00:45] <_joe_> looks like things are ok now? [11:00:49] yeah [11:00:50] ml nodes also up [11:00:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:01:08] replicas even recovered [11:01:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:01:19] <_joe_> latency and errors going down [11:01:27] <_joe_> topranks: did you do anything? [11:01:28] I'm seeing e3, f1, f2, f5... 
[11:01:36] _joe_: yes rolled back [11:01:41] <_joe_> ok :) [11:02:21] remind me to repool the replicas [11:02:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1298:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1298 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:02:41] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:02:41] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:03:01] dbs should recover automatically after network is back, but would other services need help? [11:03:09] I'm sort of scratching my head, I made that device "less preferred" for traffic, rather than removing it from the available options [11:03:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [11:03:55] latency still seems a bit high, maybe? [11:03:59] mw one I mean [11:04:00] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:04:14] any other ongoing impact? [11:04:34] I think latency recovered [11:04:37] <_joe_> not really (mw latency) [11:04:45] edits? 
[11:04:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:05:18] <_joe_> edit rate recovered [11:05:23] <_joe_> quicker than latency ofc [11:05:32] "missing kubernetes logs" is that worrying? [11:05:43] or it is just log transmission lag? [11:06:06] that's the only ongoing alert I am seeing [11:06:35] all p*ges resolved [11:07:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on wikikube-worker1086:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:07:59] does rsyslog need a restart or something, or just wait to catch up? does anyone know? [11:08:30] <_joe_> jynus: asked serviceops to take care of it [11:08:35] thanks [11:08:50] it dropped to 0 since 10:54 [11:10:22] it got fixed, volume now spiked up, was something done? 
[11:12:40] RESOLVED: [5x] KubernetesRsyslogDown: rsyslog on wikikube-worker1086:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:13:05] I will wait for log volume to go back to normal to resolve the incident [11:13:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:30] I think globally things look fine now [11:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:19:01] Thanks for handling this. We had a short postgresql outage as a result of this, which had a knock-on effect on Airflow, but everything recovered without intervention. [11:19:22] good to hear (the recovery) [11:19:31] I will add that to the doc [11:19:44] I think things like this can be interesting for reliability analysis [11:19:58] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193185 (10cmooney) So draining traffic from the node did not go as planned. This config was applied: ` set protocols bgp graceful-shutdown sender set routing-instanc... 
[11:21:13] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1203* gradually with 4 steps - Work done [11:21:19] thanks all for jumping in - my apologies for the error. I did not expect this outcome; we will indeed need to review in full [11:21:30] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1192* gradually with 4 steps - Work done [11:21:39] no worries, topranks [11:23:04] topranks: I think one thing that we could change is to give a heads up on maintenance to oncall, as it was surprisingly impacting, and that can help reduce confusion. What do you think? [11:23:30] yeah I really should have done that. 100% [11:23:39] apologies if it was logged, there was so much alerting that it could have been buried in the logs [11:23:43] I normally do for anything, it was an omission [11:23:54] I did log the change to SAL but that's not gonna register [11:24:15] is there an incident doc? [11:24:29] yep, put it on topic on the other channel [11:24:36] ok thanks [11:24:37] please help us complete the maintenance part [11:24:47] while I complete the response part [11:26:14] I will consider this fully resolved [11:28:19] I don't think this was too worrying, I think the doc will mostly help prevent and debug [11:29:07] what's the procedure for train blockers? https://phabricator.wikimedia.org/T404902 is fixed and merged to master, should be deployed to group0/group1 and integrated in what's getting deployed on group2. do i close the issue and let the train conductor handle it, do i backport it myself, something else? [11:29:34] (and by "should be deployed" i mean "has to be deployed") [11:31:07] hi, ihurbain unsure if you are referring to the recent incident or just a deployment issue. The incident itself is fixed and should not block anything. 
[11:33:19] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete configuration options from SSH type [puppet] - 10https://gerrit.wikimedia.org/r/1189149 (owner: 10Muehlenhoff) [11:33:31] jynus: independent of the conversation/incident above; i just have a bug that's "we saw this during group1 and want to fix this before the train goes to group2". patch is merged to master, but it's unclear to me if i need to do anything else about it :) [11:34:11] I see, a more veteran deployer should be able to advise on that, if you don't get anyone, please contact release engineering [11:34:39] ack [11:35:14] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Release CampaignEvents extension to Wikimedia Commons - Sept 18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189267 (https://phabricator.wikimedia.org/T403667) (owner: 10Cmelo) [11:35:23] ihurbain: it should be deployed to group0/group1 [11:35:25] I can do it [11:35:31] jouncebot: nowandnext [11:35:31] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [11:35:32] In 0 hour(s) and 24 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1200) [11:35:40] Amir1: that'd be lovely :) [11:35:55] (03PS1) 10Ladsgroup: Do not access bundle on non-Parsoid content [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189453 (https://phabricator.wikimedia.org/T404902) [11:36:01] (03CR) 10Ladsgroup: [C:03+2] Do not access bundle on non-Parsoid content [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189453 (https://phabricator.wikimedia.org/T404902) (owner: 10Ladsgroup) [11:36:06] going in! [11:36:09] thank you! 
[11:40:29] (03Abandoned) 10Slyngshede: P:cache::haproxy avoid hardcoding wme ranges [puppet] - 10https://gerrit.wikimedia.org/r/1184772 (owner: 10Slyngshede) [11:41:28] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189205 (owner: 10Muehlenhoff) [11:42:27] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [11:42:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959 (10cmooney) 03NEW p:05Triage→03High [11:43:18] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [11:43:46] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193246 (10BTullis) Hi. In case it helps with your investigation, I can tell you that we observed a brief loss of connectivity on the dse-k8s cluster, which may well ha... 
[11:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:44:56] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [11:46:11] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [11:46:26] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [11:46:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:46:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:47:47] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [11:49:49] (03Merged) 10jenkins-bot: Do not access bundle on non-Parsoid content [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189453 (https://phabricator.wikimedia.org/T404902) (owner: 10Ladsgroup) [11:50:45] Amir1: Have you repooled the replicas ? Reminding you just in case :-) [11:50:54] yup, thanks! 
[11:51:00] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1189453|Do not access bundle on non-Parsoid content (T404902)]] [11:51:04] T404902: Wikimedia\Assert\InvariantException: Invariant failed: getBasePageBundle called on non-Parsoid ContentHolder - https://phabricator.wikimedia.org/T404902 [11:51:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:51:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:57:06] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1189453|Do not access bundle on non-Parsoid content (T404902)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:57:11] T404902: Wikimedia\Assert\InvariantException: Invariant failed: getBasePageBundle called on non-Parsoid ContentHolder - https://phabricator.wikimedia.org/T404902 [11:57:52] ihurbain: it's in testservers now. Can you test it? [11:58:26] i can try yeah (unsure how/if parsercache and testservers interact, so we'll see if the test is valid or not ^^;) [11:58:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:59:20] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11193277 (10cmooney) >>! In T400783#11193246, @BTullis wrote: > Hi. In case it helps with your investigation, I can tell you that we observed a brief loss of connectivit... 
[12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1200) [12:01:05] (03PS5) 10Federico Ceratto: clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 [12:02:08] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:02:38] !log installing libjson-xs-perl security updates [12:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:59] Amir1: "maybe". it doesn't seem to crash, whether it's because it's not trying to read the cache or because it can actually read it without error is TBD :D [12:03:17] noted [12:04:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool es2027', diff saved to https://phabricator.wikimedia.org/P83433 and previous config saved to /var/cache/conftool/dbconfig/20250918-120441-ladsgroup.json [12:06:20] (03CR) 10Ladsgroup: tables-catalog: add CommunityRequests tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: 10MusikAnimal) [12:06:39] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1203* gradually with 4 steps - Work done [12:06:56] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1192* gradually with 4 steps - Work done [12:07:06] !log installing libyaml-libyaml-perl security updates [12:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:33] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189453|Do not access bundle on non-Parsoid content (T404902)]] (duration: 16m 33s) [12:07:38] T404902: Wikimedia\Assert\InvariantException: Invariant failed: getBasePageBundle called on non-Parsoid ContentHolder - https://phabricator.wikimedia.org/T404902 [12:10:58] looks like we're actually good! 
thanks Amir1 :) [12:11:09] \o/ [12:11:26] closing bug then \o/ [12:12:16] (03CR) 10Brouberol: [C:03+1] dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:16:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:17] 06SRE, 06collaboration-services, 10Gerrit, 06Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#11193359 (10ABran-WMF) 05In progress→03Resolved this has been [[ https://wikitech.wikimedia.org/wiki/DNS#Emergency_Measures | done ]],... [12:20:44] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11193379 (10Jhancock.wm) anything i can try onsite to help? [12:21:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11193382 (10Jhancock.wm) @elukey heads up, i'm gonna try swapping the console card with another CP server to see if it's the card or something else. will probab... 
[12:27:57] (03PS2) 10Zabe: Initial configuration for mswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188892 (https://phabricator.wikimedia.org/T404698) [12:28:53] (03PS10) 10Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [12:30:00] (03PS1) 10Joal: Update hadoop capacity scheduler preemption [puppet] - 10https://gerrit.wikimedia.org/r/1189464 (https://phabricator.wikimedia.org/T404871) [12:31:07] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [12:31:08] (03CR) 10Btullis: [C:03+2] Update hadoop capacity scheduler preemption [puppet] - 10https://gerrit.wikimedia.org/r/1189464 (https://phabricator.wikimedia.org/T404871) (owner: 10Joal) [12:33:21] (03Abandoned) 10Ilias Sarantopoulos: ml-services: enable GPU for edit-check in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1186448 (https://phabricator.wikimedia.org/T403378) (owner: 10Ilias Sarantopoulos) [12:34:10] ca [12:34:21] (wrong tab.) [12:36:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:52] (03PS11) 10Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [12:41:55] Deploying Recommendation API.. 
[12:41:58] (03PS1) 10Majavah: team-wmcs: Retire KernelErrors alerts [alerts] - 10https://gerrit.wikimedia.org/r/1189472 (https://phabricator.wikimedia.org/T404300) [12:42:02] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-09-15-194552-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189380 (https://phabricator.wikimedia.org/T404223) (owner: 10KartikMistry) [12:42:24] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:42:36] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6984/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [12:43:13] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:43:47] (03Merged) 10jenkins-bot: Update Recommendation API to 2025-09-15-194552-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189380 (https://phabricator.wikimedia.org/T404223) (owner: 10KartikMistry) [12:44:00] FIRING: [3x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [12:47:32] (03PS12) 10Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [12:49:00] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6985/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [12:49:23] !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 
'main' . [12:50:20] (03PS1) 10Majavah: O:wmcs::toolforge: package_builder: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1189474 [12:50:44] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6986/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188437 (https://phabricator.wikimedia.org/T404639) (owner: 10Bearloga) [12:50:57] (03PS1) 10Majavah: O:extdist: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1189475 [12:51:47] (03CR) 10Btullis: [V:03+1 C:03+2] Absent the resources in statistics::product_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1188437 (https://phabricator.wikimedia.org/T404639) (owner: 10Bearloga) [12:52:00] (03PS1) 10Majavah: P:mariadb::cloudinfra: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1189476 [12:54:13] (03PS1) 10Majavah: O:labs::lvm::mnt: Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/1189478 [12:56:36] (03CR) 10Filippo Giunchedi: [C:03+1] "\o/" [alerts] - 10https://gerrit.wikimedia.org/r/1189472 (https://phabricator.wikimedia.org/T404300) (owner: 10Majavah) [12:56:44] (03CR) 10Majavah: [C:03+2] team-wmcs: Retire KernelErrors alerts [alerts] - 10https://gerrit.wikimedia.org/r/1189472 (https://phabricator.wikimedia.org/T404300) (owner: 10Majavah) [12:57:03] (03CR) 10Filippo Giunchedi: [C:03+1] O:wmcs::toolforge: package_builder: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1189474 (owner: 10Majavah) [12:58:36] (03Merged) 10jenkins-bot: team-wmcs: Retire KernelErrors alerts [alerts] - 10https://gerrit.wikimedia.org/r/1189472 (https://phabricator.wikimedia.org/T404300) (owner: 10Majavah) [12:59:11] (03CR) 10Majavah: [C:03+2] O:wmcs::toolforge: package_builder: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1189474 (owner: 10Majavah) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. 
It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1300). [13:00:05] cmelo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] \o/ [13:01:45] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 62856 [13:03:44] I can deploy [13:03:54] (03CR) 10Zabe: [C:03+2] Release CampaignEvents extension to Wikimedia Commons - Sept 18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189267 (https://phabricator.wikimedia.org/T403667) (owner: 10Cmelo) [13:04:02] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:04:46] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:04:48] (03Merged) 10jenkins-bot: Release CampaignEvents extension to Wikimedia Commons - Sept 18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189267 (https://phabricator.wikimedia.org/T403667) (owner: 10Cmelo) [13:05:37] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1189267|Release CampaignEvents extension to Wikimedia Commons - Sept 18 (T403667)]] [13:05:41] T403667: Release CampaignEvents extension to Wikimedia Commons - Sept 18 - https://phabricator.wikimedia.org/T403667 [13:06:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:24] cmooney@cumin1003 peering (PID 3107717) is awaiting input [13:08:56] about to set up the circular replication [13:09:09] (03PS5) 10Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) [13:09:23] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11193541 (10MoritzMuehlenhoff) After setting up the new servers, we ran into a problem with syncing up current OSM data sets. Initially the assumption was an error which happene durin... [13:10:14] !log zabe@deploy1003 zabe, cmelo: Backport for [[gerrit:1189267|Release CampaignEvents extension to Wikimedia Commons - Sept 18 (T403667)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:10:22] cmelo41: is there anything specific to test here? [13:10:43] I will test it now thanks [13:11:05] yes, tested it is there thank you!!! [13:11:12] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 62856 [13:11:29] !log zabe@deploy1003 zabe, cmelo: Continuing with sync [13:11:45] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.673 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:11:45] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.850 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:13:50] (03CR) 10Zabe: [C:03+2] Initial configuration for mswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188892 (https://phabricator.wikimedia.org/T404698) (owner: 10Zabe) [13:14:42] (03Merged) 10jenkins-bot: Initial configuration for mswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188892 (https://phabricator.wikimedia.org/T404698) (owner: 10Zabe) [13:16:17] !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:16:52] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189267|Release CampaignEvents extension to Wikimedia Commons - Sept 18 (T403667)]] (duration: 11m 15s) [13:16:58] T403667: Release CampaignEvents extension to Wikimedia Commons - Sept 18 - https://phabricator.wikimedia.org/T403667 [13:17:09] cmelo41: should be live [13:17:22] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1188892|Initial configuration for mswikiquote (T404698)]] [13:17:26] T404698: Create Wikiquote Malay - https://phabricator.wikimedia.org/T404698 [13:17:51] !log ladsgroup@cumin1003 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw for all core sections [13:19:36] !log zabe@deploy1003 zabe: Backport for [[gerrit:1188892|Initial configuration for mswikiquote (T404698)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:20:00] !log zabe@deploy1003 zabe: Continuing with sync [13:21:04] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:22:57] (03PS1) 10Zabe: Activate mswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189486 [13:23:27] (03PS2) 10Zabe: Activate mswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189486 (https://phabricator.wikimedia.org/T404698) [13:24:38] !log Updated Recommendation API to 2025-09-15-194552-production (T404223. T404448. T400562) [13:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:46] T404223: Error fetching lead section size for
- https://phabricator.wikimedia.org/T404223 [13:24:46] T404448: Error fetching appendix sections - https://phabricator.wikimedia.org/T404448 [13:24:46] T400562: Create a unified Logstash dashboard displaying errors from cx, cxserver, RecommentationAPI, MinT - https://phabricator.wikimedia.org/T400562 [13:24:48] (03CR) 10Zabe: [C:03+2] Activate mswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189486 (https://phabricator.wikimedia.org/T404698) (owner: 10Zabe) [13:25:18] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188892|Initial configuration for mswikiquote (T404698)]] (duration: 07m 56s) [13:25:22] T404698: Create Wikiquote Malay - https://phabricator.wikimedia.org/T404698 [13:25:51] (03Merged) 10jenkins-bot: Activate mswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189486 (https://phabricator.wikimedia.org/T404698) (owner: 10Zabe) [13:27:46] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1189486|Activate mswikiquote (T404698)]] [13:29:30] (03PS2) 10Zabe: Initial configuration for thwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188894 (https://phabricator.wikimedia.org/T400001) [13:32:21] !log zabe@deploy1003 zabe: Backport for [[gerrit:1189486|Activate mswikiquote (T404698)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:32:26] T404698: Create Wikiquote Malay - https://phabricator.wikimedia.org/T404698 [13:33:27] !log zabe@deploy1003 zabe: Continuing with sync [13:34:54] (03PS1) 10Muehlenhoff: Add my second FIDO token [puppet] - 10https://gerrit.wikimedia.org/r/1189490 [13:34:56] 06SRE: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} - https://phabricator.wikimedia.org/T404974 (10JKelsoteel-WMF) 03NEW [13:35:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:36:31] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-backup-namenode[1001-1002].eqiad.wmnet [13:37:38] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw for all core sections [13:38:44] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189486|Activate mswikiquote (T404698)]] (duration: 10m 58s) [13:38:48] T404698: Create Wikiquote Malay - https://phabricator.wikimedia.org/T404698 [13:40:40] (03CR) 10Zabe: [C:03+2] Initial configuration for thwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188894 (https://phabricator.wikimedia.org/T400001) (owner: 10Zabe) [13:41:33] (03Merged) 10jenkins-bot: Initial configuration for thwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188894 (https://phabricator.wikimedia.org/T400001) (owner: 10Zabe) [13:42:00] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1188894|Initial configuration for thwikimedia (T400001)]] [13:42:04] T400001: Create a Wiki for Wikimedia Thailand - https://phabricator.wikimedia.org/T400001 [13:43:06] btullis@cumin1003 decommission (PID 3112898) is awaiting input [13:44:07] zabe: Once you're done, can I deploy? 
[13:44:17] (03PS1) 10Jforrester: test2wiki: Enable Wikifunctions client mode here too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189491 (https://phabricator.wikimedia.org/T397401) [13:44:51] (03CR) 10Slyngshede: [C:03+1] "Verified out of band." [puppet] - 10https://gerrit.wikimedia.org/r/1189490 (owner: 10Muehlenhoff) [13:45:03] sure [13:45:21] but I will need one more sync after this one [13:45:29] Of course, no rush. [13:45:37] There's 45 mins until the next window. [13:46:22] !log zabe@deploy1003 zabe: Backport for [[gerrit:1188894|Initial configuration for thwikimedia (T400001)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:46:31] (03PS1) 10Zabe: Actiave thwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189492 (https://phabricator.wikimedia.org/T400001) [13:47:18] !log zabe@deploy1003 zabe: Continuing with sync [13:49:48] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:50:38] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:51:06] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[13:51:53] !log imported imposm 0.14.1-3 (cherrypick of upstream fix to hopefully fix deadlock in OSM import) T381565 [13:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:58] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [13:52:34] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188894|Initial configuration for thwikimedia (T400001)]] (duration: 10m 34s) [13:52:39] T400001: Create a Wiki for Wikimedia Thailand - https://phabricator.wikimedia.org/T400001 [13:53:26] (03CR) 10Zabe: [C:03+2] Actiave thwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189492 (https://phabricator.wikimedia.org/T400001) (owner: 10Zabe) [13:54:17] (03Merged) 10jenkins-bot: Actiave thwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189492 (https://phabricator.wikimedia.org/T400001) (owner: 10Zabe) [13:54:37] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189494 [13:54:37] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189494 (owner: 10Zabe) [13:55:27] btullis@cumin1003 decommission (PID 3112898) is awaiting input [13:55:42] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189494 (owner: 10Zabe) [13:56:26] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1189492|Actiave thwikimedia (T400001)]], [[gerrit:1189494|Update interwiki cache]] [13:58:53] !log zabe@deploy1003 zabe: Backport for [[gerrit:1189492|Actiave thwikimedia (T400001)]], [[gerrit:1189494|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:58:58] T400001: Create a Wiki for Wikimedia Thailand - https://phabricator.wikimedia.org/T400001 [13:59:40] !log zabe@deploy1003 zabe: Continuing with sync [14:00:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:01:07] cdanis: swfrench-wmf I think you saw https://docs.google.com/document/d/1WvFODOmeqZdKtLSXoLFZvnMoPf6HPUnTF72ISGNK4TQ any question? [14:01:44] jynus: I just opened it, so I'll need a few minutes to get up to speed [14:04:09] jynus: no questions! quite clear :) [14:05:06] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189492|Actiave thwikimedia (T400001)]], [[gerrit:1189494|Update interwiki cache]] (duration: 08m 39s) [14:05:07] James_F: over to you [14:05:10] Thanks. [14:05:11] T400001: Create a Wiki for Wikimedia Thailand - https://phabricator.wikimedia.org/T400001 [14:05:19] (03PS2) 10Jforrester: test2wiki: Enable Wikifunctions client mode here too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189491 (https://phabricator.wikimedia.org/T397401) [14:05:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189491 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester) [14:06:34] (03Merged) 10jenkins-bot: test2wiki: Enable Wikifunctions client mode here too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189491 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester) [14:06:58] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1189491|test2wiki: Enable Wikifunctions client mode here too (T397401)]] [14:07:03] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic 
Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401 [14:08:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [14:11:01] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1189491|test2wiki: Enable Wikifunctions client mode here too (T397401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:12:52] !log jforrester@deploy1003 jforrester: Continuing with sync [14:14:49] (03CR) 10Brouberol: opensearch-operator: fix pod security settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) (owner: 10Bking) [14:16:18] (03PS1) 10Muehlenhoff: Reset maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1189499 (https://phabricator.wikimedia.org/T381565) [14:16:31] (03PS2) 10Muehlenhoff: Reset maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1189499 (https://phabricator.wikimedia.org/T381565) [14:16:52] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [14:16:53] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:17:19] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [14:18:14] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189491|test2wiki: Enable Wikifunctions client mode here too (T397401)]] (duration: 11m 16s) [14:18:19] T397401: If we follow Parsoid’s rollout and integrate 
Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401 [14:18:30] (03PS4) 10Jforrester: Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) [14:19:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11193851 (10elukey) @Jhancock.wm I tried to use the firmware upgrade cookbook but it bails out due to this: ` cp2048: SKIPPING - iDRAC version (1.20.25.0) is t... [14:20:25] (03PS1) 10Kosta Harlan: hCaptcha: Log hcaptcha.execute() events [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189500 (https://phabricator.wikimedia.org/T402767) [14:20:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [14:21:13] (03PS13) 10Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [14:21:35] (03Merged) 10jenkins-bot: Graph: Use new placeholder i18n from WikimediaMessages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [14:22:00] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1129894|Graph: Use new placeholder i18n from WikimediaMessages (T362317)]] [14:22:05] T362317: Undeploy Graph from Wikimedia production wikis - https://phabricator.wikimedia.org/T362317 [14:22:40] (03CR) 10Elukey: [C:03+1] Reset maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1189499 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:24:18] 
!log jforrester@deploy1003 jforrester: Backport for [[gerrit:1129894|Graph: Use new placeholder i18n from WikimediaMessages (T362317)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:25:03] (03PS1) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [14:26:20] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6987/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [14:26:23] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1022.eqiad.wmnet with OS bookworm [14:26:34] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [14:26:53] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [14:28:16] !log jforrester@deploy1003 jforrester: Continuing with sync [14:28:19] (03PS1) 10Andrew Bogott: cloudcephosd1022 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189504 [14:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:14] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1022 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189504 (owner: 10Andrew Bogott) [14:29:21] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189505 [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1430) [14:32:14] (03CR) 
10Muehlenhoff: [C:03+2] Reset maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1189499 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:32:34] (03PS14) 10Slyngshede: P:puppetserver::volatile Include XCheeseScore private repo [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [14:32:41] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189506 [14:33:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11193969 (10Jhancock.wm) since everything is reachable via idrac/mgmt now, i should be able to tackle that as a background task. I'll see how many i can get don... [14:33:40] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1129894|Graph: Use new placeholder i18n from WikimediaMessages (T362317)]] (duration: 11m 40s) [14:33:45] T362317: Undeploy Graph from Wikimedia production wikis - https://phabricator.wikimedia.org/T362317 [14:33:58] (03PS2) 10Jforrester: Stop loading the Graph extension anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) [14:34:08] jouncebot: nowandnext [14:34:11] For the next 0 hour(s) and 25 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1430) [14:34:11] In 0 hour(s) and 25 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1500) [14:34:36] !log upgrading Envoy on cloudweb hosts T403663 [14:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:40] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [14:35:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - 
https://phabricator.wikimedia.org/T392851#11193996 (10elukey) >>! In T392851#11193969, @Jhancock.wm wrote: > since everything is reachable via idrac/mgmt now, i should be able to tackle that as a backgr... [14:37:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm [14:37:17] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6988/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [14:37:45] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-backup-namenode[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [14:38:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-backup-namenode[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [14:38:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-backup-namenode[1001-1002].eqiad.wmnet [14:40:08] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6989/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [14:42:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11194034 (10Jhancock.wm) @elukey I can wait! wasn't trying to rush you. lemme know next week and we'll take care of it then. 
=) [14:42:34] (03CR) 10Brouberol: [C:03+1] Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - 10https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [14:43:18] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1022.eqiad.wmnet with reason: host reimage [14:43:23] (03PS1) 10Krinkle: varnish: Enable unified mobile routing on fawiki and metawiki [puppet] - 10https://gerrit.wikimedia.org/r/1189510 (https://phabricator.wikimedia.org/T403510) [14:43:42] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1189435 (owner: 10Gmodena) [14:44:31] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on fawiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189511 (https://phabricator.wikimedia.org/T403510) [14:46:39] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Test deploy [14:46:58] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Test deploy (duration: 00m 19s) [14:48:44] (03PS1) 10DCausse: cirrus-streaming-updater: consume all DCs topics in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189512 (https://phabricator.wikimedia.org/T403933) [14:49:03] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1022.eqiad.wmnet with reason: host reimage [14:49:58] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Test deploy [14:50:28] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Test deploy (duration: 00m 30s) [14:53:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures 
[14:54:39] (03PS1) 10Andrew Bogott: cloudcephosd1035 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189513 [14:54:40] (03PS1) 10Andrew Bogott: cloudcephosd1042 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189514 [14:54:40] (03PS1) 10Andrew Bogott: cloudcephosd1043 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189515 [14:54:43] (03CR) 10Ebernhardson: [C:03+1] cirrus-streaming-updater: consume all DCs topics in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189512 (https://phabricator.wikimedia.org/T403933) (owner: 10DCausse) [14:55:18] jouncebot: nowandnext [14:55:19] For the next 0 hour(s) and 4 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1430) [14:55:19] In 0 hour(s) and 4 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1500) [14:55:26] (03CR) 10Bking: [C:03+1] cirrus-streaming-updater: consume all DCs topics in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189512 (https://phabricator.wikimedia.org/T403933) (owner: 10DCausse) [14:56:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [14:57:38] (03PS1) 10Muehlenhoff: Re-re-add maps2011 as maps master [puppet] - 10https://gerrit.wikimedia.org/r/1189517 (https://phabricator.wikimedia.org/T381565) [14:58:06] !log drain cr1-codfw of traffic before work to test power supplies T401937 [14:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:11] T401937: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937 [14:58:22] (03CR) 10Jforrester: "I plan to land this next week, after the weekend for any last-minute gotchas and for the train to successfully build and continue."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [15:00:05] jeena and dduvall: gettimeofday() says it's time for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1500) [15:00:07] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: consume all DCs topics in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189512 (https://phabricator.wikimedia.org/T403933) (owner: 10DCausse) [15:01:43] (03PS2) 10Bking: opensearch-operator: fix pod security settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) [15:02:06] (03Merged) 10jenkins-bot: cirrus-streaming-updater: consume all DCs topics in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189512 (https://phabricator.wikimedia.org/T403933) (owner: 10DCausse) [15:02:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [15:03:09] (03PS1) 10Scott French: shellbox*: update flavour override comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189336 (https://phabricator.wikimedia.org/T403284) [15:03:09] (03CR) 10CI reject: [V:04-1] opensearch-operator: fix pod security settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) (owner: 10Bking) [15:03:21] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [15:03:47] (03CR) 10Elukey: [C:03+1] Re-re-add maps2011 as maps master [puppet] - 10https://gerrit.wikimedia.org/r/1189517 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:04:20] uhhh that doesn't look 
great [15:04:54] hmm [15:05:11] started right about 13:30 [15:05:12] spikes are not uncommon but this is a pretty notable period of them [15:05:34] timing lines up with zabe's backport [15:05:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:05:52] but it's hard to imagine how that's affecting it [15:06:09] Yeah, I doubt a new wiki will affect global session loss. [15:06:09] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189505 (owner: 10PipelineBot) [15:06:15] yeah, it's a pretty innocuous change [15:06:33] 13:16 was a deploy to Commons, however. [15:07:22] (03PS1) 10Elukey: redfish: improve log_entries for idrac 10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189518 (https://phabricator.wikimedia.org/T392851) [15:07:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1022.eqiad.wmnet with OS bookworm [15:08:15] hnowlan: cdanis: are you looking at the session loss metrics as well? [15:08:21] swfrench-wmf: yep [15:08:21] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [15:08:47] did we make database changes around then?
[15:08:54] for the switchover setup [15:09:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:17] swfrench-wmf: I don't think so, not certain [15:09:19] 13:17 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw for all core sections [15:09:23] Yes. [15:09:24] oh [15:09:30] heh. [15:09:33] See ops-l note from Amir1. [15:09:50] cdanis: END at 13:37 [15:09:55] 10ops-codfw, 06DC-Ops: Alert for device ps1-c5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404990 (10phaultfinder) 03NEW [15:10:04] But the rise in the session_loss is very sharp, at 13:31. [15:10:06] yeah [15:10:14] there's nothing matching on the sessionstore or cassandra dashboards [15:10:20] Would that take 16 minutes of peace and then sudden impact? [15:10:26] lemme look at etcd history to see what changed then [15:10:51] the cookbook took 20 minutes to run so there's plenty of window [15:11:00] ... hmm https://grafana.wikimedia.org/d/kUVKEvaWz/cassandra-storage?orgId=1&from=now-6h&to=now&timezone=utc&var-datasource=000000014&var-cluster=sessionstore [15:11:15] successful edit rate is consistent fwiw [15:11:16] oh I guess that kind of decrease happens pretty often actually [15:11:22] perryprog: hmmmmmmmmm [15:11:44] that is an excellent point. [15:12:25] wonder if it was a bunch of edits from one user that failed at the same time [15:12:42] SessionStore storage use is a 24hr-cycle sawtooth, yes.
[15:13:56] we have seen spikes in the past related to bot/scraper activity/misbehaviour fwiw, but this is a fairly notable one [15:14:00] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:17:53] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:18:22] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:18:53] I'm not seeing a notable spike in external errors in turnilo. The spike itself appears to have eased off [15:19:26] it's mw-web [15:20:57] is an edit save failure on mw-web even a 5xx? [15:21:30] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:21:36] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:21:57] following up, there were no notable changes in etcd at the time. I'm having a hard time aligning this with anything else (though reading through cookbook log now as well). 
[15:22:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2011.codfw.wmnet with OS bookworm [15:23:07] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:23:13] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:24:00] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:24:22] !log zabe@deploy1003:~$ mwscript createAndPromote.php --wiki=thwikimedia --bureaucrat --sysop --reason="T400001" Sarawut.Kha REDACTED [15:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:28] T400001: Create a Wiki for Wikimedia Thailand - https://phabricator.wikimedia.org/T400001 [15:24:39] * swfrench-wmf totally missed that the edit failure rate recovered at ~ 15:10 [15:24:41] swfrench-wmf: I haven't managed to track down a change in traffic, although, I am pretty unconvinced I'm looking in the right place [15:24:48] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 7 hosts with reason: reboot cr1-codfw as requested by Juniper [15:24:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194244 (10Papaul) {F66055737} [15:24:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194245 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6fac31b1-92f6-4bf9-bf95-d9862483e9b6) set by cmooney@cumin1003 f... [15:25:38] can I deploy something to wmf.19? 
[15:26:38] jouncebot: nowandnext [15:26:38] For the next 0 hour(s) and 33 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1500) [15:26:38] In 0 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1600) [15:26:39] In 0 hour(s) and 33 minute(s): DC Switchover Live-test (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1600) [15:26:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189500 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan) [15:28:19] I came here to ask about deployment too! I have SpiderPig access and would like to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1184589 . This will be my first time doing this so I just wanted to give a heads up in here in the unlikely event something goes wrong… For starters, should I wait for kostajh's deployment to finish? We are deploying to two different clusters, so maybe it doesn't matter? 
[15:28:40] musikanimal: yeah, would be best to wait [15:28:45] ok :) [15:28:54] (03Merged) 10jenkins-bot: hCaptcha: Log hcaptcha.execute() events [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189500 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan) [15:29:06] musikanimal: oh, for a -labs patch, you can just +2 it, and it gets synced out every 10 minutes (at least, that's what I recall) [15:29:19] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1189500|hCaptcha: Log hcaptcha.execute() events (T402767)]] [15:29:24] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [15:29:40] musikanimal: so you don't need to wait [15:30:11] kostajh: musikanimal: -labs patches must be fetched to the main deployment servers as well or otherwise the next person will be very confused and/or annoyed, which scap backport can do (but needs the lock) [15:30:21] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:30:22] ok [15:30:38] so I should use SpiderPig, then? [15:30:51] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:30:57] that's the easiest option these days I think [15:31:11] FIRING: PfwCoreBGPDown: ... [15:31:11] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.200) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [15:31:18] alright cool. 
I'll wait for the current deployment to finish then start mine [15:31:20] FIRING: PfwCoreBGPDown: ... [15:31:25] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.200) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [15:31:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:32:19] FIRING: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr1-codfw (10.192.254.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [15:33:45] ^^ this is related to the reboot of cr1-codfw [15:34:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:35:40] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1189500|hCaptcha: Log hcaptcha.execute() events (T402767)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:35:47] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [15:36:29] !log kharlan@deploy1003 kharlan: Continuing with sync [15:38:20] (03CR) 10Scott French: [C:03+2] shellbox*: update flavour override comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189336 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [15:39:08] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b41bbe7] (releasing): Update Jenkins version [15:39:50] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b41bbe7] (releasing): Update Jenkins version (duration: 00m 42s) [15:40:30] (03Merged) 10jenkins-bot: shellbox*: update flavour override comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189336 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [15:41:40] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189500|hCaptcha: Log hcaptcha.execute() events (T402767)]] (duration: 12m 20s) [15:41:45] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [15:41:51] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:42:19] FIRING: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr1-codfw (10.192.254.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [15:42:23] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:43:37] musikanimal: over to you [15:43:43] ty! 
[15:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:44:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 208.80.153.210 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:44:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184589 (owner: 10MusikAnimal) [15:45:00] (03Merged) 10jenkins-bot: labs: log CommunityRequests channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184589 (owner: 10MusikAnimal) [15:45:35] (03CR) 10JHathaway: redfish: improve log_entries for idrac 10 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1189518 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [15:45:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:45:53] annnd it's done! okay, that was *ridiculously* easy…!!! [15:46:07] bless the authors of SpiderPig. Bless them ❤️ [15:46:11] RESOLVED: PfwCoreBGPDown: ... 
[15:46:11] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.200) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [15:46:11] RESOLVED: PfwCoreBGPDown: ... [15:46:17] Fundraising Firewall core BGP session down between pfw1-codfw and (null) (208.80.153.200) - group Production - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=pfw1-codfw:9804&var-bgp_group=Production&var-bgp_neighbor=(null) - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [15:46:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:47:19] RESOLVED: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr1-codfw (10.192.254.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [15:49:08] musikanimal: tell your friends :D [15:49:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 208.80.153.210 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:54:32] (03PS2) 10Andrew Bogott: cloudcephosd1035 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189513 
[15:54:32] (03PS2) 10Andrew Bogott: cloudcephosd1042 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189514 [15:54:32] (03PS2) 10Andrew Bogott: cloudcephosd1043 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189515 [15:54:33] (03PS1) 10Andrew Bogott: codfw1dev horizon: profile::openstack::codfw1dev::horizon_version: 'epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1189526 [15:56:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:57:11] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev horizon: profile::openstack::codfw1dev::horizon_version: 'epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1189526 (owner: 10Andrew Bogott) [15:59:28] (03PS3) 10Daniel Kinzler: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 [16:00:04] jhathaway and moritzm: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:05] jasmine_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for DC Switchover Live-test . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1600). 
[16:01:05] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:01:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:03:40] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:04:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11194406 (10Papaul) out put of todays' troubleshooting Last login: Tue May 20 13:04:15 on ttyu0 --- JUNOS 23.4R2.13 Kernel 64-bit JNPR-12... [16:05:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:07:36] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404990#11194432 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:07:52] (03PS2) 10MusikAnimal: tables-catalog: add CommunityRequests tables [puppet] - 10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) [16:08:10] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11194451 (10Jhancock.wm) [16:08:32] (03CR) 10MusikAnimal: tables-catalog: add CommunityRequests tables (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: 10MusikAnimal) [16:10:12] (03CR) 10CI reject: [V:04-1] tables-catalog: add CommunityRequests tables [puppet] - 10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: 10MusikAnimal) [16:10:19] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from codfw to eqiad [16:10:33] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from codfw to eqiad [16:10:38] ^ live-test ongoing [16:11:13] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from codfw to eqiad [16:11:31] No deploys until the all-clear is given :) [16:12:31] (03CR) 10MusikAnimal: "I also don't know why the test failed. I assume because the tables don't actually exist yet?" [puppet] - 10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: 10MusikAnimal) [16:12:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:13:10] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:13:18] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:14:06] (03CR) 10Majavah: tables-catalog: add CommunityRequests tables (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: 10MusikAnimal) [16:15:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group 
Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:16:02] !log jasmine@deploy1003 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover - T399891 [16:16:07] T399891: 🚀 Southward Datacenter Switchover (Sept. 2025) - https://phabricator.wikimedia.org/T399891 [16:16:45] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [16:16:56] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from codfw to eqiad [16:17:04] FYI, we're going to be entering the more critical part of the live test shortly, so we have temporarily taken the scap lock to prevent overlapping deployments from starting [16:17:30] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [16:17:47] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from codfw to eqiad [16:17:48] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad [16:19:18] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from codfw to eqiad [16:19:18] !log jasmine@cumin1003 [DRY-RUN] MediaWiki read-only period starts at: 2025-09-18 16:19:18.465479 [16:19:37] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from codfw to eqiad [16:19:43] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from codfw to eqiad [16:20:33] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from codfw to eqiad [16:20:41] !log jasmine@cumin1003 START - Cookbook 
sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from codfw to eqiad [16:20:55] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from codfw to eqiad [16:21:02] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from codfw to eqiad [16:21:07] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad [16:21:17] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from codfw to eqiad [16:21:21] !log jasmine@cumin1003 [DRY-RUN] MediaWiki read-only period ends at: 2025-09-18 16:21:21.591133 [16:21:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184797 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [16:21:23] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad [16:22:30] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from codfw to eqiad [16:22:31] !log root@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: sync [16:22:54] !log root@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: sync [16:22:56] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from codfw to eqiad [16:23:19] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from codfw to eqiad [16:23:19] !log root@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [16:23:23] !log 
root@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [16:23:25] !log root@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:23:30] !log root@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:23:32] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad [16:23:40] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from codfw to eqiad [16:24:11] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from codfw to eqiad [16:24:36] !log jasmine@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from codfw to eqiad [16:25:21] (03CR) 10Ryan Kemper: "We point them to 443 on `wdqs-main` and `wdqs-scholarly`, which probably implies that it's fine to do so. Although interestingly it does p" [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 10Btullis) [16:27:00] (03PS1) 10Andrew Bogott: Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) [16:28:41] (03PS2) 10Andrew Bogott: Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) [16:32:29] !log jasmine@deploy1003 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover - T399891 (duration: 16m 26s) [16:32:33] T399891: 🚀 Southward Datacenter Switchover (Sept. 
2025) - https://phabricator.wikimedia.org/T399891 [16:33:53] (03PS3) 10Andrew Bogott: Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) [16:35:36] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:36:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:36:49] !log jasmine@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from codfw to eqiad [16:44:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [16:47:15] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-09-18-122940-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189531 [16:48:25] Hi jeena & dduvall , just an FYI for MW train deployment. I just learned that a change that is rolling out will cause event validation errors for the mediawiki.ip_reputation.score event stream. [16:48:25] Fix is here: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1189530 [16:48:26] Without fix, you'll see about 0.5 new errors in logs. I'm not sure how urgent this fix is. 
kostajh can comment on if we need to backport asap. [16:48:41] https://phabricator.wikimedia.org/T403664#11194614 [16:49:52] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-09-18-122940-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189531 (owner: 10BryanDavis) [16:51:28] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-09-18-122940-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189531 (owner: 10BryanDavis) [16:56:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [16:59:38] ottomata: I can backport it before deploying [17:00:05] jasmine_: Time to do the DC Switchover Live-test deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1600). [17:00:05] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1700). 
[17:00:38] If needed [17:00:57] * bd808 has a developer-portal container to ship in his window today [17:01:05] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:01:23] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:01:31] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:01:50] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:02:13] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:02:30] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:04:13] * bd808 is done with his window [17:06:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:10:03] (03PS1) 10Cwhite: logstash: drop 99% eventgate-analytics-external logs [puppet] - 10https://gerrit.wikimedia.org/r/1189534 (https://phabricator.wikimedia.org/T390215) [17:11:32] (03PS1) 10Jforrester: wikifunctions: Set ORCHESTRATOR_HEAP_SIZE for later tweaking [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189535 (https://phabricator.wikimedia.org/T403094) [17:13:38] (03PS2) 10Cwhite: logstash: drop 99% eventgate-analytics-external logs [puppet] - 10https://gerrit.wikimedia.org/r/1189534 (https://phabricator.wikimedia.org/T390215) [17:14:20] ottomata: let's backport it [17:14:48] Merge to master first, please. 
:-) [17:15:11] !log upgrading envoyproxy on aphlict1002 (active phab notifications) and contint2002 (active CI) T403663 [17:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:16] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [17:15:33] (03PS1) 10Kosta Harlan: Fix ip_reputation.score validation errors in production [extensions/WikimediaEvents] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189536 (https://phabricator.wikimedia.org/T403664) [17:15:52] ottomata jeena: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1189536 would either of you be able to backport and validate the fix? [17:16:34] (03CR) 10Cwhite: [C:03+2] logstash: drop 99% eventgate-analytics-external logs [puppet] - 10https://gerrit.wikimedia.org/r/1189534 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [17:16:37] kostajh: ^^^ Merge to master first, verify, then cherry-pick/deploy as normal, maybe? [17:16:51] James_F: I've +2'ed the patch. [17:17:21] Yes, but until it lands the cherry-pick lacks the as-landed hash for SBOM checking etc. 
[17:17:27] (03PS16) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [17:19:00] !log upgrading envoyproxy on lists1004 (active lists server) T403663 [17:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:18] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [17:26:50] (03PS17) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [17:30:15] (03PS2) 10Jforrester: Fix ip_reputation.score validation errors in production [extensions/WikimediaEvents] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189536 (https://phabricator.wikimedia.org/T403664) (owner: 10Kosta Harlan) [17:30:19] (03CR) 10Jforrester: [C:03+1] Fix ip_reputation.score validation errors in production [extensions/WikimediaEvents] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189536 (https://phabricator.wikimedia.org/T403664) (owner: 10Kosta Harlan) [17:32:43] !log upgrading envoyproxy on vrts1003 (active ticket.wikimedia.org ) T403663 [17:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:47] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [17:34:05] (03CR) 10Dzahn: [V:04-1] "Could not find template 'profile/initscripts/zuul/nodepool.systemd.erb'" [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [17:35:17] (03PS10) 10Dzahn: zuul: add systemd service for nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) [17:35:34] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python 
requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [17:41:16] kostajh: jeena James_F I'm here now (sorry was in meetings) [17:42:46] kostajh: are ip_reputation events emitted in beta and can we test there to be sure? [17:44:16] should I schedule a backport? I see we have the cherry-pick https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1189536 [17:44:35] (03PS11) 10Dzahn: zuul: add systemd service for nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) [17:46:55] (03PS3) 10Bking: opensearch-operator: fix pod security settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) [17:47:21] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189539 [17:50:12] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1180187/6992/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [17:50:14] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul: add systemd service for nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [17:50:56] ottomata: no you cannot test it in beta labs [17:51:13] The IPoid service is only accessible from production [17:51:29] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on fawiki and metawiki [puppet] - 10https://gerrit.wikimedia.org/r/1189510 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [17:55:16] (03PS18) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [17:55:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade 
Tracking - https://phabricator.wikimedia.org/T404609#11195081 (10RobH) [17:58:38] ottomata: kostajh I can backport during the train window. What are we looking for to test? [18:00:05] jeena and dduvall: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T1800). [18:01:31] (03PS1) 10TChin: Revert "[eventgate-*] Bump to v1.22.0 (service-utils)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189540 [18:01:51] jeena: I don't know how to manually test e.g. on mwdebug, but if it works then https://grafana.wikimedia.org/goto/thfzU1CNR?orgId=1 will drop back to 0. [18:01:52] However, note there is a completely unrelated deploy of eventgate-analytics-external :45 mins ago that is messing with metrics atm. [18:01:53] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2001.codfw.wmnet with reason: WIP [18:01:55] :p [18:02:35] it is broken right now anyway, and that patch isn't going to break it any worse, so if you could just backport it that would be helpful [18:02:37] i am here to watch. [18:02:45] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: WIP [18:02:48] better for me now than later near the end of my day [18:03:42] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [18:03:59] (03CR) 10Ottomata: [C:03+2] Revert "[eventgate-*] Bump to v1.22.0 (service-utils)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189540 (owner: 10TChin) [18:04:05] (03CR) 10Bking: "If we're pointing to 443 on for `wdqs-main` and `wdqs-scholarly` and everything works OK, then I guess it's OK to change the port. 
I defin" [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 10Btullis) [18:04:14] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D DC Ops host migrations - https://phabricator.wikimedia.org/T405021 (10RobH) 03NEW [18:04:37] (03CR) 10Bking: [C:03+1] Switch the wdqs-internal services from http to https [puppet] - 10https://gerrit.wikimedia.org/r/1187772 (https://phabricator.wikimedia.org/T193473) (owner: 10Btullis) [18:06:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11195144 (10RobH) [18:06:03] (03Merged) 10jenkins-bot: Revert "[eventgate-*] Bump to v1.22.0 (service-utils)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189540 (owner: 10TChin) [18:06:41] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [18:06:52] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [18:07:26] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [18:07:34] (03PS1) 10Dzahn: zuul::main: fix line breaks in systemd service template [puppet] - 10https://gerrit.wikimedia.org/r/1189544 (https://phabricator.wikimedia.org/T401614) [18:07:55] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [18:08:00] ottomata: okay [18:08:24] (03CR) 10Dzahn: [C:03+2] zuul::main: fix line breaks in systemd service template [puppet] - 10https://gerrit.wikimedia.org/r/1189544 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [18:09:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189536 (https://phabricator.wikimedia.org/T403664) (owner: 10Kosta Harlan) [18:10:40] (03PS1) 10Scott French: Only present rename icon on 
eligible entities [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1189545 [18:10:58] (03Merged) 10jenkins-bot: Fix ip_reputation.score validation errors in production [extensions/WikimediaEvents] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189536 (https://phabricator.wikimedia.org/T403664) (owner: 10Kosta Harlan) [18:11:26] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1189536|Fix ip_reputation.score validation errors in production (T403664)]] [18:11:32] T403664: EventBus - Add central user id to MediaWiki events - https://phabricator.wikimedia.org/T403664 [18:12:22] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [18:12:33] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [18:12:43] (03CR) 10BCornwall: [C:03+1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 (owner: 10Krinkle) [18:15:36] (03CR) 10Scott French: [V:03+2] "Tested locally at c0d247c17111c5edc75b00db769c98a898ebbd93." 
[software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1189545 (owner: 10Scott French) [18:15:39] (03CR) 10Scott French: [V:03+2 C:03+2] Only present rename icon on eligible entities [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1189545 (owner: 10Scott French) [18:17:17] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: Only present rename icon on eligible entities - swfrench@cumin2002" [18:17:19] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Only present rename icon on eligible entities - swfrench@cumin2002 [18:17:57] !log jhuneidi@deploy1003 kharlan, jhuneidi: Backport for [[gerrit:1189536|Fix ip_reputation.score validation errors in production (T403664)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:18:02] T403664: EventBus - Add central user id to MediaWiki events - https://phabricator.wikimedia.org/T403664 [18:18:13] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Only present rename icon on eligible entities - swfrench@cumin2002 [18:18:15] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: Only present rename icon on eligible entities - swfrench@cumin2002" [18:19:37] (03PS4) 10Bking: opensearch-operator: fix pod security settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) [18:22:52] (03CR) 10Bking: opensearch-operator: fix pod security settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) (owner: 10Bking) [18:23:23] !log jhuneidi@deploy1003 kharlan, jhuneidi: 
Continuing with sync [18:23:32] (03PS5) 10Bking: opensearch-operator: fix pod security settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) [18:25:47] (03PS1) 10Andrew Bogott: Add dummy certs for paws VM access [labs/private] - 10https://gerrit.wikimedia.org/r/1189547 [18:27:15] (03PS1) 10Andrew Bogott: Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) [18:27:28] (03PS1) 10Arlolra: Deploy Parsoid Read Views to 28 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189550 (https://phabricator.wikimedia.org/T405016) [18:27:51] (03CR) 10CI reject: [V:04-1] Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) (owner: 10Andrew Bogott) [18:28:43] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189536|Fix ip_reputation.score validation errors in production (T403664)]] (duration: 17m 17s) [18:28:49] T403664: EventBus - Add central user id to MediaWiki events - https://phabricator.wikimedia.org/T403664 [18:29:10] watching https://grafana.wikimedia.org/goto/tzqfQJCNg?orgId=1 [18:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:29:51] ottomata: shall I go ahead and roll the train forward? [18:32:09] jeena: yes it looks good to me thank you. [18:33:38] ok thanks! 
[18:34:05] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189551 (https://phabricator.wikimedia.org/T396380) [18:34:07] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189551 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [18:35:00] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189551 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [18:36:44] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11195295 (10Papaul) update from Juniper after our phone call today. ` Hello Teams, ​ Thank you for your time on our call. ​ During our call w... [18:39:36] (03PS6) 10Bking: opensearch-operator: fix pod security settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) [18:43:21] (03CR) 10Bking: [C:03+2] opensearch-operator: fix pod security settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) (owner: 10Bking) [18:44:22] (03CR) 10Bking: [C:03+2] opensearch-operator: fix pod security settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) (owner: 10Bking) [18:45:06] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.19 refs T396380 [18:45:11] T396380: 1.45.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T396380 [18:46:51] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 62970760 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:47:51] RECOVERY - Postgres Replication Lag on 
puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3752848 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:02:17] (03PS2) 10Andrew Bogott: Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) [19:02:17] (03PS4) 10Andrew Bogott: Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) [19:02:43] (03CR) 10CI reject: [V:04-1] Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) (owner: 10Andrew Bogott) [19:04:07] (03PS3) 10Andrew Bogott: Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) [19:04:07] (03PS5) 10Andrew Bogott: Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) [19:04:34] (03CR) 10CI reject: [V:04-1] Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) (owner: 10Andrew Bogott) [19:05:42] (03PS4) 10Andrew Bogott: Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) [19:05:42] (03PS6) 10Andrew Bogott: Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) [19:05:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) (owner: 10Andrew Bogott) [19:08:51] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Add dummy certs for paws VM access [labs/private] - 10https://gerrit.wikimedia.org/r/1189547 (owner: 10Andrew 
Bogott) [19:09:14] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) (owner: 10Andrew Bogott) [19:10:13] (03PS5) 10Andrew Bogott: Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) [19:10:18] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) (owner: 10Andrew Bogott) [19:14:59] (03PS6) 10Andrew Bogott: Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) [19:14:59] (03PS7) 10Andrew Bogott: Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) [19:15:05] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) (owner: 10Andrew Bogott) [19:18:00] (03CR) 10Andrew Bogott: [C:03+2] Install paws worker access keys in /etc/magnum [puppet] - 10https://gerrit.wikimedia.org/r/1189549 (https://phabricator.wikimedia.org/T405023) (owner: 10Andrew Bogott) [19:20:42] (03PS1) 10Dwisehaupt: crm: Update civicrm settings template for v6.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/1188863 (https://phabricator.wikimedia.org/T404757) [19:21:14] (03CR) 10JHathaway: [C:03+2] acme_chief: purge old certs [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [19:38:13] (03CR) 10Dzahn: "I am afraid I can't do an actually meaningful code review on this. 
But what I can do is to suggest to test it first on a VM that isn't get" [puppet] - 10https://gerrit.wikimedia.org/r/1188863 (https://phabricator.wikimedia.org/T404757) (owner: 10Dwisehaupt) [19:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:52:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11195545 (10cmooney) > It seems to be a part of the router's design. Can't believe they did all that, made us drain it and reboot and even r... [19:57:39] (03PS1) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [19:57:46] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11195548 (10Krinkle) [19:59:03] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T2000). nyaa~ [20:00:06] No Gerrit patches in the queue for this window AFAICS. 
[20:00:58] (03PS2) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [20:02:20] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:03:46] (03PS8) 10Andrew Bogott: Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) [20:03:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) (owner: 10Andrew Bogott) [20:07:33] (03CR) 10Andrew Bogott: [C:03+2] Octavia: add policy overrides for read endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1189529 (https://phabricator.wikimedia.org/T404862) (owner: 10Andrew Bogott) [20:15:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:17:31] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bookworm [20:18:46] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1035 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189513 (owner: 10Andrew Bogott) [20:19:23] (03CR) 10BCornwall: [C:03+1] wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:20:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:27:26] (03PS2) 10Dwisehaupt: crm: Update civicrm settings template for v6.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/1188863 (https://phabricator.wikimedia.org/T404757) [20:28:42] (03CR) 10BCornwall: [C:03+1] wdqs: (step 3) shift service state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:36:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:39:18] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage [20:39:54] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11195697 (10EMill-WMF) @Aklapper This: https://www.mediawiki.org/wiki/Product_Analytics/Superset_Access#Requesting_access Specifically, the `this Phabricator form` link. 
[20:43:25] (03PS1) 10DCausse: cirrus-streaming-updater: test flink bookworm base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189575 [20:43:58] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage [20:44:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [20:44:01] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11195721 (10Dzahn) @EMill-WMF You do not have shell access (and for this type of request it doesn't actually matter much or at all for us, aware the template asks you for it though). Now regarding your... [20:45:20] (03PS3) 10Ryan Kemper: wdqs: (step 3) shift service state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772) [20:45:20] (03PS2) 10Ryan Kemper: wdqs: (step 4) remove from LBs and wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1182977 (https://phabricator.wikimedia.org/T395772) [20:45:20] (03PS2) 10Ryan Kemper: wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 (https://phabricator.wikimedia.org/T395772) [20:45:35] (03CR) 10CI reject: [V:04-1] wdqs: (step 3) shift service state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:45:37] (03CR) 10CI reject: [V:04-1] wdqs: (step 4) remove from LBs and wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1182977 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:45:40] (03CR) 10CI reject: [V:04-1] wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 
(https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:47:03] (03PS4) 10Ryan Kemper: wdqs: (step 3) shift service state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772) [20:47:03] (03PS3) 10Ryan Kemper: wdqs: (step 4) remove from LBs and wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1182977 (https://phabricator.wikimedia.org/T395772) [20:47:03] (03PS3) 10Ryan Kemper: wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 (https://phabricator.wikimedia.org/T395772) [20:47:11] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11195725 (10Novem_Linguae) > Hi - please grant me access to the analytics-privatedata-users group. Since you are already a member of wmf, might make sense to rename this task to "Grant Access to analyti... [20:47:17] (03CR) 10CI reject: [V:04-1] wdqs: (step 3) shift service state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:47:20] (03CR) 10CI reject: [V:04-1] wdqs: (step 4) remove from LBs and wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1182977 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:47:24] (03CR) 10CI reject: [V:04-1] wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:50:18] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: test flink bookworm base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189575 (owner: 10DCausse) [20:50:21] (03PS3) 10Ryan Kemper: wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) [20:51:02] (03PS4) 10Ryan Kemper: wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 
(https://phabricator.wikimedia.org/T395772) [20:51:10] (03CR) 10CI reject: [V:04-1] wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:51:28] (03CR) 10BCornwall: [C:03+2] wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:52:11] (03Merged) 10jenkins-bot: cirrus-streaming-updater: test flink bookworm base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189575 (owner: 10DCausse) [20:54:04] (03PS4) 10Ryan Kemper: wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) [20:54:48] (03CR) 10CI reject: [V:04-1] wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:55:05] (03CR) 10BCornwall: [C:03+2] wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:56:25] (03PS5) 10Ryan Kemper: wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) [20:56:31] (03CR) 10BCornwall: [C:04-2] "Copied votes on follow-up patch sets have been updated:" [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:57:17] (03PS5) 10Ryan Kemper: wdqs: (step 3) shift service state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772) [20:57:17] (03PS4) 10Ryan Kemper: wdqs: (step 4) remove from LBs and wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1182977 (https://phabricator.wikimedia.org/T395772) [20:57:17] (03PS5) 10Ryan Kemper: 
wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 (https://phabricator.wikimedia.org/T395772) [20:57:32] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:57:43] (03CR) 10BCornwall: [C:03+2] wdqs: (step 2) remove wdqs discovery dns records [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [20:57:45] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:58:02] !log brett@dns1004 START - running authdns-update [20:59:20] There will be some alerts for wdqs network probe failures, those are to be expected [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250918T2100) [21:00:12] (03CR) 10BCornwall: [C:03+2] wdqs: (step 3) shift service state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:02:23] (03PS1) 10DCausse: Revert "cirrus-streaming-updater: test flink bookworm base image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189579 [21:02:46] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11195769 (10Dzahn) Oh, if that's the case then this is a different type of request. 
[21:04:42] (03PS3) 10Bking: opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) [21:05:56] (03CR) 10DCausse: [C:03+2] Revert "cirrus-streaming-updater: test flink bookworm base image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189579 (owner: 10DCausse) [21:06:04] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add WMF-specific chart code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189566 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:07:42] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11195771 (10Dzahn) The docs say membership in wmf should be enough to log into superset. Can you not log into it? Or can you log into it but this request is specifically about seeing more private data... [21:07:43] (03Merged) 10jenkins-bot: Revert "cirrus-streaming-updater: test flink bookworm base image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189579 (owner: 10DCausse) [21:08:19] !log brett@dns1004 END - running authdns-update [21:09:24] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:09:36] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:12:20] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11195784 (10Dzahn) In this context, please see https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#analytics-privatedata-users where it talks about `one of the more confusing groups as... 
[21:13:16] (03PS1) 10Andrew Bogott: cloudcephosd1035: correct nic names [puppet] - 10https://gerrit.wikimedia.org/r/1189581 [21:13:51] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1035: correct nic names [puppet] - 10https://gerrit.wikimedia.org/r/1189581 (owner: 10Andrew Bogott) [21:14:29] (03CR) 10BCornwall: [C:03+2] wdqs: (step 4) remove from LBs and wdqs backends [puppet] - 10https://gerrit.wikimedia.org/r/1182977 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:15:43] If any pybal errors show up they're expected [21:17:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bookworm [21:18:34] !log Restarting pybal on secondary eqiad/codfw lvs servers - T395772 [21:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:39] T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772 [21:20:50] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:21:06] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:21:13] ^expected [21:21:32] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:22:06] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:22:50] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:23:46] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:24:01] FIRING: [2x] SystemdUnitFailed: 
prometheus_ferm_mss.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:25:37] !log Restarting pybal on low-traffic eqiad/codfw lvs servers - T395772 [21:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:42] T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772 [21:26:00] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm [21:30:45] !log Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-secondary-codfw - T395772 [21:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:49] T395772: Teardown lvs for wdqs public pool - https://phabricator.wikimedia.org/T395772 [21:32:07] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:32:07] !log Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-secondary-eqiad - T395772 [21:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:39] !log Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-low-traffic-codfw - T395772 [21:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:00] FIRING: [4x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:21] !log Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-low-traffic-eqiad - T395772 [21:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:58] (03CR) 10BCornwall: 
[C:03+2] wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:35:47] (03PS6) 10Ryan Kemper: wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 (https://phabricator.wikimedia.org/T395772) [21:35:56] (03CR) 10BCornwall: [C:03+2] wdqs: (steps 5,6) => final removal [puppet] - 10https://gerrit.wikimedia.org/r/1182978 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [21:36:05] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:36:31] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:37:49] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:39:00] FIRING: [8x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:43:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-heavy-queries.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:44:00] FIRING: [10x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:49] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage [21:47:05] (03PS3) 
10MusikAnimal: tables-catalog: add CommunityRequests tables [puppet] - 10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) [21:48:08] (03PS1) 10Ryan Kemper: s/geneated/generated [puppet] - 10https://gerrit.wikimedia.org/r/1189585 [21:48:25] (03PS2) 10Ryan Kemper: Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/1189585 [21:48:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage [21:48:36] (03CR) 10MusikAnimal: tables-catalog: add CommunityRequests tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: 10MusikAnimal) [21:48:41] FIRING: [12x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-heavy-queries.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:48:52] (03PS3) 10Ryan Kemper: Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/1189585 [21:49:00] FIRING: [12x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:49:29] (03CR) 10MusikAnimal: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1188413 (https://phabricator.wikimedia.org/T403559) (owner: 10MusikAnimal) [22:03:41] RESOLVED: [12x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_wdqs-heavy-queries.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:06:34] (03PS1) 10Jasmine: wmnet: update CNAME records for DB masters to codfw [dns] - 
10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) [22:07:15] (03CR) 10CI reject: [V:04-1] wmnet: update CNAME records for DB masters to codfw [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [22:07:31] (03CR) 10Jasmine: "To confirm: we no longer have an entry for x2 in noc/dbconfig [1] due to recent work to remove x2 [1]" [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [22:09:41] (03CR) 10Jasmine: "ms3-master (db1153.eqiad.wmnet) appears to be new, added in [1] I’ve updated it to db2143 according to dbconfig/codfw.json, although let m" [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [22:13:23] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3074.* [22:13:36] ^ignore [22:15:23] !log ryankemper@cumin1002 conftool action : GET; selector: name=wdqs2009.codfw.wmnet [22:15:39] (03CR) 10Jasmine: "Edit: Disregard this actually. ms3 in fact is not new and was included in the last switchover. 
Updated accordingly" [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) (owner: 10Jasmine) [22:21:00] (03PS3) 10Andrew Bogott: cloudcephosd1042 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189514 [22:21:00] (03PS3) 10Andrew Bogott: cloudcephosd1043 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189515 [22:21:32] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1042 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189514 (owner: 10Andrew Bogott) [22:28:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1042.eqiad.wmnet with OS bookworm [22:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:48] (03PS2) 10Jasmine: wmnet: update CNAME records for DB masters to codfw [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) [22:37:49] (03PS3) 10Jasmine: wmnet: update CNAME records for DB masters to codfw [dns] - 10https://gerrit.wikimedia.org/r/1189587 (https://phabricator.wikimedia.org/T399891) [22:40:27] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bookworm [22:48:55] (03PS4) 10Andrew Bogott: cloudcephosd1043 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189515 [22:49:26] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1043 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189515 (owner: 10Andrew Bogott) [22:54:00] FIRING: [12x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[22:54:01] FIRING: [12x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:57:51] (03CR) 10Ryan Kemper: [C:03+2] Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/1189585 (owner: 10Ryan Kemper) [22:59:01] FIRING: [12x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:15] (03PS1) 10Ryan Kemper: wdqs: remove ipip [puppet] - 10https://gerrit.wikimedia.org/r/1189595 (https://phabricator.wikimedia.org/T395772) [22:59:30] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1189595 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [22:59:43] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1043.eqiad.wmnet with reason: host reimage [23:00:00] (03CR) 10BCornwall: [C:03+2] wdqs: remove ipip [puppet] - 10https://gerrit.wikimedia.org/r/1189595 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [23:03:50] (03PS1) 10Ryan Kemper: Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/1189596 [23:04:05] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1043.eqiad.wmnet with reason: host reimage [23:06:17] (03PS1) 10Jasmine: geo-maps: update map default to list codfw first [dns] - 10https://gerrit.wikimedia.org/r/1189598 (https://phabricator.wikimedia.org/T399891) [23:09:00] FIRING: [12x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:09:05] FIRING: [12x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:14:00] FIRING: [12x] SystemdUnitFailed: prometheus_ferm_mss.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:14:26] (03CR) 10Ryan Kemper: [C:03+2] Fix inconsequential typos [puppet] - 10https://gerrit.wikimedia.org/r/1189596 (owner: 10Ryan Kemper) [23:16:18] (03PS1) 10Ryan Kemper: wdqs: shift old wdqs-public hosts to test [puppet] - 10https://gerrit.wikimedia.org/r/1189600 (https://phabricator.wikimedia.org/T395772) [23:22:46] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1043.eqiad.wmnet with OS bookworm [23:24:28] (03CR) 10Ryan Kemper: [C:03+1] wdqs: shift old wdqs-public hosts to test [puppet] - 10https://gerrit.wikimedia.org/r/1189600 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [23:24:30] (03CR) 10Ryan Kemper: [C:03+2] wdqs: shift old wdqs-public hosts to test [puppet] - 10https://gerrit.wikimedia.org/r/1189600 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [23:26:54] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2016.codfw.wmnet with OS bullseye [23:27:12] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs2016 [23:28:33] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [23:31:55] (03PS1) 10Dzahn: zuul::main: let the container access nodepool on the host machine [puppet] - 10https://gerrit.wikimedia.org/r/1189602 
(https://phabricator.wikimedia.org/T401614) [23:32:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [23:32:22] (03PS2) 10Dzahn: zuul::main: let nodepool connect to zookeeper on the host machine [puppet] - 10https://gerrit.wikimedia.org/r/1189602 (https://phabricator.wikimedia.org/T401614) [23:33:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:33:48] !log upgrading envoyproxy on production phabricator (phab1004) - T403663 [23:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:53] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [23:34:15] ryankemper@cumin2002 reimage (PID 3999830) is awaiting input [23:35:16] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2016 - ryankemper@cumin2002" [23:35:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs2016 - ryankemper@cumin2002" [23:35:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:35:22] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs2016.codfw.wmnet 193.16.192.10.in-addr.arpa 3.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [23:35:25] !log ryankemper@cumin2002 END 
(PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs2016.codfw.wmnet 193.16.192.10.in-addr.arpa 3.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [23:35:26] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs2016 [23:35:41] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs2016 [23:35:41] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs2016 [23:36:32] 06SRE, 06collaboration-services, 10envoy, 06serviceops: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11196043 (10Dzahn) All services owned by collaboration-services have been upgraded to 1.29.12-1. ` sudo cumin 'A:owner-collaboration-services' 'dpkg -l | grep envoyproxy' .. 45... [23:38:27] FIRING: [4x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1189603 [23:38:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1189603 (owner: 10TrainBranchBot) [23:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:45:35] (03PS1) 10Dzahn: zuul::executor: let executor connect to zookeeper on the host machine [puppet] - 10https://gerrit.wikimedia.org/r/1189604 (https://phabricator.wikimedia.org/T403847) [23:48:27] FIRING: [4x] ProbeDown: Service wdqs1019:443 has failed 
probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:53:20] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage [23:53:27] RESOLVED: [6x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:54:14] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1189603 (owner: 10TrainBranchBot) [23:58:42] FIRING: [8x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:59:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2016.codfw.wmnet with reason: host reimage