[00:03:11] (03PS2) 10RLazarus: deployment_server: Add a script for mass-deploying helmfile services [puppet] - 10https://gerrit.wikimedia.org/r/1188456 (https://phabricator.wikimedia.org/T380211) [00:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:04:38] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [00:04:46] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [00:05:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:08:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 [00:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 (owner: 10TrainBranchBot) [00:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:08:50] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: apply [00:08:59] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [00:09:14] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [00:11:50] !log 
rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [00:13:23] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [00:13:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [00:13:53] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [00:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:17:15] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:19:46] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [00:19:50] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [00:20:31] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [00:20:40] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [00:20:52] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [00:21:00] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [00:21:11] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/push-notifications: apply [00:21:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [00:22:06] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [00:22:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [00:22:19] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/recommendation-api: apply [00:22:27] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [00:22:36] !log 
rzl@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply [00:22:52] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [00:23:08] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [00:23:13] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [00:23:31] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [00:23:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [00:23:53] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [00:23:57] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [00:24:06] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [00:24:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [00:24:21] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [00:24:51] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [00:28:05] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 (owner: 10TrainBranchBot) [00:28:23] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [00:28:31] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [00:29:20] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [00:29:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [00:29:45] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [00:29:58] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [00:42:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold 
for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:52:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:57:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:08:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) [01:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [01:23:32] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [01:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad 
in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:44:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183385 (10phaultfinder) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0200) [02:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:22:08] (03PS1) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:22:38] (03CR) 10CI reject: [V:04-1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 (owner: 10Krinkle) [02:23:31] (03PS2) 10Krinkle: varnish: add 
support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:23:57] (03CR) 10CI reject: [V:04-1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 (owner: 10Krinkle) [02:24:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183413 (10phaultfinder) [02:26:01] (03PS3) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:28:54] (03PS4) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:30:06] (03PS5) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:31:04] (03PS6) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:31:23] (03PS7) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:32:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11183419 (10Papaul) @cmooney we have the spare PEM on site. I need to get on a call with Juniper to troubleshoot this. Do you think Thursd... 
[02:34:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:34:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:35:17] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11183420 (10Papaul) [02:36:29] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11183421 (10Papaul) 05Open→03Resolved a:03Papaul The BIO reader is installed now and working. so closing this task [02:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:37:15] (03PS1) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) [02:39:28] (03PS2) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) [02:43:58] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:43:58] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:50:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183429 (10phaultfinder) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0300) [03:23:59] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:24:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183455 (10phaultfinder) [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0400) [04:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:04:18] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.16 (duration: 04m 08s) [04:05:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:24:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183464 (10phaultfinder) [04:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:02:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:53] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:58] 
FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:03:08] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:07:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:08:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:17:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:27:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:30:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:32:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:33:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:37:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:38:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:46:39] (03PS1) 10Huei Tan: xLab: Update the PageVisit target wiki for MinT readers [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) [05:47:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [05:47:20] (03Restored) 10Huei Tan: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [05:47:29] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [05:48:54] Hi, I have 2 patches for a later backport; Kartik is not available, can someone help with the deployment? [05:54:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183555 (10phaultfinder) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0600). [06:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:14] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, but this changes an existing sudo rule, so needs SRE IF meeting approval" [puppet] - 10https://gerrit.wikimedia.org/r/1188408 (https://phabricator.wikimedia.org/T404630) (owner: 10CDanis) [06:52:44] RESOLVED: [2x] 
RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:59:56] I can deploy these backports. [07:00:00] o/ [07:00:03] thanks [07:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0700). nyaa~ [07:00:04] hueitan: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [07:03:14] (03Merged) 10jenkins-bot: xLab: Update the PageVisit target wiki for MinT readers [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [07:03:40] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] [07:03:45] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:07:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:09:20] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 141 jobs 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:09:40] !log awight@deploy1003 awight, hueitan: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:09:45] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:11:39] hueitan: Please check on mwdebug [07:11:48] awight: thanks for the deployments! :] [07:12:33] awight checked, see it live now on mwdebug [07:12:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:12:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:12:45] hashar: My pleasure—Spider Pig has not let me down [07:12:51] hueitan: ty [07:12:58] !log awight@deploy1003 awight, hueitan: Continuing with sync [07:13:05] awight: yeah it is quite rad! Maybe one day we will have an equivalent to run Quibble from a web interface! 
:b [07:13:13] the bacula alert will get fixed soon [07:15:03] (03PS2) 10Slyngshede: Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 [07:15:09] (03CR) 10Slyngshede: [C:03+2] Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 (owner: 10Slyngshede) [07:18:15] (03Merged) 10jenkins-bot: Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 (owner: 10Slyngshede) [07:18:16] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] (duration: 14m 35s) [07:18:20] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:18:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [07:18:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Arelion (2001:2035:0:a9a::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:18:45] Finished. On to the second patch... 
[07:18:59] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:19:21] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:21:04] (03Merged) 10jenkins-bot: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [07:21:21] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] [07:21:25] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:59] (03CR) 10Arnaudb: [C:03+2] mailman: add a local disk cache [puppet] - 10https://gerrit.wikimedia.org/r/1188320 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [07:25:47] hueitan: is this one testable? [07:25:55] let me check [07:26:09] maybe I kafkacat or... [07:26:43] hueitan: sorry, it's not quite ready to test yet [07:27:01] I was confusingly asking ahead of time [07:27:40] !log awight@deploy1003 hueitan, awight: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:27:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:58] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:28:02] PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 14 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:28:11] awight i see it live now [07:28:44] !log awight@deploy1003 hueitan, awight: Continuing with sync [07:28:47] hueitan: ack [07:28:55] Thank you, all good [07:30:02] RECOVERY - mailman3-web on lists1004 is OK: PROCS OK: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:30:53] (03PS1) 10Arnaudb: Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696 [07:31:34] (03CR) 10Jelto: [C:03+1] Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696 (owner: 10Arnaudb) [07:32:33] (03CR) 10Arnaudb: [C:03+2] Revert "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188696 (owner: 10Arnaudb) [07:32:43] FIRING: [4x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:32:48] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:33:02] PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 14 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:33:36] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:34:31] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] (duration: 13m 10s) [07:35:02] RECOVERY - mailman3-web on lists1004 is OK: PROCS OK: 13 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:35:20] 💯 [07:35:38] !log UTC morning deployments finished [07:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:48] (03PS1) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 [07:35:49] hueitan: Thanks for the help :-) [07:36:06] awight thank you [07:37:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: 
Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:39:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:42:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:44:06] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/2 (Transit: Arelion (IC-308844) {#1071}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:46:51] (03CR) 10Fabfur: [C:03+2] profile:cache:haproxy: copy utf8ps lua converter on cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1188366 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur) [07:47:43] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:47:43] FIRING: [4x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:38] (03PS1) 10Brouberol: mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162) [07:52:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:48] FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:53] FIRING: [15x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:53:36] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:57:43] FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:57:48] 
RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:02:43] RESOLVED: [8x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:22:04] PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - Certificate kafka-test1008.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:22:22] PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - Certificate kafka-test1006.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:22:22] PROBLEM - Kafka broker TLS certificate validity on kafka-test1010 is CRITICAL: SSL CRITICAL - Certificate 
kafka-test1010.eqiad.wmnet valid until 2025-09-23 08:22:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:23:22] PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is CRITICAL: SSL CRITICAL - Certificate kafka-test1007.eqiad.wmnet valid until 2025-09-23 08:23:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:24:06] PROBLEM - Kafka broker TLS certificate validity on kafka-main2006 is CRITICAL: SSL CRITICAL - Certificate kafka-main2006.codfw.wmnet valid until 2025-09-23 08:24:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:24:48] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [08:25:04] PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - Certificate kafka-test1009.eqiad.wmnet valid until 2025-09-23 08:25:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:26:59] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: add missing client_config_file config in addschange config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188712 (https://phabricator.wikimedia.org/T404162) (owner: 10Brouberol) [08:27:06] PROBLEM - Kafka broker TLS certificate validity on kafka-main2007 is CRITICAL: SSL CRITICAL - Certificate kafka-main2007.codfw.wmnet valid until 2025-09-23 08:27:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:28:04] PROBLEM - Kafka broker TLS certificate validity on kafka-main2009 is CRITICAL: SSL CRITICAL - Certificate kafka-main2009.codfw.wmnet valid until 2025-09-23 08:28:00 +0000 (expires in 6 days) 
https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:31:01] (03CR) 10Elukey: [C:03+1] Apply replica role to maps1012-1014 [puppet] - 10https://gerrit.wikimedia.org/r/1188308 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:31:06] PROBLEM - Kafka broker TLS certificate validity on kafka-main2008 is CRITICAL: SSL CRITICAL - Certificate kafka-main2008.codfw.wmnet valid until 2025-09-23 08:31:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:32:03] (03CR) 10Elukey: [C:03+2] spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [08:34:53] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for hueitan - https://phabricator.wikimedia.org/T404681 (10hueitan) 03NEW [08:35:54] (03PS1) 10Gergő Tisza: User: Simplify makeUpdateConditions() [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) [08:36:38] (03PS1) 10Gergő Tisza: session: Add a mechanism for forcing a refresh [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) [08:37:06] (03PS1) 10Gergő Tisza: Use short expiry for JWT cookies [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) [08:37:24] PROBLEM - Kafka broker TLS certificate validity on kafka-main2010 is CRITICAL: SSL CRITICAL - Certificate kafka-main2010.codfw.wmnet valid until 2025-09-23 08:37:00 +0000 (expires in 6 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [08:38:10] (03PS1) 10Gergő Tisza: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 
(https://phabricator.wikimedia.org/T399200) [08:38:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza) [08:38:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [08:38:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [08:39:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [08:41:33] (03Merged) 10jenkins-bot: spicerack/mysql.py: update CORE_SECTIONS to reflect newly added x3 section [software/spicerack] - 10https://gerrit.wikimedia.org/r/1187871 (https://phabricator.wikimedia.org/T404464) (owner: 10Jasmine) [08:42:02] (03PS1) 10Slyngshede: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) [08:45:12] (03PS2) 10Gergő Tisza: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) [08:45:17] (03CR) 10Gergő Tisza: Enable 
JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [08:45:38] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v11.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1188721 [08:45:47] (03CR) 10CI reject: [V:04-1] Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [08:46:32] (03PS2) 10Slyngshede: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) [08:46:43] (03CR) 10CI reject: [V:04-1] tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [08:47:01] (03PS3) 10Gergő Tisza: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) [08:47:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [08:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:53:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:57:37] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v11.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1188721 (owner: 10Elukey) [08:58:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:58:49] (03PS3) 10Effie Mouzeli: P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) [08:59:25] 10ops-eqiad, 06SRE, 06DC-Ops: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#11183855 (10cmooney) >>! In T380050#10654652, @BCornwall wrote: > Re: https://gerrit.wikimedia.org/r/c/operations/dns/+/1091711/comments/5e6962e8_b88980ce - Do the IPs need to be deleted from netbox? Y... 
[09:00:00] (03CR) 10Effie Mouzeli: [C:04-1] "@kosta, please provide where we define the version, so to add it in the comments and move forward, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [09:00:06] (03PS1) 10Elukey: Upstream release v11.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1188728 [09:00:18] (03CR) 10Effie Mouzeli: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [09:01:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:03:01] (03PS2) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) [09:03:39] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:04:43] (03PS4) 10Effie Mouzeli: P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) [09:05:08] (03CR) 10Effie Mouzeli: "variable is $wgHCaptchaApiUrl" [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [09:05:24] (03PS3) 10Arnaudb: Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) [09:05:40] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v11.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1188728 (owner: 10Elukey) [09:06:14] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:06:28] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 
(https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:06:34] (03PS1) 10Alexandros Kosiaris: deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188731 [09:08:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [09:08:18] (03PS2) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1187828 [09:08:39] RESOLVED: [4x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:08:52] (03CR) 10Arnaudb: [C:03+2] Revert^2 "mailman: add a local disk cache" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:09:33] (03CR) 10Jelto: [C:03+1] Revert^2 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188708 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [09:11:23] (03CR) 10Elukey: [C:03+1] deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188731 (owner: 10Alexandros Kosiaris) [09:12:03] (03CR) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (owner: 10Effie Mouzeli) [09:12:13] (03PS1) 10Arnaudb: Revert^3 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188732 [09:12:57] (03PS3) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1187828 (https://phabricator.wikimedia.org/T403416) [09:12:57] (03CR) 10Arnaudb: [C:03+2] Revert^3 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188732 (owner: 10Arnaudb) [09:13:27] (03CR) 10Effie Mouzeli: P:hcaptcha: add keepalive_timeout (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1187828 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [09:23:50] jouncebot: nowandnext [09:23:50] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [09:23:50] In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1000) [09:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:24:31] (03CR) 10Hnowlan: [C:03+1] P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [09:24:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:24:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183943 (10phaultfinder) [09:25:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:25:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:27:05] !log uploaded spicerack_11.7.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [09:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:13] (03CR) 10Giuseppe Lavagetto: [C:03+1] "we alrready do this in scap IIRC." 
[puppet] - 10https://gerrit.wikimedia.org/r/1188731 (owner: 10Alexandros Kosiaris) [09:30:40] (03CR) 10Slyngshede: [C:03+1] "LGTM. Tested on in local environment." [puppet] - 10https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur) [09:31:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188352 (https://phabricator.wikimedia.org/T404594) (owner: 10Dreamy Jazz) [09:31:30] !log elukey@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [09:31:59] (03Merged) 10jenkins-bot: Document that test2wiki has suggested investigations DB tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188352 (https://phabricator.wikimedia.org/T404594) (owner: 10Dreamy Jazz) [09:32:15] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1188352|Document that test2wiki has suggested investigations DB tables (T404594)]] [09:32:19] T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594 [09:32:54] RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2026-08-23 08:32:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:34:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 
96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:36:27] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS trixie [09:38:19] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:puppetserver::volatile avoid loading Spur data on certain host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184646 (https://phabricator.wikimedia.org/T403616) (owner: 10Slyngshede) [09:38:20] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1188352|Document that test2wiki has suggested investigations DB tables (T404594)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:38:25] T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594 [09:38:42] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [09:39:50] RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2026-08-23 08:34:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:44:04] (03PS2) 10Gergő Tisza: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) [09:44:09] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188352|Document that 
test2wiki has suggested investigations DB tables (T404594)]] (duration: 11m 54s) [09:44:14] T404594: Create suggested investigation database tables on test2wiki - https://phabricator.wikimedia.org/T404594 [09:45:37] (03CR) 10Fabfur: "Thanks, another test is always helpful!" [puppet] - 10https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur) [09:46:02] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Maintenance [09:46:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83353 and previous config saved to /var/cache/conftool/dbconfig/20250916-094609-ladsgroup.json [09:46:14] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [09:46:25] RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2026-08-23 08:21:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:47:58] !log disable puppet on A:cp to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188379 (T401383) [09:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:02] T401383: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383 [09:48:39] (03PS2) 10Federico Ceratto: es2049.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) [09:48:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [09:48:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83354 and previous 
config saved to /var/cache/conftool/dbconfig/20250916-094846-ladsgroup.json [09:49:48] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [09:50:30] 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688 (10SLyngshede-WMF) 03NEW [09:50:48] 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11184085 (10SLyngshede-WMF) p:05Triage→03High [09:52:08] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm [09:52:15] (03CR) 10Fabfur: [C:03+2] haproxy: use utf8ps converter on received headers [puppet] - 10https://gerrit.wikimedia.org/r/1188379 (https://phabricator.wikimedia.org/T401383) (owner: 10Fabfur) [09:52:25] 06SRE: Allow Puppet to pull in XCHEESESCORE git repo - https://phabricator.wikimedia.org/T404688#11184097 (10SLyngshede-WMF) [09:52:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:52:59] RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2026-08-23 08:32:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:53:57] RECOVERY - Kafka broker TLS certificate validity on kafka-test1010 is OK: SSL OK - Certificate kafka-test1010.eqiad.wmnet valid until 2026-08-23 08:23:00 +0000 (expires in 340 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:54:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [09:56:28] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#11184115 (10Jelto) My proposal to move forward is to sync the files from object storage to a local folder on the GitLab host. Ideal... [09:57:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:58:54] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [09:59:58] (03CR) 10Effie Mouzeli: [C:03+2] P:hcaptcha: unset more headers [puppet] - 10https://gerrit.wikimedia.org/r/1188367 (https://phabricator.wikimedia.org/T403416) (owner: 10Effie Mouzeli) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1000) [10:00:04] claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[10:01:01] jynus: volans: heads up, I'm going to start deploying a change to multi-dc.lua on cp nodes https://gerrit.wikimedia.org/r/c/1182815/ [10:01:19] cc fabfur ^ [10:01:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83355 and previous config saved to /var/cache/conftool/dbconfig/20250916-100152-ladsgroup.json [10:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:01:58] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:02:16] (03PS1) 10Brouberol: deployment_server: allow different namespaces to be deployed within a same cluster group [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) [10:02:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:03:09] (03PS2) 10Brouberol: deployment_server: allow different namespaces to be deployed within a group [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) [10:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable 
[10:04:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [10:04:17] (03CR) 10Effie Mouzeli: [C:03+2] P:hcaptcha: add temporary redirect [puppet] - 10https://gerrit.wikimedia.org/r/1188380 (https://phabricator.wikimedia.org/T404251) (owner: 10Effie Mouzeli) [10:06:13] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [10:06:40] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11184198 (10Joe) >>! In T400119#11115049, @TheDJ wrote: > Yeah getting the swagger spec via `curl https://api.wikimedia.org/core/v1/wikipedia/en/search/pag... [10:08:30] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm [10:09:55] (03PS1) 10Effie Mouzeli: P:hcaptcha: typo (oops) [puppet] - 10https://gerrit.wikimedia.org/r/1188737 [10:10:32] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:10:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - proxoid_4260: Servers urldownloader1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:10:42] !log tests looks good, enable puppet on A:cp (T401383) [10:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:46] T401383: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383 [10:11:17] (03CR) 10Effie Mouzeli: [C:03+2] P:hcaptcha: typo (oops) [puppet] - 10https://gerrit.wikimedia.org/r/1188737 (owner: 10Effie Mouzeli) [10:11:44] ^^ that is me [10:11:51] it is ok [10:12:25] FIRING: 
SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:32] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:14:00] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:14:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83356 and previous config saved to /var/cache/conftool/dbconfig/20250916-101420-ladsgroup.json [10:14:25] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:15:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: 10Anzx) [10:15:04] (03PS3) 10Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) [10:15:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:17:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P83357 and previous config saved to /var/cache/conftool/dbconfig/20250916-101700-ladsgroup.json [10:17:04] fabfur: ah, you're deploying things on cp nodes? 
[10:17:16] Should I wait a little for https://gerrit.wikimedia.org/r/c/1182815/ ? [10:17:25] RESOLVED: SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:17:46] (03CR) 10Muehlenhoff: "That seems fine, only the the ipblocks/abuse hierarchy is sourced by the ferm requestctl support and those rules are mostly made in reacti" [puppet] - 10https://gerrit.wikimedia.org/r/1188300 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [10:18:00] claime: I reenabled puppet on A:cp, no problem on my side to proceed with other changes [10:18:17] but thanks for noticing! [10:18:44] fabfur: ack, but since I'll have to re-disable puppet on A:cp, I may still need to wait [10:18:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:18:53] otherwise your change may not deploy in isolation [10:19:32] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1012.eqiad.wmnet with OS trixie [10:23:49] it's ok for me [10:24:02] so, the ongoing recovery of restarted is claime and the recovery of urldownloader was effie, right? 
[10:24:11] No, I've touched nothing yet [10:24:13] (03PS1) 10Federico Ceratto: preseed.yaml: Remove es2050 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859) [10:24:14] ah [10:25:28] (03CR) 10Alexandros Kosiaris: [C:03+2] deploy: Set HELM_DIFF_OUTPUT_CONTEXT=5 in kube_env.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188731 (owner: 10Alexandros Kosiaris) [10:25:33] anything urldownloader is effie rn though :) [10:27:38] !log sudo cumin 'A:cp' "disable-puppet 'Deploying multi-dc.lua changes - T402412 - ${USER}'" [10:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:43] T402412: Route test2wiki rest.php APIs through rest-gateway - https://phabricator.wikimedia.org/T402412 [10:28:11] riposoqualita@gmail.com [10:28:16] ops, bad paste [10:28:25] 👀 [10:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P83358 and previous config saved to /var/cache/conftool/dbconfig/20250916-102928-ladsgroup.json [10:29:37] (03CR) 10Clément Goubert: [C:03+2] multi-dc: Dynamic rewrite to -ro destinations [puppet] - 10https://gerrit.wikimedia.org/r/1182815 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert) [10:29:58] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11184292 (10elukey) 05Resolved→03Open >>! In T394357#11162710, @MatthewVernon wrote: > Hi @Jhancock.wm / @elukey . I've found 2 show-stoppers thus far (the second of which has... 
[10:30:27] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11184299 (10elukey) The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292. [10:31:01] !log Enabling puppet for testing on cp6011 and cp2041 - T402412 - T400131 [10:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:09] T400131: Improved API rerouting strategy for REST gateway - https://phabricator.wikimedia.org/T400131 [10:31:44] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [10:32:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P83359 and previous config saved to /var/cache/conftool/dbconfig/20250916-103208-ladsgroup.json [10:33:23] jynus: yes we are alright [10:33:39] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#11184312 (10jcrespo) If that unblocks you, I am ok with that- sadly because other priorities keep entering data persistence with un... 
[10:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:37:03] elukey@cumin1003 reimage (PID 2807097) is awaiting input [10:42:29] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [10:43:13] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:44:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P83360 and previous config saved to /var/cache/conftool/dbconfig/20250916-104436-ladsgroup.json [10:45:33] (03CR) 10Ladsgroup: [C:03+1] es2049.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:46:09] My tests look good but something looks wrong-ish with the rest-gateway [10:46:26] it's serving 30 5xx per second since a bit past 0955 [10:47:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:47:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T402925)', diff saved to https://phabricator.wikimedia.org/P83361 and previous config saved to /var/cache/conftool/dbconfig/20250916-104715-ladsgroup.json [10:47:20] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts 
- https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:49:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:50:02] Ugh proton again [10:50:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11184382 (10cmooney) Hey @papaul yeah Thursday will be fine thanks. [10:52:16] I'm moving forward despite this, I'll diagnose it in parallel, it's unrelated to the change [10:52:31] !log sudo cumin 'A:cp' "enable-puppet 'Deploying multi-dc.lua changes - T402412 - ${USER}'" [10:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:36] T402412: Route test2wiki rest.php APIs through rest-gateway - https://phabricator.wikimedia.org/T402412 [10:53:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:56:48] (03CR) 10Federico Ceratto: [C:03+2] es2049.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1185879 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:57:21] elukey@cumin1003 interactive (PID 2810037) is awaiting input [10:57:43] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697 (10mszwarc) 03NEW [10:59:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T402925)', diff saved to https://phabricator.wikimedia.org/P83363 and previous config saved to /var/cache/conftool/dbconfig/20250916-105944-ladsgroup.json [10:59:49] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:00:36] 06SRE, 10SRE-Access-Requests: Requesting access to 
deployment for mszwarc - https://phabricator.wikimedia.org/T404697#11184443 (10OKryva-WMF) As Marcin's Engineering Manager, approve. [11:18:19] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update for routed Ganeti - jmm@cumin2002" [11:18:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update for routed Ganeti - jmm@cumin2002" [11:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:21:26] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:04] (03PS4) 10Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) [11:25:05] (03CR) 10CI reject: [V:04-1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [11:27:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:31:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:31:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:33:22] (03CR) 10Btullis: [C:03+1] "Looks good to me, but I would like to make sure that others are also able to review, for visibility." [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [11:39:23] (03PS1) 10Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) [11:39:31] (03PS1) 10Clément Goubert: Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188751 [11:40:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3004.wikimedia.org [11:41:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.218 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:41:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.385 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:41:47] (03PS2) 10Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) [11:42:09] (03CR) 10Hnowlan: [C:03+1] Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188751 (owner: 10Clément Goubert) [11:43:08] (03CR) 10Effie Mouzeli: [C:03+2] P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: 10Kosta Harlan) [11:44:19] (03CR) 10CI reject: [V:04-1] P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: 10Kosta Harlan) [11:45:48] (03PS3) 10Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 
(https://phabricator.wikimedia.org/T404251) [11:47:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3004.wikimedia.org [11:48:08] (03CR) 10CI reject: [V:04-1] P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: 10Kosta Harlan) [11:48:59] RESOLVED: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:33] (03PS4) 10Kosta Harlan: P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) [11:54:51] (03CR) 10Clément Goubert: [C:03+2] Revert^2 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188751 (owner: 10Clément Goubert) [11:54:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11184626 (10phaultfinder) [11:58:51] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11184633 (10elukey) There is something definitely off, I just tested the following and everything hangs: ` -> reset /system1/pwrmgtsvc1 /system1/pwrmgtsvc1 ` I am trying to set... 
[11:59:40] (03PS1) 10Hnowlan: (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1200) [12:00:47] (03Abandoned) 10Kosta Harlan: hCaptcha: Special handling for hcaptcha-secure-api.js requests [puppet] - 10https://gerrit.wikimedia.org/r/1187439 (https://phabricator.wikimedia.org/T404251) (owner: 10Kosta Harlan) [12:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:05:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11184653 (10MoritzMuehlenhoff) There was a small issue with install3004, it lacked the global ipv6 address, which caused failing ipv6 probes to Squid. The rele... 
[12:08:40] (03PS1) 10Esanders: Enable Flow in read-only mode on wikis using LiquidThreads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687) [12:08:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s7 in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83364 and previous config saved to /var/cache/conftool/dbconfig/20250916-120842-ladsgroup.json [12:08:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:08:48] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [12:09:51] (03CR) 10CI reject: [V:04-1] Enable Flow in read-only mode on wikis using LiquidThreads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687) (owner: 10Esanders) [12:15:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1194 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83365 and previous config saved to /var/cache/conftool/dbconfig/20250916-121545-ladsgroup.json [12:15:50] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [12:18:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184704 (10cmooney) >>! In T404609#11181649, @RobH wrote: > @cmooney: What do you think is the best way to go about migrating these connections on upcoming C... 
[12:18:20] !log depooling cp2041 - T402412 [12:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:25] T402412: Route test2wiki rest.php APIs through rest-gateway - https://phabricator.wikimedia.org/T402412 [12:19:58] (03PS2) 10Hnowlan: (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396) [12:22:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184711 (10cmooney) @RobH @Jclark-ctr there is also another way we could try to approach this so may as well mention it now before we start planning. Rack-b... [12:37:06] (03PS1) 10Huei Tan: xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) [12:37:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [12:40:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11184793 (10Jclark-ctr) @cmooney I’m flexible to try either way. Maybe a mix could work? We could start with roles that aren’t single points of failure and ar... 
[12:46:38] (03PS1) 10Urbanecm: feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563) [12:46:46] jouncebot: nowandnext [12:46:46] For the next 0 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1200) [12:46:46] In 0 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300) [12:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:49:14] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:49:38] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:50:58] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [12:53:22] (03PS2) 10Federico Ceratto: preseed.yaml: Remove es2050 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859) [12:53:22] (03PS1) 10Federico Ceratto: instances.yaml: add es2049 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1188769 (https://phabricator.wikimedia.org/T402859) [12:56:11] (03CR) 10Effie Mouzeli: [C:03+2] P:hcaptcha: Adjust regex match [puppet] - 10https://gerrit.wikimedia.org/r/1188750 (https://phabricator.wikimedia.org/T404251) (owner: 10Kosta Harlan) [12:57:52] (03PS1) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [12:58:37] o/ i need someone help with my patch deployment. [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300). [13:00:05] joelyrookewmde, tgr, anzx, and hueitan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] o/ i need someone help with my patch deployment. 
[13:00:19] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [13:00:50] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188374 (https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [13:01:00] I can deploy [13:01:27] o/ [13:01:29] tgr tq [13:01:33] o/ [13:01:42] I’m in a meeting, thanks tgr for deploying :) [13:02:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1253 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83366 and previous config saved to /var/cache/conftool/dbconfig/20250916-130201-ladsgroup.json [13:02:06] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [13:03:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: 10Anzx) [13:03:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: 10Joely Rooke WMDE) [13:05:01] (03Merged) 10jenkins-bot: Lift IP cap for workshop at University of Pretoria on 29-30 September [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187033 (https://phabricator.wikimedia.org/T404218) (owner: 10Anzx) [13:05:04] (03Merged) 10jenkins-bot: Remove feature flag to resolve changelist wikibase link labels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184480 (https://phabricator.wikimedia.org/T395674) (owner: 10Joely Rooke WMDE) [13:05:21] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels 
(T395674)]] [13:05:26] T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218 [13:05:27] T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674 [13:06:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1191 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83367 and previous config saved to /var/cache/conftool/dbconfig/20250916-130618-ladsgroup.json [13:09:35] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11184909 (10AMJohnson) 05Open→03Resolved a:03AMJohnson @DSeyfert_WMF was able to fix this for us. Thank you, Dustin! Going ahead and closing out this... [13:09:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1202 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83368 and previous config saved to /var/cache/conftool/dbconfig/20250916-130935-ladsgroup.json [13:09:41] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [13:10:58] !log tgr@deploy1003 tgr, joelyrookewmde, anzx: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels (T395674)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:11:04] T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218 [13:11:04] T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674 [13:11:25] tgr: nothing to test on throttle [13:11:52] joelyrookewmde: I assume you don't need to test either? 
[13:11:53] @tgr sorry I missed the start of this deployment. Thanks for approving it ! [13:12:02] no all good for me [13:12:10] !log tgr@deploy1003 tgr, joelyrookewmde, anzx: Continuing with sync [13:13:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s7 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83369 and previous config saved to /var/cache/conftool/dbconfig/20250916-131345-ladsgroup.json [13:14:25] hueitan: do you feel confident about your patch? if it's low-risk, I'll bundle it with the other backports [13:14:39] yes, confident [13:15:21] (03CR) 10Stevemunene: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [13:17:35] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187033|Lift IP cap for workshop at University of Pretoria on 29-30 September (T404218)]], [[gerrit:1184480|Remove feature flag to resolve changelist wikibase link labels (T395674)]] (duration: 12m 14s) [13:17:41] T404218: Request for IP exemption for event with University of Pretoria on 2025-09-29 - https://phabricator.wikimedia.org/T404218 [13:17:42] T395674: Post-acceptance cleanup for adding labels to Wikidata recent changes - https://phabricator.wikimedia.org/T395674 [13:17:44] (03CR) 10Phuedx: [C:03+1] xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [13:19:45] (03CR) 10Phuedx: [C:03+1] xLab: Fix instrument to produce valid events (031 comment) [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [13:20:15] tgr: Just reviewed it. 
It's low risk and will reduce event validation errors back to the baseline rate [13:20:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza) [13:20:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:20:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:20:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:20:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [13:22:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one typo inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede) [13:22:25] (03PS1) 10Clément Goubert: Revert^3 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188774 [13:22:37] (03CR) 10Clément Goubert: [V:03+2 C:03+2] Revert^3 "trafficserver: test2wiki rest API to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1188774 (owner: 10Clément Goubert) [13:23:25] (03CR) 10Filippo Giunchedi: [C:03+2] profile: clean up root-authorized-key.erb transition [puppet] - 10https://gerrit.wikimedia.org/r/1188374 
(https://phabricator.wikimedia.org/T317362) (owner: 10Filippo Giunchedi) [13:23:29] jouncebot: nowandnext [13:23:29] For the next 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1300) [13:23:29] In 0 hour(s) and 36 minute(s): Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1400) [13:23:47] (03PS2) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [13:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:25:20] (03PS1) 10Andrew Bogott: codfw1dev: bump horizon build version [puppet] - 10https://gerrit.wikimedia.org/r/1188776 [13:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:25:52] (03Merged) 10jenkins-bot: User: Simplify makeUpdateConditions() [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188715 (https://phabricator.wikimedia.org/T401748) (owner: 10Gergő Tisza) [13:25:56] (03Merged) 10jenkins-bot: session: Add a mechanism for forcing a refresh [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188716 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:26:02] (03Merged) 10jenkins-bot: Use short expiry for JWT cookies [core] (wmf/1.45.0-wmf.18) - 
10https://gerrit.wikimedia.org/r/1188717 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:26:04] (03Merged) 10jenkins-bot: tests: Update for SessionCookieJwtExpiration added in core [extensions/CentralAuth] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188718 (https://phabricator.wikimedia.org/T399200) (owner: 10Gergő Tisza) [13:26:23] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [13:27:05] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: bump horizon build version [puppet] - 10https://gerrit.wikimedia.org/r/1188776 (owner: 10Andrew Bogott) [13:29:22] (03Merged) 10jenkins-bot: xLab: Fix instrument to produce valid events [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188765 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [13:29:43] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events [13:29:43] (T404420)]] [13:29:51] T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748 [13:29:52] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [13:29:52] T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667 [13:29:53] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [13:30:15] (03PS4) 10Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) [13:30:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:31:51] (03PS2) 10Majavah: hieradata: Drop old eqiad1 bastions [puppet] - 10https://gerrit.wikimedia.org/r/1187804 (https://phabricator.wikimedia.org/T392689) [13:31:51] (03PS1) 10Majavah: hieradata: openstack: Update Toolforge bastion example [puppet] - 10https://gerrit.wikimedia.org/r/1188778 (https://phabricator.wikimedia.org/T392510) [13:32:51] (03CR) 10Majavah: [C:03+2] hieradata: Drop old eqiad1 bastions [puppet] - 10https://gerrit.wikimedia.org/r/1187804 (https://phabricator.wikimedia.org/T392689) (owner: 10Majavah) [13:33:02] (03CR) 10Majavah: [C:03+2] hieradata: openstack: Update Toolforge bastion example [puppet] - 10https://gerrit.wikimedia.org/r/1188778 (https://phabricator.wikimedia.org/T392510) (owner: 10Majavah) [13:33:21] (03PS3) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [13:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:35:15] !log repooling cp2041, test inconclusive, rolled back - T402412 [13:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:21] T402412: Route test2wiki rest.php APIs through rest-gateway - 
https://phabricator.wikimedia.org/T402412 [13:35:40] !log tgr@deploy1003 hueitan, tgr: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events (T404420)]] [13:35:40] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:35:48] T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748 [13:35:49] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [13:35:49] T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667 [13:35:50] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [13:36:51] (03CR) 10CI reject: [V:04-1] P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [13:38:35] (y) [13:40:54] (03PS4) 10Slyngshede: P:puppetserver::volatile xcheese [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) [13:41:58] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1017.eqiad.wmnet with OS bookworm [13:43:23] !log tgr@deploy1003 hueitan, tgr: Continuing with sync [13:43:28] (03PS1) 10Andrew Bogott: Prepare cloudcephosd1017 for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188782 (https://phabricator.wikimedia.org/T404249) [13:44:38] (03PS3) 10Majavah: P:toolforge: remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) [13:44:39] (03PS1) 10Majavah: 
P:toolforge::checker: Remove grid base profile [puppet] - 10https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664) [13:44:40] (03PS1) 10Majavah: P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784 [13:45:03] (03PS1) 10Andrew Bogott: Prepare cloudcephosd105* for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188785 (https://phabricator.wikimedia.org/T404249) [13:45:12] (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd1017 for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188782 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [13:48:27] (03PS4) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) [13:48:33] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188715|User: Simplify makeUpdateConditions() (T401748)]], [[gerrit:1188716|session: Add a mechanism for forcing a refresh (T399200)]], [[gerrit:1188717|Use short expiry for JWT cookies (T399200)]], [[gerrit:1188718|tests: Update for SessionCookieJwtExpiration added in core (T399200 T404667)]], [[gerrit:1188765|xLab: Fix instrument to produce valid events [13:48:33] (T404420)]] (duration: 18m 50s) [13:48:41] T401748: Unexpected Phan SecurityCheck failure in UpdateQueryBuilder::execute - https://phabricator.wikimedia.org/T401748 [13:48:42] T399200: Update existing cookie-based sessions to include JWT cookie - https://phabricator.wikimedia.org/T399200 [13:48:42] T404667: CentralAuth tests failing - https://phabricator.wikimedia.org/T404667 [13:48:43] (03CR) 10Bking: [C:03+2] admin_ng: allow opensearch deploy to use role/rolebinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188446 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:48:43] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - 
https://phabricator.wikimedia.org/T404420 [13:50:44] (03PS18) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [13:50:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:51:11] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:51:29] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:51:37] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:51:58] (03Merged) 10jenkins-bot: Enable JWT session cookies on testwiki and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1186593 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:52:11] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]] [13:52:15] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [13:52:39] (03PS4) 10Majavah: P:toolforge: Remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) [13:52:39] (03PS2) 10Majavah: P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784 [13:53:29] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[13:53:59] FIRING: [17x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:54:08] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6955/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188770 (https://phabricator.wikimedia.org/T404688) (owner: 10Slyngshede) [13:54:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump vslow replicas of s4 in eqiad to 300 (T403966)', diff saved to https://phabricator.wikimedia.org/P83370 and previous config saved to /var/cache/conftool/dbconfig/20250916-135433-ladsgroup.json [13:54:38] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [13:55:25] (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudcephosd105* for bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188785 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [13:55:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1160 (candidate master of s4) from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83371 and previous config saved to /var/cache/conftool/dbconfig/20250916-135542-ladsgroup.json [13:57:40] !log tgr@deploy1003 tgr: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:57:45] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [13:58:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188784 (owner: 10Majavah) [13:58:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11185177 (10Jhancock.wm) @elukey 2049 was powered off. 
once i powered it on the nic came up. I'll not set the root for 2053-8 [13:58:59] FIRING: [19x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:00:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1199 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83372 and previous config saved to /var/cache/conftool/dbconfig/20250916-140020-ladsgroup.json [14:00:26] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [14:01:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1247 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83373 and previous config saved to /var/cache/conftool/dbconfig/20250916-140147-ladsgroup.json [14:01:59] (03PS1) 10Majavah: kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788 [14:02:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Fix db1242 weight in s4 (T403966)', diff saved to https://phabricator.wikimedia.org/P83374 and previous config saved to /var/cache/conftool/dbconfig/20250916-140237-ladsgroup.json [14:02:46] (03CR) 10JMeybohm: [C:03+1] "LGTM, but please keep in mind that files/certs already created on the deployment servers will not be cleaned up. You might want to do so m" [puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [14:03:08] (03CR) 10David Caro: [C:03+1] kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788 (owner: 10Majavah) [14:03:13] (03CR) 10Majavah: [C:03+2] kubeadm: Explicitely install kubelet [puppet] - 10https://gerrit.wikimedia.org/r/1188788 (owner: 10Majavah) [14:03:36] (03CR) 10Brouberol: [C:03+2] "Good call @jmeybohm@wikimedia.org thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1188734 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [14:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:03:49] !log tgr@deploy1003 tgr: Continuing with sync [14:03:59] FIRING: [23x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:06:11] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11185242 (10Pcoombe) For fundraising banners we use the country from `mw.centralNotice.data.country` (which allows us to... 
[14:06:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s4 in eqiad (T403966)', diff saved to https://phabricator.wikimedia.org/P83375 and previous config saved to /var/cache/conftool/dbconfig/20250916-140638-ladsgroup.json [14:06:44] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [14:09:13] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: host reimage [14:09:15] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1186593|Enable JWT session cookies on testwiki and beta (T399631)]] (duration: 17m 04s) [14:09:20] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [14:09:39] (03PS7) 10Scott French: hieradata: migrate parsoidtest1001 to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772) [14:10:03] !log UTC afternoon deploys done [14:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:36] 06SRE, 06FR-donorrelations, 06Infrastructure-Foundations, 10Mail: Donations@ doesn't forward to donate@ - https://phabricator.wikimedia.org/T403986#11185266 (10Aklapper) a:05AMJohnson→03DSeyfert_WMF [14:11:20] (03CR) 10Clément Goubert: [C:03+1] hieradata: migrate parsoidtest1001 to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772) (owner: 10Scott French) [14:11:39] (03PS2) 10Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) [14:13:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown 
[14:13:58] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1017.eqiad.wmnet with reason: host reimage [14:16:03] (03CR) 10Scott French: [C:03+2] hieradata: migrate parsoidtest1001 to 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1184119 (https://phabricator.wikimedia.org/T403772) (owner: 10Scott French) [14:18:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:18:59] FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:19:13] (03CR) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [14:19:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185300 (10phaultfinder) [14:23:37] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11185307 (10AKanji-WMF) @XenoRyet and I discussed getting this into our next Sprint as a stretch. 
[14:27:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: 10Cyndywikime) [14:29:09] (03CR) 10Michael Große: [C:03+1] beta(Growth,MetricsPlatform): add notification experiment config and enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: 10Cyndywikime) [14:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:33] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1017.eqiad.wmnet with OS bookworm [14:30:44] (03PS3) 10Sergio Gimeno: beta(Growth,MetricsPlatform): add notification experiment config and enable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: 10Cyndywikime) [14:32:18] (03PS1) 10Arnaudb: Revert^3 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188796 [14:33:30] (03PS1) 10Giuseppe Lavagetto: Add inline patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1188797 [14:34:11] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add inline patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1188797 (owner: 10Giuseppe Lavagetto) [14:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:02] (03PS1) 10Arnaudb: 
Revert^4 "mailman: add a local disk cache" [puppet] - 10https://gerrit.wikimedia.org/r/1188798 [14:37:27] (03PS1) 10Sbisson: SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) [14:38:39] (03CR) 10CI reject: [V:04-1] SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson) [14:38:47] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add inline pattern support - oblivian@cumin1003" [14:38:48] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add inline pattern support - oblivian@cumin1003 [14:39:08] (03PS1) 10Muehlenhoff: imposm-initial-import: Fix check whether imposm is running [puppet] - 10https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565) [14:39:27] (03PS2) 10Muehlenhoff: imposm-initial-import: Fix check whether imposm is running [puppet] - 10https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565) [14:39:34] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add inline pattern support - oblivian@cumin1003 [14:39:35] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add inline pattern support - oblivian@cumin1003" [14:39:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185397 (10phaultfinder) [14:40:13] !log installing libsndfile security updates [14:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:20] (03CR) 10Samuel (WMF): [C:03+1] 
hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) (owner: 10Kosta Harlan) [14:41:33] (03PS1) 10Andrew Bogott: Update nic IDs for cloudcephosd1017 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1188805 (https://phabricator.wikimedia.org/T404249) [14:42:30] (03PS2) 10Sbisson: SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) [14:42:38] (03CR) 10Andrew Bogott: [C:03+2] Update nic IDs for cloudcephosd1017 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1188805 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [14:43:40] (03CR) 10CI reject: [V:04-1] SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) (owner: 10Sbisson) [14:46:16] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::checker: Remove grid base profile [puppet] - 10https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [14:46:29] (03CR) 10Majavah: [C:03+2] P:toolforge::checker: Remove grid base profile [puppet] - 10https://gerrit.wikimedia.org/r/1188783 (https://phabricator.wikimedia.org/T314664) (owner: 10Majavah) [14:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:48:48] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11185433 (10elukey) @Jhancock.wm Hi! 
When you have a moment could you please check if sretest2010 is in a weird state? I am not able to powercycle it.. [14:49:35] !log dancy@deploy1003 Started scap sync-world: Testing for T403882 [14:49:39] T403882: Wikidata N-Triples RDF dumps empty, broken since at least 25 July 2025 - https://phabricator.wikimedia.org/T403882 [14:49:47] jouncebot: nowandnext [14:49:47] For the next 0 hour(s) and 10 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1430) [14:49:47] In 0 hour(s) and 10 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500) [14:50:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:51:47] (03PS5) 10Majavah: P:toolforge: Remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) [14:51:47] (03PS3) 10Majavah: P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784 [14:51:47] (03PS1) 10Majavah: P:toolforge: Delete cmd_checklist test suite [puppet] - 10https://gerrit.wikimedia.org/r/1188807 [14:52:39] (03PS3) 10Sbisson: SpecialContribute: configure new page target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) [14:52:50] (03CR) 10Filippo Giunchedi: [C:03+1] "spot-checked the most common entry paths, LGTM! 
feels-good-meme.png" [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [14:53:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:54:27] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:29] (03PS1) 10Muehlenhoff: Reset maps nodes for a fresh import [puppet] - 10https://gerrit.wikimedia.org/r/1188808 [14:57:21] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bookworm [14:57:25] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1052.eqiad.wmnet with OS bookworm [14:57:28] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bookworm [14:58:49] (03CR) 10Elukey: [C:03+1] Reset maps nodes for a fresh import [puppet] - 10https://gerrit.wikimedia.org/r/1188808 (owner: 10Muehlenhoff) [14:59:13] (03PS2) 10Krinkle: Disable wmgUseMdotRouting on cawiki, hewiki, itwiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185120 (https://phabricator.wikimedia.org/T403510) [14:59:17] (03PS3) 10Krinkle: varnish: Enable unified mobile routing on cawiki, hewiki, itwiki (group1) [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) [15:01:36] !log dancy@deploy1003 Finished scap sync-world: Testing for T403882 (duration: 12m 01s) [15:01:40] T403882: Wikidata N-Triples RDF dumps empty, broken since at least 25 July 2025 - https://phabricator.wikimedia.org/T403882 [15:07:38] (03PS1) 10Andrew Bogott: cloudcephosd: 
add python3-packaging [puppet] - 10https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249) [15:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:09:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:24] (03PS2) 10Andrew Bogott: cloudcephosd: add python3-packaging [puppet] - 10https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249) [15:09:28] (03CR) 10Dzahn: [C:03+2] phabricator: remove defunct ElasticSearch backend settings [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:09:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185524 (10phaultfinder) [15:10:14] andre: no phorge deploy? 
[15:10:33] (03PS1) 10Brouberol: kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188811 [15:10:41] (03CR) 10Clément Goubert: [C:03+1] (api|rest)-gateway: set Server header if supplied by service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188758 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [15:10:49] (03CR) 10Brouberol: [C:03+2] kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188811 (owner: 10Brouberol) [15:10:51] (03CR) 10Brouberol: [V:03+2 C:03+2] kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188811 (owner: 10Brouberol) [15:10:57] mutante: not this week, need to test the upstream pull more [15:11:55] andre: ok, ACK! we are doing the puppet patch that removes elasticsearch config [15:12:06] had it planned for the window.. remember [15:12:21] or that was the suggestion.. so getting it out now [15:12:46] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd: add python3-packaging [puppet] - 10https://gerrit.wikimedia.org/r/1188810 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [15:13:23] mutante, argh, true. Sorry, I forgot that one [15:13:33] (03CR) 10Urbanecm: [C:03+2] feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563) (owner: 10Urbanecm) [15:14:08] andre: arrr.. I realized this file is under hieradata/role/eqiad .. that is kind of bad [15:14:41] mutante, feel free not to deploy and rethink the problem :) [15:15:23] andre: well.. 2 options here.. 
either stuff is duplicated for each DC or it needs a second patch for codfw [15:15:32] ok [15:17:46] (03CR) 10Dzahn: [C:03+2] "this only affects eqiad but the same thing also exists in codfw - needs another patch" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:18:40] (03PS1) 10Dzahn: phabricator: drop elasticsearch settings in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948) [15:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:19:55] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188815" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:20:07] (03CR) 10Dzahn: [C:03+2] phabricator: drop elasticsearch settings in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [15:20:40] (03PS1) 10Brouberol: kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 [15:20:56] (03CR) 10Brouberol: [C:03+2] kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 (owner: 10Brouberol) [15:21:02] (03CR) 10Brouberol: [V:03+2 C:03+2] kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 (owner: 10Brouberol) [15:21:41] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:23:18] !log 
fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:23:51] (03Merged) 10jenkins-bot: feat: Allow communities to opt out experienced users from mentorship [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188767 (https://phabricator.wikimedia.org/T403563) (owner: 10Urbanecm) [15:24:09] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-codfw [15:24:24] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-codfw [15:24:56] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-codfw [15:25:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-codfw [15:25:23] jouncebot: nowandnext [15:25:23] For the next 0 hour(s) and 34 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500) [15:25:23] In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600) [15:26:20] 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185583 (10elukey) Thanks a lot for the patience folks, we have stopped onboarding new SLOs in Pyrra temporarily while we figure out T403729. We are comparing the results with another tool in T404171,... [15:26:29] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]] [15:26:32] Anyone mind if I deploy a security patch in this window? 
[15:26:34] T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563 [15:26:43] Oh it seems that someone started scap as I said that :D [15:26:44] Dreamy_Jazz: i am currently deploying sth, but no concerns once i'm done [15:26:49] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-by27-esams [15:26:57] sorry! CI just finished, so it started. [15:27:01] Np [15:27:02] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-by27-esams [15:27:08] (03CR) 10Dzahn: [C:03+2] "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the na" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:27:13] (03CR) 10Dzahn: [C:03+2] "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the na" [puppet] - 10https://gerrit.wikimedia.org/r/1188815 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [15:28:01] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-b13-drmrs [15:28:15] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b13-drmrs [15:28:33] (03CR) 10Majavah: [C:03+2] P:toolforge: Remove support for grid bastions [puppet] - 10https://gerrit.wikimedia.org/r/1012752 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [15:28:59] FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:29:08] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-b12-drmrs [15:29:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network 
device asw1-b12-drmrs [15:29:36] andre: 3 different problems :) [15:29:45] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-esams [15:29:58] mutante, sorry, I did not see that can of worms coming and thought it's gonna be trivial :( [15:30:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-esams [15:30:19] if you want to turn that into a phab task feel free to I guess [15:30:26] (03CR) 10Majavah: [C:03+2] P:toolforge: Cleanup buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188784 (owner: 10Majavah) [15:30:46] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-bw27-esams [15:30:49] andre: no blame! just sharing. I will leave comments [15:30:59] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-bw27-esams [15:31:07] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-drmrs [15:31:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-drmrs [15:31:30] sorry for the spam with these cookbook runs for certs [15:32:28] (03PS1) 10Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - 10https://gerrit.wikimedia.org/r/1188820 [15:32:51] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-esams [15:33:09] (03PS1) 10Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - 10https://gerrit.wikimedia.org/r/1188821 [15:33:16] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-esams [15:33:18] jouncebot: nowandnext [15:33:19] For the next 0 hour(s) and 26 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1500) [15:33:19] In 0 hour(s) and 26 minute(s): Puppet request window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600) [15:33:27] I'd like to deploy a MediaWiki patch [15:33:40] (03Abandoned) 10Dzahn: Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - 10https://gerrit.wikimedia.org/r/1188821 (owner: 10Dzahn) [15:33:59] FIRING: [27x] CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:34:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:08] I'm in the queue to deploy a security patch [15:34:09] (03PS1) 10Dzahn: Revert "phabricator: drop elasticsearch settings in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1188822 [15:34:13] (03PS1) 10Kosta Harlan: hCaptcha: Enable version pinning and subresource integrity [extensions/ConfirmEdit] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188823 (https://phabricator.wikimedia.org/T404251) [15:34:19] There is already a scap backport happening [15:34:23] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-drmrs [15:34:29] (03PS1) 10Kosta Harlan: hCaptcha: Enable version pinning and subresource integrity [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188824 (https://phabricator.wikimedia.org/T404251) [15:34:41] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-drmrs [15:34:46] (03PS1) 10Scott French: shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) [15:34:47] (03PS1) 10Scott French: shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284) [15:34:48] (03PS1) 10Scott French: shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) [15:34:56] !log jhancock@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc2001 [15:34:59] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr3-eqsin [15:35:06] !log jhancock@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc2001 [15:35:24] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:35:33] Dreamy_Jazz: ack, please ping me when you're done [15:35:41] Sure [15:35:43] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-eqsin [15:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:36:01] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr4-ulsfo [15:36:02] (03CR) 10Dzahn: [C:03+2] "a couple other things are needed here:" [puppet] - 10https://gerrit.wikimedia.org/r/1185884 (https://phabricator.wikimedia.org/T403948) (owner: 10Aklapper) [15:36:10] urbanecm: Mind pinging me when you are done? 
[15:36:13] sure [15:36:18] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr4-ulsfo [15:36:32] (03CR) 10Dzahn: [C:03+2] Revert "phabricator: remove defunct ElasticSearch backend settings" [puppet] - 10https://gerrit.wikimedia.org/r/1188820 (owner: 10Dzahn) [15:36:45] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr3-ulsfo [15:36:59] 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185664 (10CDanis) Luca, do you want an early test subject for the Sloth trial? [15:37:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-ulsfo [15:37:13] (03CR) 10Dzahn: [C:03+2] Revert "phabricator: drop elasticsearch settings in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1188822 (owner: 10Dzahn) [15:37:45] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f3-eqiad [15:37:51] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f3-eqiad [15:38:07] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f2-eqiad [15:38:13] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f2-eqiad [15:38:20] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e3-eqiad [15:38:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e3-eqiad [15:38:31] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e2-eqiad [15:38:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e2-eqiad [15:38:45] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e1-eqiad [15:38:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e1-eqiad [15:39:00] FIRING: [27x] 
CertAlmostExpired: Certificate for service asw1-b12-drmrs.mgmt.drmrs.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:39:07] (03PS1) 10Majavah: backy2: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188825 [15:39:07] (03PS1) 10Majavah: ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 [15:39:07] (03PS1) 10Majavah: P:toolforge::checker: Remove absent checks [puppet] - 10https://gerrit.wikimedia.org/r/1188827 [15:39:08] (03PS1) 10Majavah: P:wmcs::metricsinfra: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188828 [15:39:08] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-f4-eqiad [15:39:09] (03PS1) 10Majavah: P:toolforge::prometheus: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188829 [15:39:14] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-f4-eqiad [15:39:20] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-e4-eqiad [15:39:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-e4-eqiad [15:39:37] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-b1-codfw [15:39:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-b1-codfw [15:39:56] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-eqiad [15:39:58] (03CR) 10CI reject: [V:04-1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [15:40:08] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:40:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-eqiad [15:40:25] !log cmooney@cumin1003 
START - Cookbook sre.network.tls for network device cloudsw1-c8-eqiad [15:40:29] (03CR) 10CI reject: [V:04-1] P:wmcs::metricsinfra: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188828 (owner: 10Majavah) [15:40:36] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-c8-eqiad [15:40:49] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f1-eqiad [15:40:54] (03PS2) 10Scott French: shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) [15:40:54] (03PS2) 10Scott French: shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284) [15:40:54] (03PS2) 10Scott French: shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) [15:40:55] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-eqiad [15:41:05] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cloudsw1-d5-eqiad [15:41:16] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-d5-eqiad [15:41:25] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:35] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-eqiad [15:41:38] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [15:41:43] (03PS2) 10Majavah: ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 [15:41:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqiad [15:42:28] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - 
issue on wikikube-worker2324:9290 - https://phabricator.wikimedia.org/T404480#11185695 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:42:30] (03CR) 10CI reject: [V:04-1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [15:42:42] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [15:42:43] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [15:44:01] (03CR) 10RLazarus: [C:03+1] shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [15:44:08] (03CR) 10RLazarus: [C:03+1] shellbox-timeline: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188818 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [15:44:14] (03CR) 10RLazarus: [C:03+1] shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [15:44:27] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage [15:45:02] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [15:45:25] jhancock@cumin1002 provision (PID 1127341) is awaiting input [15:45:34] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1188807 (owner: 10Majavah) [15:45:41] (03CR) 10Majavah: [C:03+2] P:toolforge: Delete cmd_checklist test suite [puppet] - 10https://gerrit.wikimedia.org/r/1188807 (owner: 10Majavah) [15:47:09] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11185731 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm 
[15:48:32] urbanecm: I guess this is still going because it modified i18n? [15:48:39] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1186649 (owner: 10Andrew Bogott) [15:48:55] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage [15:49:34] I need to go, so kostajh you've moved forward in the queue [15:49:51] I'll leave the security patch till later [15:51:04] 10SRE-SLO, 10Charts, 06Reader Growth Team: Finalize Charts SLO - https://phabricator.wikimedia.org/T399613#11185773 (10elukey) >>! In T399613#11185664, @CDanis wrote: > Luca, do you want an early test subject for the Sloth trial? Definitely, the first use case will be Citoid so we can make a comparison with... [15:52:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [15:54:37] Dreamy_Jazz: likely [15:55:18] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:58:00] (03CR) 10Btullis: "The values themselves look good, but you haven't enabled the installation for the dse-k8s-codfw cluster." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [16:00:05] jhathaway and moritzm: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1600). [16:00:05] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:31] o/ [16:00:59] o/ [16:01:17] there are some MW patches going out currently, as a heads up ^ [16:02:36] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bookworm [16:02:49] (03CR) 10JHathaway: [C:03+2] Add Apache configuration for Wikimedia Thailand wiki [puppet] - 10https://gerrit.wikimedia.org/r/1187539 (https://phabricator.wikimedia.org/T400001) (owner: 10Zabe) [16:04:09] !log urbanecm@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.45.0-wmf.17,1.45.0-wmf.18,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/ [16:04:09] mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.210.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi [16:04:09] awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.210.0) (duration: 37m 39s) [16:04:22] what the hell? [16:04:22] zabe: patch merged [16:04:33] jhathaway: thx :) [16:05:03] jhathaway: might the merge interfere with the scap (that was running from before)? or is that unrelated? 
[16:06:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1052.eqiad.wmnet with OS bookworm [16:06:40] urbanecm: not sure [16:06:56] urbanecm: I think it would have to be that the patch got merged *and* puppet ran on the deployment host, for there to be any possible effect [16:07:07] fair [16:07:14] this seems to be the key part of the log https://www.irccloud.com/pastebin/6OvUSnro/ [16:07:42] on deploy1003 it last finished at 15:54, well before the +2 [16:07:49] yeah [16:08:15] soooo [16:08:40] it did renew some certificates, for mw-experimental / mw-experimental-deploy [16:09:03] I don't know if scap uses those? [16:09:21] it seems pushing to docker-registry failed [16:09:32] I don't think that would have used those certs [16:09:46] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:09:46] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:09:55] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bookworm [16:11:04] i can also try again and hope the push'll work on second try :-/ [16:12:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11185881 (10RobH) I overthought this, we should just move them with an SFP-T to the new port and worry about reimage and migration to full 10G later. [16:13:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11185897 (10RobH) [16:13:42] cdanis: any objections to that? or do you want to look into where this occured? 
[16:14:22] urandom: no objections [16:14:40] nothing jumping out at me in https://grafana.wikimedia.org/d/StcefURWz/docker-registry?orgId=1&from=now-3h&to=now&timezone=utc&var-datasource=000000006&var-instance=$__all either [16:15:06] ack, restarting [16:15:39] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]] [16:15:43] T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563 [16:18:39] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:18:41] (03PS2) 10Muehlenhoff: Reset maps nodes for a fresh import [puppet] - 10https://gerrit.wikimedia.org/r/1188808 [16:19:55] does scap not write to ~/scap-image-build-and-push-log anymore? [16:20:37] (03CR) 10Elukey: [C:03+1] imposm-initial-import: Fix check whether imposm is running [puppet] - 10https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:20:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188808 (owner: 10Muehlenhoff) [16:21:43] ohhh spiderpig [16:22:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11185938 (10elukey) @Jhancock.wm perfect I can confirm that the provision cookbook ran fine (the test-cookbook version I mean). At this point we could use it to... 
[16:23:19] cdanis: should be in `/var/lib/spiderpig/` i think [16:23:27] yep, just found it [16:23:27] but not sure if the logs survived the restart [16:23:43] yea [16:24:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11185965 (10RobH) [16:24:09] (03CR) 10Muehlenhoff: [C:03+2] Reset maps nodes for a fresh import [puppet] - 10https://gerrit.wikimedia.org/r/1188808 (owner: 10Muehlenhoff) [16:24:30] nothing in the log for the past 8 minutes [16:24:46] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:24:46] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:27:21] cdanis: fwiw, https://spiderpig.wikimedia.org/jobs/563 seems to have most of the log of the past attempt [16:27:56] i can confirm it's making progress, per ss -tpi | grep -A1 docker [16:28:20] ... it's writing to codfw? [16:29:05] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:29:38] oh [16:33:58] Hmm. [16:34:11] Unpleasant. [16:35:44] did it finish? [16:35:55] no [16:41:02] I don't think it's a network problem though -- in the ss output, bytes_retrans is very small, and there's also `app_limited`, which is the TCP stack claiming that it isn't the problem lol [16:42:14] The original attempt (https://spiderpig.wikimedia.org/jobs/563) shows that full l10n rebuild happened, so this will be a ~40 minute deployment (if the registry doesn't burp) [16:43:14] what's interesting is the registry doesn't think it served any 5xx [16:43:29] nginx? 
[16:43:50] e.g., possibly filled up its spooling area [16:44:37] ahhh [16:44:54] yeah, usually if dockerd received a 5xx, but the registry has no knowledge of that, it's going to be the staging area [16:45:15] 2025/09/16 15:56:55 [crit] 1731183#1731183: *1091214 pwrite() "/var/lib/nginx/body/0000003080" failed (28: No space left on device), client: 10.64.16.93, server: , request: "PATCH /v2/restricted/mediawiki-multiversion/blobs/uploads.... [16:45:24] le sigh [16:45:35] this is a ganeti node huh [16:46:47] https://grafana.wikimedia.org/goto/74s6h5CHR?orgId=1 the memory and network graphs tell it all [16:47:59] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1056.eqiad.wmnet with OS bookworm [16:48:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11186016 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm executed with errors: - es1056 (**F... [16:48:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11186015 (10Jhancock.wm) yeah that's probably a good idea to do that. I finally got around to getting 53-58 iped. should be done by end of day so they're ready... 
[16:48:26] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1056.eqiad.wmnet with OS bookworm [16:48:33] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11186018 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm [16:49:16] /var/lib/nginx on registry2004 only uses 48% of the disk though (now) [16:49:26] cdanis: https://phabricator.wikimedia.org/T390251 mentions this pain, amongst other related pains [16:50:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:51:14] (03PS1) 10Andrew Bogott: Update nic names for cloudcephosd1052/bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1188837 (https://phabricator.wikimedia.org/T404249) [16:51:51] It pushed! [16:51:57] woohoo! [16:52:30] I guess dockerd tends to be fairly persistent :) [16:52:31] mutante: Just to confirm, is /var/lib/nginx a real filesystem (not tmpfs) ? [16:52:39] and now /var/lib/nginx has 0 usage [16:52:44] it's tmpfs [16:52:48] on a separate mount point [16:52:54] nod.. ok. that's what I thought. [16:53:05] (03CR) 10Andrew Bogott: [C:03+2] Update nic names for cloudcephosd1052/bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1188837 (https://phabricator.wikimedia.org/T404249) (owner: 10Andrew Bogott) [16:53:15] size: 4G [16:53:26] (i.e., too small for known workloads) [16:54:01] there's ostensibly some reason we can't make this larger, but my recollection of the details is failing me [16:54:15] some ganeti-related constraint? 
[16:54:33] swfrench-wmf: this is odd to me https://phabricator.wikimedia.org/P83378 [16:54:46] there's persistently some of those [16:54:57] is it really just making fresh connections to ms-fe that often? [16:55:32] dancy: left a comment on the ticket you linked [16:56:50] (03CR) 10Andrew Bogott: [C:03+2] Ceph rbd: remove option to use 'civetweb' front-end [puppet] - 10https://gerrit.wikimedia.org/r/1186649 (owner: 10Andrew Bogott) [16:57:59] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:58:04] T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563 [16:59:11] swfrench-wmf: maybe because the limit we have for RAM on ganeti instances is 8GB and by default, without changing it specifically, tmpfs defaults to half of the RAM size [16:59:42] cdanis: good question, and to be honest, I don't really know off hand. much of the docker registry swift storage backend implementation is ... questionable, so I frankly have no idea how frequently it churns through connections. [17:00:04] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1018.eqiad.wmnet with OS bookworm [17:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1700). [17:00:42] mutante: ah, thanks - yeah, if this is a case of "defaults that are hard to change later"(?) then that makes quite a bit of sense. 
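A minimal sketch of the sizing arithmetic discussed above, assuming the tmpfs on /var/lib/nginx is mounted without an explicit size option, so the kernel default of half of RAM applies. The function name is hypothetical; the 8 GB and 16 GB figures are the VM sizes mentioned in the conversation.

```python
GIB = 1024 ** 3


def default_tmpfs_bytes(ram_bytes: int) -> int:
    """Size a tmpfs gets when mounted with no size= option: half of RAM."""
    return ram_bytes // 2


# 8 GiB is the current VM size from the chat; 16 GiB is the proposed bump.
for ram_gib in (8, 16):
    size = default_tmpfs_bytes(ram_gib * GIB)
    print(f"{ram_gib} GiB RAM -> {size // GIB} GiB default tmpfs")
```

With 8 GiB of RAM this gives the 4G spool reported in the channel; a tmpfs that is already mounted keeps its old size until it is remounted.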
[17:01:12] o/ I'll hold off on my work planned for the infra window until the ongoing mediawiki deployment wraps up [17:02:51] swfrench-wmf: it is not hard to expand the RAM size of ganeti instances [17:04:21] cdanis: indeed, I was just reading through [0] and it seems the main sticking point is coordinating the restarts. [17:04:21] [0] https://wikitech.wikimedia.org/wiki/Ganeti#Increase/Decrease_CPU/RAM [17:04:45] what worries me is, though, that's just a bandaid [17:05:00] we need to figure out why the swift cluster was just effectively not accepting bytes, from registry's POV [17:05:17] https://grafana.wikimedia.org/goto/Nj6z0cjNR?orgId=1 [17:05:48] I'm not sure I follow [17:06:08] but yes, like most things here, 100% bandaid [17:06:27] it isnt but to get more than 8GB and up to 16 you would have to talk to infra foundations about a special case, afaict [17:07:13] cdanis: I notice that this most recent build-and-push took 35 minutes. The same operation for train presync takes about 26 minutes . [17:07:16] and reboot the machines which might renumber drives and need an edit to /etc/fstab to make them come back [17:07:38] swfrench-wmf: if you zoom out -- the normal case is that the registry is effectively streaming bytes to swift as it receives them [17:07:50] and, it wasn't that the NICs on the ganeti machines were saturated (they're 10gbit) [17:09:09] cdanis: AFAIK, that's not what happens here, though - i.e., dockerd sends a single monolithic chunk to the registry, which is staged in the staging dir in its entirety until the full upload completes, then the registry starts uploading to swift. [17:09:17] i.e., there's no streaming in this case [17:09:20] hmm [17:09:33] so is it just that these files were especially big because of the l10n update or something? [17:10:25] maybe because 2 things were being uploaded at the same time? 
just saying that because it was: "disk full" then "48% use" and then "0% use" once it completed [17:10:30] cdanis: exactly, yeah - an l10n update will consistently trigger a full image rebuild [17:10:41] (03CR) 10Eric Gardner: ReaderExperiments' ImageBrowsing stream configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403255) (owner: 10Marco Fossati) [17:10:47] And we upload several images at once so that it doesn't take an eternity [17:10:59] there's a _lot_ of things being pushed at the same time - {8.1, 8.3} x {multiversion, singleversion} [17:10:59] okay sounds like we just need more spool space [17:11:09] it feels to me it was bad luck that 2 things happened at once that dont always happen at once [17:11:10] Yes please [17:11:50] I think that would go a long way here, yeah, considering this is a persistent problem even outside of the mediawiki image context - e.g., I've run into this during production image rebuilds [17:12:08] (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [17:12:54] I run into it periodically on gitlab CI jobs (trusted runners that can push to the prod registry) [17:13:22] I'm sure others have too but haven't reported (I haven't either.. I just hit sigh and the retry button) [17:14:19] (03PS7) 10Brouberol: deployment_server: restore service private files ownership [puppet] - 10https://gerrit.wikimedia.org/r/1188795 (https://phabricator.wikimedia.org/T404068) [17:15:11] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1056.eqiad.wmnet with reason: host reimage [17:15:13] urbanecm: just to confirm, you saw that your 2nd attempt is now ready on testservers, correct? 
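A rough sketch of why concurrent pushes can overflow the spool, assuming (as described above) that each upload is staged in full before the registry forwards it to swift. The image labels and the 1.5 GiB per-upload size are hypothetical placeholders, not measured values; only the {8.1, 8.3} x {multiversion, singleversion} cross-product and the 4G spool size come from the conversation.

```python
from itertools import product

PHP_VERSIONS = ("8.1", "8.3")
FLAVOURS = ("multiversion", "singleversion")


def concurrent_pushes():
    """One label per image pushed at the same time (labels are illustrative)."""
    return [f"{flavour}-php{php}" for php, flavour in product(PHP_VERSIONS, FLAVOURS)]


def fits_in_spool(upload_sizes_gib, spool_gib):
    """True when every in-flight upload can be staged in the spool at once."""
    return sum(upload_sizes_gib) <= spool_gib


pushes = concurrent_pushes()
hypothetical_upload_gib = 1.5  # assumption for illustration only
demand = [hypothetical_upload_gib] * len(pushes)

print(len(pushes), "concurrent pushes")
print("fits in 4 GiB spool:", fits_in_spool(demand, 4))
print("fits in 12 GiB spool:", fits_in_spool(demand, 12))
```

Adding another PHP version or image flavour grows the product, so a fixed-size spool can be outgrown again even after it is enlarged.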
[17:15:15] it's interesting we're keeping it on tmpfs, i assume that's to avoid drbd latency but it might be worse overall [17:16:36] swfrench-wmf: is serviceops the service owner for docker-registry? [17:16:54] Nod. It would be nice if nginx just used anonymous memory for spooling until it reached a limit, then fell back to writing to real disk. [17:17:20] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6966/co" [puppet] - 10https://gerrit.wikimedia.org/r/1188795 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [17:17:25] (but not cross-host disk..) [17:17:41] cdanis: serviceops is probably the closest thing to it, yes [17:18:24] anyone object to me bumping these VMs up to 16GB RAM? from the infra foundations side I can confidently say we have plenty of RAM available in both ganeti clusters [17:18:58] cdanis: I have zero objections and was just about to open a task for that [17:19:03] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1056.eqiad.wmnet with reason: host reimage [17:19:10] OMG I'm so happy [17:19:11] swfrench-wmf: please open a task and I'll start pushing buttons [17:20:46] actually, brb more iced coffee [17:21:30] urbanecm: Checking in to see if you're still testing your change [17:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:24:29] cdanis: https://phabricator.wikimedia.org/T404742 [17:25:41] thanks!
[17:25:45] I'll do eqiad first [17:25:57] FYI, given the hour, and the fact that my change should not in practice conflict with the pending backport (just an abundance of caution), I'm going to proceed with my shellbox change. [17:26:02] thanks, cdanis! [17:26:24] (03CR) 10Scott French: [C:03+2] shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [17:26:26] I don't agree that it's a bandaid to properly size the staging area for the workloads that we actually have. [17:27:20] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1018.eqiad.wmnet with reason: host reimage [17:28:10] (03Merged) 10jenkins-bot: shellbox: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188817 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [17:28:25] dancy: yeah, it's a heavily qualified use of "bandaid" ... what I mean here is that, as long as we're not able to back this with disk, we can always get ourselves into trouble again by increasing one of the terms in the cross-product of stuff we're building :) [17:28:41] Gotcha [17:28:54] i.e., much more storage would be better [17:29:08] !log T404742 💙cdanis@ganeti1046.eqiad.wmnet ~ 🕜☕ sudo gnt-instance modify -B memory=16g registry1004.eqiad.wmnet [17:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:12] T404742: Increase RAM and nginx tmpfs on docker registry hosts - https://phabricator.wikimedia.org/T404742 [17:29:20] more storage == more better [17:29:26] MOAR! [17:29:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11186239 (10Jhancock.wm) update. everything but 2056 is ready. that one has a physical issue. the console and idrac connections are on a removeable card on thes...
[17:29:32] !log cdanis@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM registry1004.eqiad.wmnet [17:30:12] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [17:30:16] Anyone remember Sisters of Mercy? I want more! [17:30:38] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:32:58] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1018.eqiad.wmnet with reason: host reimage [17:33:01] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:33:54] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:34:02] !log cdanis@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1004.eqiad.wmnet [17:34:09] dancy: in the temple of love.. 
shine like thunder [17:36:08] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [17:36:18] !log T404742 💙cdanis@ganeti1046.eqiad.wmnet ~ 🕜☕ sudo gnt-instance modify -B memory=16g registry1005.eqiad.wmnet [17:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:23] T404742: Increase RAM and nginx tmpfs on docker registry hosts - https://phabricator.wikimedia.org/T404742 [17:36:25] !log cdanis@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM registry1005.eqiad.wmnet [17:37:52] cdanis: the VM has more RAM now but that does not change the size of the tmpfs mounted on /var/lib/nginx yet.. guess it has to be recreated too [17:38:06] urbanecm: Please come back! It would be very painful to have to revert your change [17:38:34] mutante: If it is mounted with default mount options, it will automatically be half the size of RAM [17:39:08] * swfrench-wmf is checking puppet [17:39:10] dancy: yea, but it was already mounted when RAM was 8GB [17:39:13] vriley@cumin1003 reimage (PID 2844030) is awaiting input [17:39:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [17:39:49] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1056.eqiad.wmnet with OS bookworm [17:40:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11186287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host es1056.eqiad.wmnet with OS bookworm completed: - es1056 (**PASS**) -... [17:40:12] I found it in puppet [17:40:15] writing a patch [17:40:17] would a mount -o remount change it?
[17:40:22] yes [17:40:43] amazing, thanks cdanis [17:40:55] !log cdanis@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1005.eqiad.wmnet [17:41:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11186288 (10VRiley-WMF) 05Open→03Resolved [17:41:31] cdanis: happy to send a patch your way if you're busy with actually doing the important part :) [17:41:43] Thanks to all of you! [17:43:44] swfrench-wmf: so I've just found https://phabricator.wikimedia.org/T359067 [17:44:22] in which a.kosiaris called what I'm planning to do "out of the question" as of a year ago ... but I kind of disagree with that analysis of the situation, unless I'm missing something [17:44:43] okay, a year and a half ago [17:44:57] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11186297 (10Jhancock.wm) @elukey i found it booted to the sretest and was responsive. mgmt ip pinged. rebooted it via keyboard. mgmt pings and i can login to the BMC. Not sure wha... [17:46:07] 10GB image is pretty massive. [17:46:18] but it's not just one image at a time [17:46:53] and if you want to limit image or layer size, do it another way, not via how much RAM is available here [17:47:35] I wasn't aware of this task, but my quick take would be that this was specific to the poor design of the specific images being discussed there. in our case, this is more of a question of concurrent pushes of (large, but still significantly smaller) images. [17:48:20] about when did we start doing concurrent pushes from scap? [17:49:14] which ones? 
(kidding) [17:49:54] we've been doing concurrent pushes from scap on and off for over a year at least, with varying numbers of images being pushed [17:50:00] nod [17:50:14] okay so since march 2024, we also approximately doubled ganeti cluster ram+cpu since then, and i assume, in the form of bigger machines (so also making larger VMs less troublesome) [17:51:04] (03PS1) 10CDanis: docker_registry: nginx spool: bump to 12Gbyte [puppet] - 10https://gerrit.wikimedia.org/r/1188843 (https://phabricator.wikimedia.org/T404742) [17:51:06] most recently, we increased the number for both (1) an additional PHP version (which will linger for at least the next 1-2 months) and (2) the addition of the singleversion images [17:51:58] okay yeah, I was imagining that singleversion was probably a factor [17:52:18] the extra PHP version is a big contributor too [17:52:29] yep [17:53:48] (03PS2) 10CDanis: docker_registry: nginx spool: bump to 12Gbyte [puppet] - 10https://gerrit.wikimedia.org/r/1188843 (https://phabricator.wikimedia.org/T404742) [17:53:55] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188843 (https://phabricator.wikimedia.org/T404742) (owner: 10CDanis) [17:55:16] swfrench-wmf: https://puppet-compiler.wmflabs.org/output/1188843/4982/registry2004.codfw.wmnet/index.html lgtu? [17:56:39] for rollout I was just going to reuse the ganeti.reboot-vm cookbook because it already depools and forces a puppet run before repooling [17:56:47] cdanis: LGTM!
I do wonder if we need to depool each host while that's applied [17:56:53] ah, perfect [17:56:54] yes [17:57:15] (03CR) 10Scott French: [C:03+1] docker_registry: nginx spool: bump to 12Gbyte [puppet] - 10https://gerrit.wikimedia.org/r/1188843 (https://phabricator.wikimedia.org/T404742) (owner: 10CDanis) [17:57:18] (03CR) 10CDanis: [C:03+2] docker_registry: nginx spool: bump to 12Gbyte [puppet] - 10https://gerrit.wikimedia.org/r/1188843 (https://phabricator.wikimedia.org/T404742) (owner: 10CDanis) [17:58:08] for now, disabling puppet on registry2* [17:58:16] (03PS1) 10Dzahn: zuul::executor: hiera'ize docker image version and use in template [puppet] - 10https://gerrit.wikimedia.org/r/1188844 (https://phabricator.wikimedia.org/T403847) [17:58:18] * swfrench-wmf thumbs up [17:58:59] !log cdanis@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM registry1004.eqiad.wmnet [18:00:05] jeena and dduvall: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T1800). [18:00:22] jeena: We're in a weird situation right now. urbanecm has gone missing in the middle of his backport. [18:02:35] 2025-09-16T18:02:10.060444+00:00 registry1004 puppet-agent[1385]: (Mount[/var/lib/nginx](provider=parsed)) Remounting [18:02:37] neat [18:02:48] Very fancy [18:03:08] neat indeed! 
[18:03:18] !log cdanis@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1004.eqiad.wmnet [18:03:30] !log cdanis@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM registry1005.eqiad.wmnet [18:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:04:46] for codfw i'm just gonna do it in one reboot [18:05:12] solid [18:05:54] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1018.eqiad.wmnet with OS bookworm [18:08:01] !log cdanis@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry1005.eqiad.wmnet [18:08:17] dancy: okay, I will hold off [18:08:56] !log T404742 💙cdanis@ganeti2032.codfw.wmnet ~ 🕑☕ sudo gnt-instance modify -B memory=16g registry2004.codfw.wmnet [18:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:00] T404742: Increase RAM and nginx tmpfs on docker registry hosts - https://phabricator.wikimedia.org/T404742 [18:09:10] !log cdanis@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM registry2004.codfw.wmnet [18:09:19] jeena: urbanecm has been offline for about 1.5 hours. [18:09:25] oh well hmm [18:09:31] should we cancel it then? [18:09:51] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1188844/6967/zuul1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1188844 (https://phabricator.wikimedia.org/T403847) (owner: 10Dzahn) [18:10:19] My gut says yes.. but!! the change he was backporting caused l10n rebuild, so the revert will cause another full image build. It'll take a long time.
[18:10:38] But I don't see an alternative at this time unless someone can step up to test https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1188767 [18:11:46] I tried to reach them on Slack, but they are offline [18:13:31] !log T404742 💙cdanis@ganeti2032.codfw.wmnet ~ 🕑☕ sudo gnt-instance modify -B memory=16g registry2005.codfw.wmnet [18:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:41] !log cdanis@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry2004.codfw.wmnet [18:14:15] !log cdanis@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM registry2005.codfw.wmnet [18:14:43] jeena: Suggestion: Start the revert of urbanecm's change now. If they show up before it gets deployed, it can be cancelled. [18:14:45] okay i'm gonna cancel and revert [18:14:52] Synchronized [18:14:59] :P [18:15:10] FWIW, if 1188767 is going to be reverted, we should wait until c.danis is done before starting the backport of the revert patch. [18:15:40] 100% The CI side of the revert will take a while before that becomes a concern. [18:16:13] ah, good idea getting a head-start on that part [18:16:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:16:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:16:49] !log urbanecm@deploy1003 Sync cancelled. [18:17:20] !log cdanis@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry2005.codfw.wmnet [18:17:21] swfrench-wmf: ok, all done [18:17:29] dancy: hey, I'm here now. So sorry, due to the long duration I got distracted [18:17:33] cdanis: nice! 
[18:17:39] (03PS1) 10Jeena Huneidi: Revert "feat: Allow communities to opt out experienced users from mentorship" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188856 [18:17:53] The change should be a no-op in the current state, but I'm happy to test if I'm in a good place to do that [18:18:08] urbanecm: ++ [18:18:15] jeena: ^ fyi [18:18:18] oh cool, I just clicked cancel but since it's merged I think we can just re-run the scap backport [18:18:41] Did the build finish? It touches i18n, so it takes a long way [18:18:53] (03Abandoned) 10Jeena Huneidi: Revert "feat: Allow communities to opt out experienced users from mentorship" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188856 (owner: 10Jeena Huneidi) [18:18:53] image build finished from last time [18:18:59] urbanecm: yeah, it made it to testservers [18:19:08] Okay, let me test real quick [18:19:15] And cdanis made some changes to prevent that problem that you hit. [18:21:43] the change does what it is supposed to do [18:21:50] jeena: dancy: ^ [18:22:17] I think the suggested course of action is to scap the same backport again urbanecm [18:22:33] wouldn't that take another 40 mins though? [18:22:45] I don't think so? 
[18:22:57] It won't [18:23:07] Full images were successfully built on the prior cancelled run [18:23:10] okay, scapping again then [18:23:22] 👍 [18:23:23] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]] [18:23:28] T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563 [18:24:36] it's at K8s images build/push output now [18:24:41] and syncing [18:25:27] 18:24:37 [root] Image builds completed [18:26:15] can verify that the codfw registries are serving a lot of http 200/206 for mediawiki blobs [18:28:56] this deployment will be a bit slow, since little or layer data is cached already, but seems to be working as expected :) [18:29:04] *little or no [18:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:31:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:32:32] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1018.eqiad.wmnet with reason: host reimage [18:36:39] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:36:39] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:36:40] FIRING:
SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:38:57] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1018.eqiad.wmnet with reason: host reimage [18:44:16] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188767|feat: Allow communities to opt out experienced users from mentorship (T403563)]] (duration: 20m 52s) [18:44:20] T403563: Do not automatically enroll experienced editors into Mentorship when they visit the Homepage - https://phabricator.wikimedia.org/T403563 [18:44:37] Finally [18:48:05] thanks! [18:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:52:25] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188861 (https://phabricator.wikimedia.org/T396380) [18:52:27] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188861 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [18:54:16] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188861 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [18:54:43] !log jhuneidi@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.19 refs T396380 [18:54:47] T396380: 1.45.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T396380 [18:59:31] (03PS1) 10Dwisehaupt: crm: Update 
civicrm settings template for v6.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/1188863 (https://phabricator.wikimedia.org/T404757) [19:01:53] (03PS1) 10Andrew Bogott: cloudcephosd1018 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188864 [19:02:23] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1018 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1188864 (owner: 10Andrew Bogott) [19:04:17] jeena: if you end up having any unused time at the end of the deployment window, I'll borrow it for some unrelated service bumps, let me know :) no worries if you end up needing it all [19:04:40] just pouring envoy upgrades in to fill up any cracks in the calendar [19:05:11] rzl: I will let you know when it finishes! [19:05:43] cheers [19:07:21] rzl: FYI, I have a shellbox change that's half deployed (paused during the registry work earlier). if could borrow 2m of our borrowed time to wrap that up, that would be splendid :) [19:07:30] *of your [19:07:47] swfrench-wmf: oh yeah of course [19:09:24] thank you :) [19:10:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1018.eqiad.wmnet with OS bookworm [19:17:39] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:17:39] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:18:59] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:19:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) (owner: 10Ebernhardson) [19:21:41] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:30:42] I was briefly confused as to why group0 would have resulted in l10n updates, and thus full image builds ... but now I see this is `testwikis to 1.45.0-wmf.19` - i.e., what usually(?) would have been done early UTC morning by presync IIUC [19:31:17] oh hmm yeah I should have noticed that earlier [19:35:13] in any case, almost done :) [19:35:22] and group0 should be quick after that [19:35:28] yeah [19:36:00] 10ops-codfw, 06SRE, 06DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526#11186714 (10Jhancock.wm) heads up. we **//finally//** got the server to where it boots. but we lost everything that was on the drives in the process. I'm gonna reimage the host if that's okay with you. I'm pr... 
[19:36:47] !log jhuneidi@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.19 refs T396380 (duration: 42m 03s) [19:36:51] T396380: 1.45.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T396380 [19:37:03] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:37:14] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188872 (https://phabricator.wikimedia.org/T396380) [19:37:16] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188872 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [19:38:07] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188872 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [19:41:02] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:43:39] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:43:59] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:47:05] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:48:14] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:48:34] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker2003.mgmt.codfw.wmnet 
with chassis set policy FORCE_RESTART [19:51:44] ops-eqiad, SRE, DC-Ops: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#11186840 (VRiley-WMF) Open→Resolved a:VRiley-WMF Spoke to @cmooney and these switches have been set to offline. Closing this ticket [19:52:47] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.19 refs T396380 [19:52:51] T396380: 1.45.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T396380 [19:57:16] swfrench-wmf: rzl finished [19:57:28] thanks, jeena! [19:57:54] thanks! all you, Scott [19:57:54] SRE-Access-Requests, Phabricator: Phabricator admin access for new team members - https://phabricator.wikimedia.org/T404768 (BTracy-WMF) NEW [19:58:02] rzl: if you have anything you'd like to sneak in before backports start, feel free to do so concurrently [19:58:14] nah, I don't want to rush it, I'll go after backports [19:58:36] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [19:59:27] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T2000). [20:00:05] Superpes and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:12] !log migrated shellbox (score) to PHP 8.3 - T403284 [20:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:16] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [20:00:17] Hi :) [20:00:36] (CR) Eamedina: [C:+1] SpecialContribute: configure new page target [mediawiki-config] - https://gerrit.wikimedia.org/r/1188799 (https://phabricator.wikimedia.org/T327063) (owner: Sbisson) [20:03:20] Hi Superpes, do you need a deployer? [20:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:04:30] Hi jeena yep thanks! This is a simple patch that doesn't even require testing :) [20:04:52] 👍 [20:05:41] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1188875 [20:07:30] (CR) TrainBranchBot: [C:+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) (owner: Superpes15) [20:08:45] (Merged) jenkins-bot: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) (owner: Superpes15) [20:09:21] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1188493|Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 (T404592)]] [20:09:25] T404592: Lift IP cap on this date 2025-09-26 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T404592 [20:12:02] SRE-Access-Requests, Phabricator: Phabricator admin access for new 
team members - https://phabricator.wikimedia.org/T404768#11186979 (Aklapper) Open→Declined Hi, this does not require admin access. Please see https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects and bring... [20:16:01] !log jhuneidi@deploy1003 jhuneidi, superpes: Backport for [[gerrit:1188493|Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 (T404592)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:16:06] T404592: Lift IP cap on this date 2025-09-26 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T404592 [20:16:27] !log jhuneidi@deploy1003 jhuneidi, superpes: Continuing with sync [20:21:44] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:21:48] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188493|Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 (T404592)]] (duration: 12m 27s) [20:21:53] T404592: Lift IP cap on this date 2025-09-26 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T404592 [20:22:44] Many thanks for your assistance [20:22:46] :) [20:26:02] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:27:02] can I deploy something now? [20:30:58] kostajh: I guess so. ebernhardson are you here for your backport? [20:31:34] Superpes: you're welcome! 
[20:32:33] jeena: I'll go ahead, if that's ok [20:32:45] yeah I think it's okay to do so [20:34:14] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188823 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan) [20:34:15] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1188824 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan) [20:36:03] (Merged) jenkins-bot: hCaptcha: Enable version pinning and subresource integrity [extensions/ConfirmEdit] (wmf/1.45.0-wmf.18) - https://gerrit.wikimedia.org/r/1188823 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan) [20:36:04] (Merged) jenkins-bot: hCaptcha: Enable version pinning and subresource integrity [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1188824 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan) [20:36:32] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1188823|hCaptcha: Enable version pinning and subresource integrity (T404251)]], [[gerrit:1188824|hCaptcha: Enable version pinning and subresource integrity (T404251)]] [20:36:36] T404251: hCaptcha: Enable version pinning and subresource integrity - https://phabricator.wikimedia.org/T404251 [20:42:29] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1188823|hCaptcha: Enable version pinning and subresource integrity (T404251)]], [[gerrit:1188824|hCaptcha: Enable version pinning and subresource integrity (T404251)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:42:34] T404251: hCaptcha: Enable version pinning and subresource integrity - https://phabricator.wikimedia.org/T404251 [20:43:51] !log kharlan@deploy1003 kharlan: Continuing with sync [20:49:03] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188823|hCaptcha: Enable version pinning and subresource integrity (T404251)]], [[gerrit:1188824|hCaptcha: Enable version pinning and subresource integrity (T404251)]] (duration: 12m 31s) [20:49:08] T404251: hCaptcha: Enable version pinning and subresource integrity - https://phabricator.wikimedia.org/T404251 [20:49:16] I have one more to go [20:49:30] np [20:50:29] (PS5) Kosta Harlan: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) [20:50:35] (CR) TrainBranchBot: [C:+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan) [20:50:48] (CR) TrainBranchBot: "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan) [20:52:28] (Merged) jenkins-bot: hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version [mediawiki-config] - https://gerrit.wikimedia.org/r/1187079 (https://phabricator.wikimedia.org/T404251) (owner: Kosta Harlan) [20:52:52] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1187079|hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version (T404251)]] [20:58:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:58:45] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1187079|hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version (T404251)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:58:52] T404251: hCaptcha: Enable version pinning and subresource integrity - https://phabricator.wikimedia.org/T404251 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T2100) [21:00:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:01:14] !log kharlan@deploy1003 kharlan: Continuing with sync [21:05:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:06:35] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1187079|hCaptcha: Set wgHCaptchaApiUrlIntegrityHash and pin secure-api.js version (T404251)]] (duration: 13m 42s) [21:06:40] T404251: hCaptcha: Enable version pinning and subresource integrity - https://phabricator.wikimedia.org/T404251 [21:07:00] ok, all done. 
[21:07:28] going ahead, if the web team window isn't in use [21:14:49] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [21:15:11] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [21:15:58] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [21:16:18] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [21:17:01] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [21:17:31] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [21:17:47] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11187380 (vaughnwalters) >>! In T403638#11178012, @vaughnwalters wrote: > I made a note of this when testing the campaign events extension T404244#11177... [21:18:21] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [21:19:04] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [21:19:45] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [21:20:14] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [21:21:07] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [21:21:35] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [21:21:55] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:21:58] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:22:59] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:23:05] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:24:07] FIRING: HelmReleaseBadStatus: Helm release 
airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:24:08] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [21:24:33] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [21:24:40] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [21:25:06] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [21:25:27] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [21:25:58] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [21:26:13] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [21:26:38] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [21:27:52] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [21:28:03] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [21:28:21] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/data-gateway: apply [21:28:37] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [21:29:47] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [21:30:02] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [21:31:01] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [21:31:15] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [21:31:29] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/echostore: apply [21:32:34] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [21:32:44] 
!log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: apply [21:33:51] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: apply [21:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:34:14] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [21:34:49] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [21:35:18] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [21:35:45] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [21:35:59] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [21:36:21] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [21:36:33] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [21:36:59] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [21:37:15] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [21:37:28] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [21:37:41] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [21:38:19] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [21:38:31] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [21:38:42] !log rzl@deploy1003 
helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [21:39:05] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [21:39:19] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [21:39:55] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [21:40:26] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [21:40:44] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [21:41:12] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [21:41:29] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [21:42:34] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [21:42:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:43:32] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [21:43:53] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [21:44:04] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/ipoid: apply [21:44:24] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [21:44:50] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [21:45:53] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [21:46:08] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: apply [21:50:24] hmmmmmm. 
tap tap tap [21:50:52] (PS1) Catrope: labs only: Enable multiple 2FA modules and new 2FA UI in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1188890 (https://phabricator.wikimedia.org/T404029) [21:56:23] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply [21:58:57] (that was a helm timeout, but looks like a clean rollback, proceeding and I'll come back around to take another look) [21:59:55] no smoking gun but there are some readiness check failures and some cpu quota FailedCreates in the logs, plenty to look at [22:00:04] (PS1) Zabe: Initial configuration for mswikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1188892 (https://phabricator.wikimedia.org/T404698) [22:07:56] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [22:08:39] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [22:08:53] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [22:14:03] (PS1) Zabe: Initial configuration for thwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1188894 (https://phabricator.wikimedia.org/T400001) [22:15:03] (PS14) Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [22:15:18] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [22:15:55] that one succeeded, just a slow startup [22:16:03] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [22:21:37] (CR) CI reject: [V:-1] Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper) [22:21:51] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [22:23:25] !log rzl@deploy1003 
helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:25:58] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:26:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s4 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83380 and previous config saved to /var/cache/conftool/dbconfig/20250916-222612-ladsgroup.json [22:26:17] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [22:27:02] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:47] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [22:30:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db1206 in general group (T403966)', diff saved to https://phabricator.wikimedia.org/P83381 and previous config saved to /var/cache/conftool/dbconfig/20250916-223019-ladsgroup.json [22:31:32] !log rzl@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [22:31:59] !log rzl@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [22:32:31] !log rzl@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [22:33:04] !log rzl@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [22:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:42:54] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [22:43:37] !log 
rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [22:43:53] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [22:44:35] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [22:45:55] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [22:45:58] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [22:46:07] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [22:46:12] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [22:47:28] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [22:48:35] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [23:02:32] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [23:03:11] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [23:03:22] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/push-notifications: apply [23:03:53] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [23:04:55] (PS1) Cwhite: opensearch: enable setting cluster recovery rate [puppet] - https://gerrit.wikimedia.org/r/1188900 [23:04:56] (PS1) Cwhite: logstash: set recovery rate to 800mb [puppet] - https://gerrit.wikimedia.org/r/1188901 [23:05:36] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [23:05:40] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [23:10:43] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [23:10:57] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [23:11:04] !log rzl@deploy1003 helmfile [codfw] START 
helmfile.d/services/sessionstore: apply [23:11:19] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [23:17:47] (PS15) Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [23:18:59] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:21:41] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:25:36] (CR) CI reject: [V:-1] Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper) [23:26:48] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [23:26:51] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [23:27:07] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [23:27:13] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [23:27:47] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [23:28:10] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [23:28:41] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [23:29:11] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [23:31:15] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/termbox: apply [23:32:02] 
!log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [23:32:49] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: apply [23:33:00] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [23:34:10] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/toolhub: apply [23:34:45] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [23:35:34] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [23:35:50] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [23:37:12] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [23:37:27] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [23:38:23] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [23:38:43] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [23:40:08] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply [23:40:27] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply [23:41:49] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: apply [23:42:51] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply [23:43:09] worked great the second time 🤷 [23:43:59] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:48:08] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1002.eqiad.wmnet with reason: WIP [23:55:51] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2002.codfw.wmnet with reason: WIP