[00:00:08] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552156 (ssingh) Thanks for reporting. Can you please try again from your home connection (without VPN) and let us know if that works? We depooled...
[00:00:24] RECOVERY - Host cp3066 is UP: PING OK - Packet loss = 0%, RTA = 80.96 ms
[00:00:24] RECOVERY - Host cp3070 is UP: PING OK - Packet loss = 0%, RTA = 80.93 ms
[00:00:24] RECOVERY - Host cp3068 is UP: PING OK - Packet loss = 0%, RTA = 80.98 ms
[00:00:26] RECOVERY - Host cp3079 is UP: PING OK - Packet loss = 0%, RTA = 80.95 ms
[00:00:26] RECOVERY - Host cp3071 is UP: PING OK - Packet loss = 0%, RTA = 80.93 ms
[00:00:26] RECOVERY - Host cp3074 is UP: PING OK - Packet loss = 0%, RTA = 81.01 ms
[00:00:26] RECOVERY - Host cp3078 is UP: PING OK - Packet loss = 0%, RTA = 80.91 ms
[00:00:26] RECOVERY - Host cp3081 is UP: PING OK - Packet loss = 0%, RTA = 80.92 ms
[00:00:26] RECOVERY - Host cp3076 is UP: PING OK - Packet loss = 0%, RTA = 80.85 ms
[00:00:27] RECOVERY - Host cp3075 is UP: PING OK - Packet loss = 0%, RTA = 80.91 ms
[00:00:27] RECOVERY - Host cp3072 is UP: PING OK - Packet loss = 0%, RTA = 80.89 ms
[00:00:28] RECOVERY - Host cp3067 is UP: PING OK - Packet loss = 0%, RTA = 82.67 ms
[00:00:28] RECOVERY - Host cp3077 is UP: PING OK - Packet loss = 0%, RTA = 80.93 ms
[00:00:29] RECOVERY - Host cp3069 is UP: PING OK - Packet loss = 0%, RTA = 80.90 ms
[00:00:29] RECOVERY - Host cp3073 is UP: PING OK - Packet loss = 0%, RTA = 82.67 ms
[00:00:30] RECOVERY - Host cp3080 is UP: PING OK - Packet loss = 0%, RTA = 82.59 ms
[00:00:34] huh
[00:00:38] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:00:40] RECOVERY - Host lvs3008 is UP: PING OK - Packet loss = 0%, RTA = 80.81 ms
[00:00:40] RECOVERY - Host lvs3010 is UP: PING OK - Packet loss = 0%, RTA = 80.75 ms
[00:00:40] RECOVERY - Host ganeti3005 is UP: PING OK - Packet loss = 0%, RTA = 80.90 ms
[00:00:40] RECOVERY - Host lvs3009 is UP: PING OK - Packet loss = 0%, RTA = 81.97 ms
[00:00:40] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 80.92 ms
[00:00:41] RECOVERY - Host ganeti3008 is UP: PING OK - Packet loss = 0%, RTA = 82.58 ms
[00:00:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=esams - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:01:00] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:02] RECOVERY - Host ganeti3007 is UP: PING OK - Packet loss = 0%, RTA = 80.94 ms
[00:01:04] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:04] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:14] Clearing temporarily for me..
[00:01:18] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:28] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552158 (AlexisJazz) Works now, thanks!
[00:02:10] FIRING: [6x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:02:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-drmrs (185.15.58.146) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:05:28] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:07:10] RESOLVED: [6x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:07:36] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552161 (AlexisJazz) {F71607077} There was a visible dip in editing and surge in error responses.
[00:07:39] RESOLVED: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-drmrs (185.15.58.146) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:13:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: Stang)
[00:19:37] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site esams [reason: repooling esams; link issues resolved, T415473]
[00:19:42] T415473: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473
[00:19:52] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: repooling esams; link issues resolved, T415473]
[00:24:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:34:23] !log remove static routes for esams ranges on cr1-eqiad
[00:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1232830
[00:40:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1232830 (owner: TrainBranchBot)
[00:50:32] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:54:38] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1232830 (owner: TrainBranchBot)
[00:55:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:41] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:05] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1232841
[01:10:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1232841 (owner: TrainBranchBot)
[01:13:24] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 43s)
[01:19:27] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552172 (ssingh) We had a transient link failure between eqiad and esams that resulted in this issue. It should be resolved now, and esams is pool...
[01:26:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:38] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1232841 (owner: TrainBranchBot)
[01:33:42] ops-esams, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: ESAMS 502 broken pipe connection issues - https://phabricator.wikimedia.org/T415473#11552173 (Peachey88)
[02:15:05] (CR) Brouberol: [C:+1] "I wonder if we should instead..." [puppet] - https://gerrit.wikimedia.org/r/1230547 (owner: Ladsgroup)
[02:20:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:20:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:39:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:55:32] FIRING: [20x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:57:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:02:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[05:09:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:26:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:10] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:17] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:02:16] PROBLEM - Ensure acme-chief-api is running on acmechief2002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief
[06:02:16] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2002 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[06:03:16] RECOVERY - Ensure acme-chief-api is running on acmechief2002 is OK: PROCS OK: 1 process with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief
[06:03:16] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2002 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[06:47:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:02:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T0800).
[08:00:05] samwilson, koi, and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:02:17] (Abandoned) Aqu: Allow connections to eventgates from Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1228599 (https://phabricator.wikimedia.org/T411989) (owner: Aqu)
[08:03:46] (CR) TrainBranchBot: [C:+2] "Approved by samwilson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1229946 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[08:04:09] o/
[08:04:52] (Merged) jenkins-bot: Enable watchlist labels on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1229946 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[08:05:40] !log samwilson@deploy2002 Started scap sync-world: Backport for [[gerrit:1229946|Enable watchlist labels on testwiki (T413967)]]
[08:05:45] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:10:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:15:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:28:44] !log samwilson@deploy2002 samwilson: Backport for [[gerrit:1229946|Enable watchlist labels on testwiki (T413967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:28:50] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:30:12] !log samwilson@deploy2002 samwilson: Continuing with sync
[08:32:49] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1232979
[08:43:19] !log samwilson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1229946|Enable watchlist labels on testwiki (T413967)]] (duration: 37m 38s)
[08:43:23] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:45:07] (PS1) Marostegui: db1264: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1233097 (https://phabricator.wikimedia.org/T415358)
[08:47:03] (CR) Marostegui: [C:+2] db1264: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1233097 (https://phabricator.wikimedia.org/T415358) (owner: Marostegui)
[08:48:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1264 T415358', diff saved to https://phabricator.wikimedia.org/P87925 and previous config saved to /var/cache/conftool/dbconfig/20260126-084852-marostegui.json
[08:48:57] T415358: Migrate 1P db* to Debian Trixie - https://phabricator.wikimedia.org/T415358
[08:49:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1264.eqiad.wmnet with reason: reimage to Trixie
[08:51:24] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1264.eqiad.wmnet with OS trixie
[09:02:15] (CR) Joal: "I think this patch can be abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1229524" [deployment-charts] - https://gerrit.wikimedia.org/r/1228599 (https://phabricator.wikimedia.org/T411989) (owner: Aqu)
[09:03:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1264.eqiad.wmnet with reason: host reimage
[09:08:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1264.eqiad.wmnet with reason: host reimage
[09:18:48] (PS1) Marostegui: Revert "db1264: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1233111
[09:25:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1224.eqiad.wmnet onto db1264.eqiad.wmnet
[09:25:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1224 - Depool db1224.eqiad.wmnet to then clone it to db1264.eqiad.wmnet - marostegui@cumin1003
[09:26:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1224 - Depool db1224.eqiad.wmnet to then clone it to db1264.eqiad.wmnet - marostegui@cumin1003
[09:26:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:27:37] (CR) Marostegui: [C:-2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/1233111 (owner: Marostegui)
[09:28:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1264.eqiad.wmnet with OS trixie
[09:28:21] (PS1) Marostegui: installserver: Do not format /srv on db1264 [puppet] - https://gerrit.wikimedia.org/r/1233114
[09:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:24] (CR) Marostegui: [C:+2] installserver: Do not format /srv on db1264 [puppet] - https://gerrit.wikimedia.org/r/1233114 (owner: Marostegui)
[09:35:05] (Abandoned) Thiemo Kreuz (WMDE): [beta] Start using Cite's Community Configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1133119 (https://phabricator.wikimedia.org/T385597) (owner: Thiemo Kreuz (WMDE))
[09:39:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:40:32] FIRING: [22x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:43:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2198.codfw.wmnet with reason: schema change
[09:55:36] (PS1) Sergio Gimeno: fix: avoid logging traffic from overridden experiment users [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233116 (https://phabricator.wikimedia.org/T415294)
[09:55:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233116 (https://phabricator.wikimedia.org/T415294) (owner: Sergio Gimeno)
[10:05:30] (CR) BCornwall: [C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1231057 (owner: Ncmonitor)
[10:06:13] (PS1) Sergio Gimeno: fix: avoid displaying incorrect additional userpage link [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233117 (https://phabricator.wikimedia.org/T415291)
[10:06:15] !log brett@dns1006 START - running authdns-update
[10:06:24] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233117 (https://phabricator.wikimedia.org/T415291) (owner: Sergio Gimeno)
[10:07:51] !log brett@dns1006 END - running authdns-update
[10:08:43] (CR) BCornwall: [C:+1] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1229527 (https://phabricator.wikimedia.org/T415171) (owner: Gerrit maintenance bot)
[10:12:07] (CR) Brouberol: [C:+2] Update dse-k8s-eqiad airflow values [deployment-charts] - https://gerrit.wikimedia.org/r/1229524 (https://phabricator.wikimedia.org/T411989) (owner: Joal)
[10:14:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[10:15:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[10:16:19] (PS2) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1231058 (owner: Ncmonitor)
[10:18:40] (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1231058 (owner: Ncmonitor)
[10:18:42] (CR) BCornwall: [C:+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1231058 (owner: Ncmonitor)
[10:22:00] (PS2) Ladsgroup: kerberos: Add a space after period in MOTD [puppet] - https://gerrit.wikimedia.org/r/1230547
[10:22:09] (PS3) Ladsgroup: kerberos: Add a space after period in MOTD [puppet] - https://gerrit.wikimedia.org/r/1230547
[10:22:12] (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1231059 (owner: Ncmonitor)
[10:22:21] (CR) Ladsgroup: [V:+2 C:+2] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1230547 (owner: Ladsgroup)
[10:25:13] (PS1) Kevin Bazira: ml: fix torch dependencies in vLLM image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1233119 (https://phabricator.wikimedia.org/T385173)
[10:27:00] (CR) Brouberol: "This is doing more than provisioning the namespaces: it's also deploying the clusters themselves, in both DCs." [deployment-charts] - https://gerrit.wikimedia.org/r/1230512 (https://phabricator.wikimedia.org/T414702) (owner: Ryan Kemper)
[10:27:37] (PS1) Aqu: Airflow devenv: Add eventgates into Envoy [deployment-charts] - https://gerrit.wikimedia.org/r/1233120 (https://phabricator.wikimedia.org/T411989)
[10:32:22] (CR) Joal: [C:+1] Airflow devenv: Add eventgates into Envoy [deployment-charts] - https://gerrit.wikimedia.org/r/1233120 (https://phabricator.wikimedia.org/T411989) (owner: Aqu)
[10:33:16] (CR) Michael Große: [C:+1] fix: avoid logging traffic from overridden experiment users [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - https://gerrit.wikimedia.org/r/1233116 (https://phabricator.wikimedia.org/T415294) (owner: Sergio Gimeno)
[10:34:32] (PS3) Daniel Kinzler: rest route: support multiple rate limit policies at once [deployment-charts] - https://gerrit.wikimedia.org/r/1228218
[10:37:04] (CR) Brouberol: [C:+2] Airflow devenv: Add eventgates into Envoy [deployment-charts] - https://gerrit.wikimedia.org/r/1233120 (https://phabricator.wikimedia.org/T411989) (owner: Aqu)
[10:42:22] SRE, Traffic, Documentation: Documentation error about TLS 1.2 on Wikimedia DNS DoH on metawiki - https://phabricator.wikimedia.org/T415449#11553069 (ssingh) Yes, thanks for doing that @Naruse_shiroha. That was a typo, and well, a big one at that. For DoH (and DoT before that), the rationale for...
[10:45:42] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:46:16] ^^ expected, working on it
[10:47:42] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:48:17] (PS1) Giuseppe Lavagetto: Code changes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1233124
[10:48:40] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Code changes [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1233124 (owner: Giuseppe Lavagetto)
[10:49:22] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix bugs with commit history - oblivian@cumin1003"
[10:49:24] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix bugs with commit history - oblivian@cumin1003
[10:50:12] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix bugs with commit history - oblivian@cumin1003
[10:50:13] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix bugs with commit history - oblivian@cumin1003"
[10:50:13] (PS1) Bartosz Wójtowicz: ml-services: Update image for Article Topic model. [deployment-charts] - https://gerrit.wikimedia.org/r/1233125 (https://phabricator.wikimedia.org/T414573)
[10:50:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:51:42] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:54:38] SRE, Traffic, Documentation: Documentation error about TLS 1.2 on Wikimedia DNS DoH on metawiki - https://phabricator.wikimedia.org/T415449#11553163 (Cuthead) > no DoH client in theory should not support anything other than 1.3 Actually yes, there's one as I've mentioned in the main post: [OkHtt...
[10:57:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1100)
[11:00:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:03:06] (CR) AikoChou: [C:+1] ml-services: Update image for Article Topic model. [deployment-charts] - https://gerrit.wikimedia.org/r/1233125 (https://phabricator.wikimedia.org/T414573) (owner: Bartosz Wójtowicz)
[11:03:52] (PS1) Zabe: Start reading from il_target_id from all medium wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1233127 (https://phabricator.wikimedia.org/T413669)
[11:07:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:09:42] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:11:09] (CR) Bartosz Wójtowicz: [C:+2] ml-services: Update image for Article Topic model. [deployment-charts] - https://gerrit.wikimedia.org/r/1233125 (https://phabricator.wikimedia.org/T414573) (owner: Bartosz Wójtowicz)
[11:13:16] (Merged) jenkins-bot: ml-services: Update image for Article Topic model. [deployment-charts] - https://gerrit.wikimedia.org/r/1233125 (https://phabricator.wikimedia.org/T414573) (owner: Bartosz Wójtowicz)
[11:17:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:17:42] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:20:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:21:42] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:26:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:26:42] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:27:16] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[11:28:46] (CR) Ssingh: "Sorry, we obviously missed this. I will take care of it today." [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) (owner: Krinkle)
[11:29:49] (CR) Ssingh: "[Adding Brett]" [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) (owner: Krinkle)
[11:31:41] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[11:32:42] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[11:33:50] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11553222 (DSantamaria) Approved!
[11:45:42] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:46:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:02:10] (03PS4) 10Daniel Kinzler: rest route: support multiple rate limit policies at once [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) [12:02:24] (03CR) 10CI reject: [V:04-1] rest route: support multiple rate limit policies at once [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [12:02:30] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7947/console" [puppet] - 10https://gerrit.wikimedia.org/r/1214155 (owner: 10Krinkle) [12:03:40] (03CR) 10BCornwall: [C:03+1] varnish: De-duplicate mediawiki::errorpage options and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/1214155 (owner: 10Krinkle) [12:06:00] (03PS1) 10Daniel Kinzler: api-gateway: Re-apply "Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233154 [12:06:40] (03Abandoned) 10Daniel Kinzler: WIP: restbase: Handle JWT passsed in cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211703 (owner: 10Pmiazga) [12:07:16] (03PS1) 10Daniel Kinzler: rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 [12:07:27] (03CR) 10CI reject: [V:04-1] rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 (owner: 10Daniel Kinzler) [12:07:35] (03PS2) 10Daniel Kinzler: rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1233155 [12:09:41] (03PS2) 10Daniel Kinzler: api-gateway: Re-apply "Rest-gateway Read `ratelimit_class` and `user_id` from JWT" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233154 (https://phabricator.wikimedia.org/T405578) [12:09:46] (03CR) 10BCornwall: [V:03+1 C:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) (owner: 10Krinkle) [12:09:47] (03PS3) 10Daniel Kinzler: rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 (https://phabricator.wikimedia.org/T405578) [12:09:58] (03PS4) 10Daniel Kinzler: rest-gateway: re.apply "add support for sessionJwt cookies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233155 (https://phabricator.wikimedia.org/T405578) [12:16:12] jouncebot: nowandnext [12:16:12] No deployments scheduled for the next 1 hour(s) and 43 minute(s) [12:16:12] In 1 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1400) [12:16:28] (03PS1) 10Urbanecm: Add messages for Jju Wikipedia (kajwiki) [extensions/WikimediaMessages] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1233158 (https://phabricator.wikimedia.org/T413283) [12:16:42] zabe: wanna deploy sth? [12:16:54] yes [12:17:11] i guess your's shorter than "i18n backport", so go ahead? 
[12:17:29] (03CR) 10Zabe: [C:03+2] Start reading from il_target_id from all medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233127 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [12:18:30] (03Merged) 10jenkins-bot: Start reading from il_target_id from all medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233127 (https://phabricator.wikimedia.org/T413669) (owner: 10Zabe) [12:18:56] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1233127|Start reading from il_target_id from all medium wikis (T413669)]] [12:19:00] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669 [12:19:57] yup, you can probably hit +2 on yours and I will be done until its merged [12:20:29] (03CR) 10Urbanecm: [C:03+2] Add messages for Jju Wikipedia (kajwiki) [extensions/WikimediaMessages] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1233158 (https://phabricator.wikimedia.org/T413283) (owner: 10Urbanecm) [12:20:35] done [12:21:31] (03CR) 10Michael Große: [C:03+1] fix: avoid displaying incorrect additional userpage link [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1233117 (https://phabricator.wikimedia.org/T415291) (owner: 10Sergio Gimeno) [12:23:06] !log zabe@deploy2002 zabe: Backport for [[gerrit:1233127|Start reading from il_target_id from all medium wikis (T413669)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:23:33] !log zabe@deploy2002 zabe: Continuing with sync [12:25:05] (03PS1) 10Daniel Kinzler: rediscope: lower cpu and memoy limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1233161 [12:27:48] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233127|Start reading from il_target_id from all medium wikis (T413669)]] (duration: 08m 52s) [12:27:53] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669 [12:30:53] (03PS1) 10Zabe: Start reading from il_target_id everywhere besides enwiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233171 (https://phabricator.wikimedia.org/T413669) [12:33:07] (03Merged) 10jenkins-bot: Add messages for Jju Wikipedia (kajwiki) [extensions/WikimediaMessages] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1233158 (https://phabricator.wikimedia.org/T413283) (owner: 10Urbanecm) [12:36:01] zabe: can i start? [12:36:07] yes [12:36:45] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1233158|Add messages for Jju Wikipedia (kajwiki) (T413283)]] [12:36:50] T413283: Create Jju Wikipedia - https://phabricator.wikimedia.org/T413283 [13:00:14] (03CR) 10Dzahn: [C:03+2] pretrain: Run one hour later, at 02:00UTC [puppet] - 10https://gerrit.wikimedia.org/r/1230952 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [13:04:16] (03PS2) 10Jforrester: Defensively set Abstract Wikipedia feature flags to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225649 (https://phabricator.wikimedia.org/T411690) [13:04:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225649 (https://phabricator.wikimedia.org/T411690) (owner: 10Jforrester) [13:05:38] (03PS3) 10Jforrester: Defensively set Abstract Wikipedia feature flags to false 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225649 (https://phabricator.wikimedia.org/T411690) [13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:53] !log pre-train sync shifted to one hour later T398873 [13:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:57] T398873: Move nightly image build from releases-jenkins to deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T398873 [13:09:14] (03PS1) 10Gehel: admin(query-service): wdqs shell access for user lerickson [puppet] - 10https://gerrit.wikimedia.org/r/1233181 (https://phabricator.wikimedia.org/T415373) [13:09:58] (03CR) 10CI reject: [V:04-1] admin(query-service): wdqs shell access for user lerickson [puppet] - 10https://gerrit.wikimedia.org/r/1233181 (https://phabricator.wikimedia.org/T415373) (owner: 10Gehel) [13:13:05] 06SRE, 06Traffic, 07Documentation: Documentation error about TLS 1.2 on Wikimedia DNS DoH on metawiki - https://phabricator.wikimedia.org/T415449#11553611 (10Naruse_shiroha) > no DoH client in theory should not support anything other than 1.3 OkHttp DoH client may not support 1.3, too depending on the p... 
[13:19:11] FIRING: [2x] Temperature: Temp issue on wdqs1022:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1022 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [13:23:32] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233158|Add messages for Jju Wikipedia (kajwiki) (T413283)]] (duration: 46m 47s) [13:23:37] T413283: Create Jju Wikipedia - https://phabricator.wikimedia.org/T413283 [13:23:58] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [13:24:11] RESOLVED: [2x] Temperature: Temp issue on wdqs1022:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1022 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [13:25:03] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 04s) [13:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:56] (03PS2) 10Krinkle: varnish: De-duplicate mediawiki::errorpage options and add missing docs [puppet] - 10https://gerrit.wikimedia.org/r/1214155 [13:37:56] (03PS6) 10Krinkle: varnish: Move error message from footer to body for HTTP 4xx responses [puppet] - 10https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) [13:37:56] (03PS1) 10Krinkle: varnish: Restrict unauth sitemap access to verified crawlers (cat B) [puppet] - 10https://gerrit.wikimedia.org/r/1233188 (https://phabricator.wikimedia.org/T407122) [13:38:38] (03PS2) 10Krinkle: varnish: Restrict unauth sitemap access to verified 
crawlers (cat B) [puppet] - 10https://gerrit.wikimedia.org/r/1233188 (https://phabricator.wikimedia.org/T407122) [13:38:41] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1233188 (https://phabricator.wikimedia.org/T407122) (owner: 10Krinkle) [13:39:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:47] (03CR) 10Dzahn: [C:03+2] add abstract.wikipedia.org to section for wikis not covered by langlist [dns] - 10https://gerrit.wikimedia.org/r/1227706 (https://phabricator.wikimedia.org/T411724) (owner: 10Dzahn) [13:41:15] !log dzahn@dns1004 START - running authdns-update [13:42:34] !log dzahn@dns1004 END - running authdns-update [13:44:12] (03PS3) 10Krinkle: varnish: Restrict unauth sitemap access to verified crawlers (cat B) [puppet] - 10https://gerrit.wikimedia.org/r/1233188 (https://phabricator.wikimedia.org/T407122) [13:44:19] 06SRE, 10DNS, 06Traffic, 06Abstract Wikipedia team (26Q3 (Jan–Mar)), and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11553739 (10Dzahn) `abstract.wikipedia.org` has been added to DNS [13:44:22] !log DNS - added abstract.wikipedia.org [13:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:09] 06SRE, 10DNS, 06Traffic, 06Abstract Wikipedia team (26Q3 (Jan–Mar)), and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11553743 (10Jdforrester-WMF) 05Open→03Resolved a:03Dzahn Thank you! 
[13:46:17] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:41] (03PS4) 10Krinkle: varnish: Restrict unauth sitemap access to verified crawlers (cat B) [puppet] - 10https://gerrit.wikimedia.org/r/1233188 (https://phabricator.wikimedia.org/T407122) [13:54:50] (03PS5) 10Krinkle: varnish: Restrict unauth sitemap access to verified crawlers (cat B) [puppet] - 10https://gerrit.wikimedia.org/r/1233188 (https://phabricator.wikimedia.org/T407122) [13:55:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11553789 (10Marostegui) @RobH when do you want to do this? dbproxy1028 is now ACTIVE but dbproxy1029 is still a passive host. So dbpro... [13:57:18] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11553795 (10ssingh) Hi @RobH: thanks for the update and for pursuing this. And yeah, that works; waiting for the week after the offsite is also fine since it is just one host. Thanks! [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1400) [14:00:05] Seawolf35, Sergi0, and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] * James_F waves. 
[14:00:12] o/ [14:00:23] (03PS6) 10Krinkle: varnish: Restrict unauth sitemap access to verified crawlers (cat B) [puppet] - 10https://gerrit.wikimedia.org/r/1233188 (https://phabricator.wikimedia.org/T407122) [14:00:34] Here [14:00:53] o/ [14:00:58] I’ll need a deployer [14:01:02] I can deploy :) [14:02:11] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "The talk page part of the [on-wiki discussion](https://ro.wikipedia.org/wiki/Wikipedia:Cafenea#Noindex_la_paginile_din_spa%C8%9Biul_utiliz" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230498 (https://phabricator.wikimedia.org/T414992) (owner: 10Seawolf35gerrit) [14:02:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230498 (https://phabricator.wikimedia.org/T414992) (owner: 10Seawolf35gerrit) [14:02:50] we might see spiderpig job #1234 this window, exciting ^^ [14:03:29] (03Merged) 10jenkins-bot: rowiki: Set noindex for User: and User talk: Namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230498 (https://phabricator.wikimedia.org/T414992) (owner: 10Seawolf35gerrit) [14:03:46] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1230498|rowiki: Set noindex for User: and User talk: Namespaces (T414992)]] [14:03:51] T414992: Please noindex the user namespace on ro.wikipedia.org - https://phabricator.wikimedia.org/T414992 [14:03:51] Lucas_WMDE: Fancy. [14:07:45] !log lucaswerkmeister-wmde@deploy2002 seawolf35gerrit, lucaswerkmeister-wmde: Backport for [[gerrit:1230498|rowiki: Set noindex for User: and User talk: Namespaces (T414992)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:08:10] Not quite sure how to test an indexing change but I’ll make sure there are still user pages. 
[14:08:32] I think if you purge a user page there should be a difference in the HTML [14:10:08] yup, I purged https://ro.wikipedia.org/wiki/Utilizator:Rebel and the HTML changed from [14:10:08] to [14:10:08] [14:10:16] Success. [14:10:17] (along with other changes just due to the purge) [14:10:21] lgtm on my end, nothing seems broken [14:10:24] ok [14:10:29] !log lucaswerkmeister-wmde@deploy2002 seawolf35gerrit, lucaswerkmeister-wmde: Continuing with sync [14:10:30] thanks! [14:13:02] RESOLVED: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:13:22] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:57] (03CR) 10Dpogorzelski: [C:03+2] ml: fix torch dependencies in vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1233119 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [14:14:14] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:38] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [14:15:10] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:22] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230498|rowiki: Set 
noindex for User: and User talk: Namespaces (T414992)]] (duration: 12m 36s) [14:16:27] T414992: Please noindex the user namespace on ro.wikipedia.org - https://phabricator.wikimedia.org/T414992 [14:16:35] sergi0: do you want to deploy your changes yourself? [14:16:44] yeah, can do [14:16:56] ok, the spiderpig is yours :) [14:17:27] Lucas_WMDE ty! [14:19:06] np :) [14:19:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1233116 (https://phabricator.wikimedia.org/T415294) (owner: 10Sergio Gimeno) [14:19:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1233117 (https://phabricator.wikimedia.org/T415291) (owner: 10Sergio Gimeno) [14:30:42] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml: fix torch dependencies in vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1233119 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [14:32:05] (03Merged) 10jenkins-bot: fix: avoid logging traffic from overridden experiment users [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1233116 (https://phabricator.wikimedia.org/T415294) (owner: 10Sergio Gimeno) [14:32:06] (03Merged) 10jenkins-bot: fix: avoid displaying incorrect additional userpage link [extensions/GrowthExperiments] (wmf/1.46.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1233117 (https://phabricator.wikimedia.org/T415291) (owner: 10Sergio Gimeno) [14:32:27] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1233116|fix: avoid logging traffic from overridden experiment users (T415294)]], [[gerrit:1233117|fix: avoid displaying incorrect additional userpage link (T415291)]] [14:32:33] T415294: '.experiment.sampling_unit' should be equal to one of the allowed values - 
https://phabricator.wikimedia.org/T415294 [14:32:34] T415291: ⧼userpage⧽ shows up in vector sticky header user menu - https://phabricator.wikimedia.org/T415291 [14:33:42] (03PS3) 10Dpogorzelski: admin: add the ml-builder-docker group [puppet] - 10https://gerrit.wikimedia.org/r/1230280 [14:33:50] (03CR) 10Dpogorzelski: admin: add the ml-builder-docker group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1230280 (owner: 10Dpogorzelski) [14:34:34] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1233116|fix: avoid logging traffic from overridden experiment users (T415294)]], [[gerrit:1233117|fix: avoid displaying incorrect additional userpage link (T415291)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:34:48] testing now [14:35:18] !log sgimeno@deploy2002 sgimeno: Continuing with sync [14:37:46] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:39:21] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233116|fix: avoid logging traffic from overridden experiment users (T415294)]], [[gerrit:1233117|fix: avoid displaying incorrect additional userpage link (T415291)]] (duration: 06m 54s) [14:39:27] T415294: '.experiment.sampling_unit' should be equal to one of the allowed values - https://phabricator.wikimedia.org/T415294 [14:39:28] T415291: ⧼userpage⧽ shows up in vector sticky header user menu - https://phabricator.wikimedia.org/T415291 [14:40:21] James_F: are you next? Or Lucas_WMDE will you continue with the last one? [14:40:29] James_F is up next yeah [14:40:32] Cool. [14:40:34] I assume you’ll self deploy? [14:40:37] Mine is a no-op. [14:40:40] Can you deploy? [14:40:46] sure [14:40:51] Sorry, in meetings. (As always, the deploy windows at at the worst times.) 
[14:41:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225649 (https://phabricator.wikimedia.org/T411690) (owner: 10Jforrester) [14:42:08] (03Merged) 10jenkins-bot: Defensively set Abstract Wikipedia feature flags to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1225649 (https://phabricator.wikimedia.org/T411690) (owner: 10Jforrester) [14:42:10] <3 [14:42:18] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:42:18] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:42:27] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1225649|Defensively set Abstract Wikipedia feature flags to false (T411690 T411691)]] [14:42:34] T411690: Register the new content model, and set it for configured namespaces; 'Abstract:' => Z6091 by default - https://phabricator.wikimedia.org/T411690 [14:42:34] T411691: Add specific rights for creating and editing Abstract content, so permissions can be granted as needed - https://phabricator.wikimedia.org/T411691 [14:44:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, jforrester: Backport for [[gerrit:1225649|Defensively set Abstract Wikipedia feature flags to false (T411690 T411691)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:45:47] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, jforrester: Continuing with sync [14:47:16] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55567 bytes in 7.505 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.998 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:36] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:21] 06SRE, 06Traffic: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11553987 (10Peter17) Hello, I am encountering the same error with my bot, based on Pywikibot... I am still unable to fetch more than a few pages from Wikipedia. T... [14:49:52] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1225649|Defensively set Abstract Wikipedia feature flags to false (T411690 T411691)]] (duration: 07m 24s) [14:49:58] T411690: Register the new content model, and set it for configured namespaces; 'Abstract:' => Z6091 by default - https://phabricator.wikimedia.org/T411690 [14:49:58] T411691: Add specific rights for creating and editing Abstract content, so permissions can be granted as needed - https://phabricator.wikimedia.org/T411691 [14:50:12] !log UTC afternoon backport+config window done [14:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:10] Thanks Lucas_WMDE. 
[14:51:40] np [14:53:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87928 and previous config saved to /var/cache/conftool/dbconfig/20260126-145346-marostegui.json [14:53:54] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:53:54] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:03:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P87929 and previous config saved to /var/cache/conftool/dbconfig/20260126-150355-marostegui.json [15:09:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:59] (03PS1) 10Urbanecm: Prepare configuration for kajwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233204 (https://phabricator.wikimedia.org/T413283) [15:14:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P87930 and previous config saved to /var/cache/conftool/dbconfig/20260126-151403-marostegui.json [15:14:29] * urbanecm is going to be deploying [15:16:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233204 (https://phabricator.wikimedia.org/T413283) (owner: 10Urbanecm) [15:17:30] (03Merged) 10jenkins-bot: Prepare configuration for kajwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233204 (https://phabricator.wikimedia.org/T413283) (owner: 10Urbanecm) [15:17:50] !log urbanecm@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1233204|Prepare configuration for kajwiki (T413283)]] [15:17:56] T413283: Create Jju Wikipedia - https://phabricator.wikimedia.org/T413283 [15:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:19:45] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1233204|Prepare configuration for kajwiki (T413283)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:21:37] !log urbanecm@deploy2002 urbanecm: Continuing with sync [15:24:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87931 and previous config saved to /var/cache/conftool/dbconfig/20260126-152411-marostegui.json [15:24:18] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:24:18] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:24:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [15:24:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2181 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87932 and previous config saved to /var/cache/conftool/dbconfig/20260126-152436-marostegui.json [15:25:44] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233204|Prepare configuration for kajwiki (T413283)]] (duration: 07m 54s) [15:25:48] T413283: Create Jju Wikipedia - https://phabricator.wikimedia.org/T413283 [15:30:05] Deploy window Test Kitchen Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1530) [15:34:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:35] (03PS1) 10Urbanecm: Activate kajwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233205 (https://phabricator.wikimedia.org/T413283) [15:41:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233205 (https://phabricator.wikimedia.org/T413283) (owner: 10Urbanecm) [15:41:51] (03Merged) 10jenkins-bot: Activate kajwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233205 (https://phabricator.wikimedia.org/T413283) (owner: 10Urbanecm) [15:42:12] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1233205|Activate kajwiki (T413283)]] [15:42:17] T413283: Create Jju Wikipedia - https://phabricator.wikimedia.org/T413283 [15:44:13] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1233205|Activate kajwiki (T413283)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:45:07] !log urbanecm@deploy2002 urbanecm: Continuing with sync
[15:49:15] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233205|Activate kajwiki (T413283)]] (duration: 07m 03s)
[15:49:20] T413283: Create Jju Wikipedia - https://phabricator.wikimedia.org/T413283
[15:53:04] (PS1) Urbanecm: kajwiki: Enable GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1233206 (https://phabricator.wikimedia.org/T415027)
[15:53:46] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233206 (https://phabricator.wikimedia.org/T415027) (owner: Urbanecm)
[15:55:42] (Merged) jenkins-bot: kajwiki: Enable GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1233206 (https://phabricator.wikimedia.org/T415027) (owner: Urbanecm)
[15:56:00] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1233206|kajwiki: Enable GrowthExperiments (T415027)]]
[15:56:06] T415027: Enable GrowthExperiments on a brand new wiki - https://phabricator.wikimedia.org/T415027
[15:57:59] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1233206|kajwiki: Enable GrowthExperiments (T415027)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:01:57] !log urbanecm@deploy2002 urbanecm: Continuing with sync
[16:06:04] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233206|kajwiki: Enable GrowthExperiments (T415027)]] (duration: 10m 04s)
[16:06:10] T415027: Enable GrowthExperiments on a brand new wiki - https://phabricator.wikimedia.org/T415027
[16:12:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1224 gradually with 4 steps - Pool db1224.eqiad.wmnet in after cloning
[16:17:20] !log dancy@deploy2002 Installing scap version "4.236.0" for 2 host(s)
[16:19:12] !log dancy@deploy2002 Installation of scap version "4.236.0" completed for 2 hosts
[16:27:30] (PS1) Urbanecm: kajwiki: Update logos and timezone [mediawiki-config] - https://gerrit.wikimedia.org/r/1233210 (https://phabricator.wikimedia.org/T413283)
[16:27:55] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233210 (https://phabricator.wikimedia.org/T413283) (owner: Urbanecm)
[16:28:56] (Merged) jenkins-bot: kajwiki: Update logos and timezone [mediawiki-config] - https://gerrit.wikimedia.org/r/1233210 (https://phabricator.wikimedia.org/T413283) (owner: Urbanecm)
[16:29:14] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1233210|kajwiki: Update logos and timezone (T413283)]]
[16:29:21] T413283: Create Jju Wikipedia - https://phabricator.wikimedia.org/T413283
[16:30:05] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1630).
[16:30:05] jdrewniak: A patch you scheduled for Wikimedia Portals Update is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:30:55] jan_drewniak: please give me a sec before starting, my scap's about to finish
[16:31:12] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1233210|kajwiki: Update logos and timezone (T413283)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:31:41] !log urbanecm@deploy2002 urbanecm: Continuing with sync
[16:34:12] (PS2) Jdrewniak: Bumping portals submodule to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1230961 (https://phabricator.wikimedia.org/T128546)
[16:34:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis kajwiki in section s5
[16:35:46] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233210|kajwiki: Update logos and timezone (T413283)]] (duration: 06m 32s)
[16:35:51] T413283: Create Jju Wikipedia - https://phabricator.wikimedia.org/T413283
[16:35:56] and done
[16:36:40] (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230961 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[16:38:00] (Merged) jenkins-bot: Bumping portals submodule to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1230961 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[16:38:18] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1230961|Bumping portals submodule to master (T128546)]]
[16:38:23] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:40:18] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:1230961|Bumping portals submodule to master (T128546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:40:40] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1233213
[16:41:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis kajwiki in section s5
[16:41:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis kajwiki in section s5
[16:43:03] (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1233213 (owner: PipelineBot)
[16:43:12] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync
[16:44:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis kajwiki in section s5
[16:44:53] (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1233213 (owner: PipelineBot)
[16:44:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230974 (https://phabricator.wikimedia.org/T415386) (owner: Arlolra)
[16:46:04] (PS1) Tiziano Fogli: centralauth: add recording rules for grafana widgets [puppet] - https://gerrit.wikimedia.org/r/1233214 (https://phabricator.wikimedia.org/T415035)
[16:46:10] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[16:46:34] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[16:47:15] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230961|Bumping portals submodule to master (T128546)]] (duration: 08m 57s)
[16:47:20] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:47:42] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[16:48:26] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[16:49:13] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[16:50:23] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[16:57:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1224 gradually with 4 steps - Pool db1224.eqiad.wmnet in after cloning
[17:06:16] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q2:install (1) SSD each into franio200[1-3] - https://phabricator.wikimedia.org/T405982#11554629 (Dwisehaupt) a:Jhancock.wm Forgot to assign @Jhancock.wm after mentioning. Doing so to bubble it up. Thanks.
[17:06:17] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:08:25] (CR) Marostegui: Revert "db1264: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1233111 (owner: Marostegui)
[17:08:27] (CR) Marostegui: [C:+2] Revert "db1264: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1233111 (owner: Marostegui)
[17:19:33] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11554756 (Arnoldokoth)
[17:19:58] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11554758 (Arnoldokoth) Open→In progress a:Arnoldokoth
[17:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:15] jouncebot: nowandnext
[17:45:15] No deployments scheduled for the next 0 hour(s) and 14 minute(s)
[17:45:15] In 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1800)
[17:45:15] In 0 hour(s) and 14 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1800)
[17:46:34] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11554903 (Arnoldokoth) @Johannnes89 Could you clear your cache and cookies and retry?
[17:51:31] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11554917 (Arnoldokoth) @Ottomata / @Milimetric / @Ahoelzl Kindly approve.
[17:54:08] (PS1) Dreamy Jazz: Add CheckUser Suggested Investigations stream to ext-EventLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/1233228
[17:54:27] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233228 (owner: Dreamy Jazz)
[17:55:40] (Merged) jenkins-bot: Add CheckUser Suggested Investigations stream to ext-EventLogging [mediawiki-config] - https://gerrit.wikimedia.org/r/1233228 (owner: Dreamy Jazz)
[17:56:01] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1233228|Add CheckUser Suggested Investigations stream to ext-EventLogging]]
[17:57:54] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1233228|Add CheckUser Suggested Investigations stream to ext-EventLogging]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:59:33] Testing....
[18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1800)
[18:00:04] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T1800)
[18:02:53] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[18:09:13] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233228|Add CheckUser Suggested Investigations stream to ext-EventLogging]] (duration: 13m 12s)
[18:11:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1264 gradually with 4 steps - Pool db1264.eqiad.wmnet in after cloning
[18:23:37] (PS17) CDobbins: prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641)
[18:24:21] (CR) CI reject: [V:-1] prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins)
[18:56:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1264 gradually with 4 steps - Pool db1264.eqiad.wmnet in after cloning
[18:56:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1224.eqiad.wmnet onto db1264.eqiad.wmnet
[19:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:34:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:56:17] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:01:46] (PS18) CDobbins: prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641)
[20:07:22] (CR) CDobbins: prometheus: add depooled cp* host check (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins)
[20:18:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87942 and previous config saved to /var/cache/conftool/dbconfig/20260126-201834-marostegui.json
[20:18:44] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[20:18:44] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[20:21:01] (PS1) Ebernhardson: Disable wikidata completion search AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/1233258 (https://phabricator.wikimedia.org/T306644)
[20:21:32] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233258 (https://phabricator.wikimedia.org/T306644) (owner: Ebernhardson)
[20:26:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230931 (https://phabricator.wikimedia.org/T415372) (owner: Jdrewniak)
[20:26:46] PROBLEM - SSH on stat1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:27:36] RECOVERY - SSH on stat1009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:27:46] (PS1) Pppery: Handle phutilnumber type [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1233259
[20:28:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P87943 and previous config saved to /var/cache/conftool/dbconfig/20260126-202843-marostegui.json
[20:30:15] (PS10) Pppery: Extract strings from US English locale as source strings and apply PLURAL [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217844 (https://phabricator.wikimedia.org/T412421)
[20:30:22] (PS4) Pppery: Handle all format specifiers [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221131 (https://phabricator.wikimedia.org/T413529)
[20:30:24] (PS4) Pppery: Add a `bin/translatewiki roundtrip` workflow to validate the string-mangling code [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532)
[20:38:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P87944 and previous config saved to /var/cache/conftool/dbconfig/20260126-203851-marostegui.json
[20:49:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87945 and previous config saved to /var/cache/conftool/dbconfig/20260126-204900-marostegui.json
[20:49:07] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[20:49:07] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[20:49:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[20:49:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1203 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87946 and previous config saved to /var/cache/conftool/dbconfig/20260126-204925-marostegui.json
[20:52:43] (CR) Ayounsi: [C:+1] Netops: link to more-specific dashboards for interface based alerts [alerts] - https://gerrit.wikimedia.org/r/1229163 (owner: Cathal Mooney)
[20:57:27] (CR) Ayounsi: [C:+1] "The line might be too long for IRC and make it break over 2 lines." [alerts] - https://gerrit.wikimedia.org/r/1229163 (owner: Cathal Mooney)
[21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T2100).
[21:00:05] arlolra, ebernhardson, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:08] \o
[21:00:33] hello, I can get started with my patch
[21:01:01] o/ sounds good, I'll do mine last, it might take a while.
[21:01:16] (CR) TrainBranchBot: [C:+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230974 (https://phabricator.wikimedia.org/T415386) (owner: Arlolra)
[21:02:12] (Merged) jenkins-bot: Deploy PRV to 21 wikis + bump 3 top50 to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/1230974 (https://phabricator.wikimedia.org/T415386) (owner: Arlolra)
[21:02:29] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1230974|Deploy PRV to 21 wikis + bump 3 top50 to 100% (T415386)]]
[21:02:34] T415386: Parsoid Read Views to deploy ~2026-01-26 - https://phabricator.wikimedia.org/T415386
[21:04:27] !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1230974|Deploy PRV to 21 wikis + bump 3 top50 to 100% (T415386)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:06:17] !log arlolra@deploy2002 arlolra: Continuing with sync
[21:10:25] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230974|Deploy PRV to 21 wikis + bump 3 top50 to 100% (T415386)]] (duration: 07m 56s)
[21:10:31] T415386: Parsoid Read Views to deploy ~2026-01-26 - https://phabricator.wikimedia.org/T415386
[21:10:35] ebernhardson: all yours
[21:14:40] arlolra: thanks
[21:14:57] (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233258 (https://phabricator.wikimedia.org/T306644) (owner: Ebernhardson)
[21:15:50] (Merged) jenkins-bot: Disable wikidata completion search AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/1233258 (https://phabricator.wikimedia.org/T306644) (owner: Ebernhardson)
[21:16:09] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1233258|Disable wikidata completion search AB test (T306644)]]
[21:16:14] T306644: re-run wbsearchentities optimization process - https://phabricator.wikimedia.org/T306644
[21:18:04] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1233258|Disable wikidata completion search AB test (T306644)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:20:54] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync
[21:24:58] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233258|Disable wikidata completion search AB test (T306644)]] (duration: 08m 49s)
[21:25:03] T306644: re-run wbsearchentities optimization process - https://phabricator.wikimedia.org/T306644
[21:25:09] jan_drewniak: it's ready for you
[21:25:39] ebernhardson: thanks!
[21:26:18] (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230931 (https://phabricator.wikimedia.org/T415372) (owner: Jdrewniak)
[21:27:24] (Merged) jenkins-bot: WP25EasterEggs added to extension-list, config var, enabled on beta cluster. [mediawiki-config] - https://gerrit.wikimedia.org/r/1230931 (https://phabricator.wikimedia.org/T415372) (owner: Jdrewniak)
[21:27:41] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1230931|WP25EasterEggs added to extension-list, config var, enabled on beta cluster. (T415372)]]
[21:27:45] T415372: Enable extension:WP25EasterEgg on beta cluster - https://phabricator.wikimedia.org/T415372
[21:28:15] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594 (Sucheta-Salgaonkar-WMF) NEW
[21:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:55:15] SRE, Traffic: High rate of broken thumbnails on Discord embeds of links to Wikimedia sites - https://phabricator.wikimedia.org/T415598#11555698 (taavi)
[21:55:40] SRE, Traffic: High rate of broken thumbnails on Discord embeds of links to Wikimedia sites - https://phabricator.wikimedia.org/T415598#11555699 (AntiCompositeNumber) This started happening within the last month or so. It doesn't happen for every request for every thumbnail, but seems to be more likely af...
[22:00:05] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T2200). Please do the needful.
[22:00:39] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:1230931|WP25EasterEggs added to extension-list, config var, enabled on beta cluster. (T415372)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:00:44] T415372: Enable extension:WP25EasterEgg on beta cluster - https://phabricator.wikimedia.org/T415372
[22:01:17] FIRING: [4x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:06:13] preparing to do a security deploy
[22:10:16] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync
[22:12:58] security deploy in progress
[22:19:10] scap is taking ages
[22:21:23] going to try running scap again, looks like it's hanging
[22:22:38] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230931|WP25EasterEggs added to extension-list, config var, enabled on beta cluster. (T415372)]] (duration: 54m 57s)
[22:22:44] T415372: Enable extension:WP25EasterEgg on beta cluster - https://phabricator.wikimedia.org/T415372
[22:23:34] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11555775 (Johannnes89) I'm very sorry I simply submitted the wrong username. Can't remember why but apparently I chose `j89` years ago: https://ldap.toolforge.org/user/j89
[22:24:13] scap running for security deploy
[22:29:01] !log mstyles Deployed security patch for T412061
[22:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:05] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11555784 (Izno) I just fixed https://en.wikipedia.org/wiki/User:Bradv/Scripts/ExpandDiffs.js#L-20 which was previously showing a bla...
[22:32:39] security deploys finished for today
[22:36:14] maryum: sorry I had a backport deploy run very long (Duration 56m 33s) wasn't looking at IRC but I hope it didn't interfere with the security deploy! (it was for beta cluster so I don't think it would).
[22:44:31] jouncebot: nowandnext
[22:44:31] For the next 1 hour(s) and 15 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260126T2200)
[22:44:31] In 1 hour(s) and 15 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260127T0000)
[22:46:23] (CR) Zabe: [C:+2] Start reading from il_target_id everywhere besides enwiki and commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1233171 (https://phabricator.wikimedia.org/T413669) (owner: Zabe)
[22:47:38] (Merged) jenkins-bot: Start reading from il_target_id everywhere besides enwiki and commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1233171 (https://phabricator.wikimedia.org/T413669) (owner: Zabe)
[22:50:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:38] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1233171|Start reading from il_target_id everywhere besides enwiki and commons (T413669)]]
[22:54:42] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669
[22:56:32] !log zabe@deploy2002 zabe: Backport for [[gerrit:1233171|Start reading from il_target_id everywhere besides enwiki and commons (T413669)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:57:12] !log zabe@deploy2002 zabe: Continuing with sync
[23:01:16] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233171|Start reading from il_target_id everywhere besides enwiki and commons (T413669)]] (duration: 06m 38s)
[23:01:20] T413669: Set imagelinks migration to read new - https://phabricator.wikimedia.org/T413669
[23:03:13] SRE, Traffic: High rate of broken thumbnails on Discord embeds of links to Wikimedia sites - https://phabricator.wikimedia.org/T415598#11555931 (TheDJ) The ogp.me tag is lis...
[23:03:36] SRE, Traffic: High rate of broken thumbnails on Discord embeds of links to Wikimedia sites - https://phabricator.wikimedia.org/T415598#11555944 (TheDJ)
[23:03:39] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11555945 (TheDJ)
[23:04:33] SRE, Traffic: OGP lists thumbnail version of fullsize instead the fullsize version itself - https://phabricator.wikimedia.org/T415598#11555947 (TheDJ)
[23:11:52] SRE, Traffic: OGP lists fullsize thumbnail version of original instead the original itself - https://phabricator.wikimedia.org/T415598#11555956 (TheDJ)
[23:13:18] SRE, Traffic: OGP lists fullsize thumbnail version of original instead the original itself - https://phabricator.wikimedia.org/T415598#11555961 (AntiCompositeNumber) >>! In T415598#11555931, @TheDJ wrote: > The ogp.me tag is listing the thumbnail variant of the fullsize, instead of the fullsize I'm not...
[23:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:34:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:41:08] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp2040 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:42:08] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp2040 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server