[00:01:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P88347 and previous config saved to /var/cache/conftool/dbconfig/20260131-000125-marostegui.json [00:16:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P88348 and previous config saved to /var/cache/conftool/dbconfig/20260131-001634-marostegui.json [00:31:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T415786)', diff saved to https://phabricator.wikimedia.org/P88349 and previous config saved to /var/cache/conftool/dbconfig/20260131-003142-marostegui.json [00:31:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [00:31:48] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [00:33:10] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:33:50] rzl, sukhe we've decided to wait until monday [00:34:02] cscott: understood, have a good weekend! [00:34:25] we'll have a lot more people home and recovered from their jet lag by then, that's much appreciated [00:40:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1235433 [00:40:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1235433 (owner: 10TrainBranchBot) [00:47:45] (03CR) 10ROSKO JENKINS TECKCORP: "greetings contact me at teckcorp.office@gmail.com i have questions thanks" [software] - 10https://gerrit.wikimedia.org/r/145509 (owner: 10Springle) [00:51:57] (03PS1) 10Zabe: Upgrading psy/psysh (v0.12.10 => v0.12.19) [vendor] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235434 (https://phabricator.wikimedia.org/T416050) [00:52:52] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1235433 (owner: 10TrainBranchBot) [00:53:12] (03Abandoned) 10Zabe: Upgrading psy/psysh (v0.12.10 => v0.12.19) [vendor] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235418 (https://phabricator.wikimedia.org/T416050) (owner: 10Reedy) [00:54:11] (03Restored) 10Zabe: Upgrading psy/psysh (v0.12.10 => v0.12.19) [vendor] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235418 (https://phabricator.wikimedia.org/T416050) (owner: 10Reedy) [00:54:13] (03Abandoned) 10Zabe: Upgrading psy/psysh (v0.12.10 => v0.12.19) [vendor] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235434 (https://phabricator.wikimedia.org/T416050) (owner: 10Zabe) [01:10:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1235435 [01:10:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1235435 (owner: 10TrainBranchBot) [01:12:10] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062 (10Reedy) 03NEW [01:13:16] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062#11571135 (10Reedy) [01:34:08] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1199 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T416066 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [01:34:24] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066 (10ops-monitoring-bot) 03NEW [01:39:28] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1235435 (owner: 10TrainBranchBot) [02:00:47] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:04:42] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:23:29] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 22m 42s) [02:23:52] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:30:42] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:19:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:45:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:33:10] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:02:10] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [05:03:20] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [05:04:15] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:05:11] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:04:42] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:19:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:37:54] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:45:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:07] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235193 (owner: 10RLazarus) [10:04:42] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:27] (03PS1) 10Kosta Harlan: BlockUtils: Log x-provenance and IP reputation fields [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235458 (https://phabricator.wikimedia.org/T415354) [11:19:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:24:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [11:24:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [11:24:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [11:45:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:59:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [11:59:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [11:59:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:42] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:12] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Mon 16 Feb 2026 03:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [15:19:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:26:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:34:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:25:23] (03CR) 10Dzahn: [C:03+2] cumin/insetup_role_report: send wmcs report to Mark and Levi, not Alex [puppet] - 10https://gerrit.wikimedia.org/r/1235071 (owner: 10Dzahn) [16:33:48] (03PS2) 10Dzahn: admin/nagios/wmcs: offboard Alex (akosiaris) [puppet] - 10https://gerrit.wikimedia.org/r/1235066 [16:34:06] (03CR) 10CI reject: [V:04-1] admin/nagios/wmcs: offboard Alex (akosiaris) [puppet] - 10https://gerrit.wikimedia.org/r/1235066 (owner: 10Dzahn) [16:36:06] (03PS3) 10Dzahn: admin/nagios/wmcs: offboard Alex (akosiaris) [puppet] - 10https://gerrit.wikimedia.org/r/1235066 [17:13:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [17:13:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [17:13:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:24:37] (03PS1) 10Func: filebackend: Clean up removed config params for multi-write backends [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235491 (https://phabricator.wikimedia.org/T328872) [17:38:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [17:38:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [17:38:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:04:42] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:42] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:16:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:16:42] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:21:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:19:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:00:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:10:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:06:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [21:06:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [21:06:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [21:13:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:18:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:39:15] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:44:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:46:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [21:46:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [21:46:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [22:04:42] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:54] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:35:51] (03PS3) 10Reedy: Bump wikimedia/parsoid to 0.23.0-a13.1 [vendor] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235384 (https://phabricator.wikimedia.org/T415888) (owner: 10C. Scott Ananian) [22:36:40] (03CR) 10Reedy: "Rebased it ontop of the psysh patch due to T416050 (and likely CI failures). Feel free just to deploy that one too." [vendor] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235384 (https://phabricator.wikimedia.org/T415888) (owner: 10C. Scott Ananian) [22:38:44] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:14:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [23:14:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [23:14:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [23:19:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:24:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [23:24:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [23:24:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [23:31:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [23:31:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [23:31:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas