[00:10:27] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:39:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1221999 [00:39:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1221999 (owner: 10TrainBranchBot) [00:53:23] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1221999 (owner: 10TrainBranchBot) [01:00:51] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1222001 [01:10:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1222001 (owner: 10TrainBranchBot) [01:15:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:21:44] FIRING: [4x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:25:46] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:31:13] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1222001 (owner: 10TrainBranchBot) [01:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:35:46] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:37:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:12:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:18:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:38:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:39:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:54:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:57:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:04:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:07:56] PROBLEM - Host bast6003 is DOWN: PING CRITICAL - Packet loss = 100% [03:07:56] PROBLEM - Host asw1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [03:08:30] RECOVERY - Host asw1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 91.15 ms [03:09:02] RECOVERY - Host bast6003 is UP: PING OK - Packet loss = 0%, RTA = 86.56 ms [03:19:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:47:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:48:39] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [03:56:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:10:27] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:21:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:26:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:53:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:01:16] RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:21:44] FIRING: [4x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:40:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:2 (Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [05:45:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:2 (Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [05:47:25] !incidents [05:47:25] 7257 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3} xe-1/1/1:2 gnmi codfw) [05:47:25] 7256 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [05:47:38] !ack 7257 [05:47:38] 7257 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3} xe-1/1/1:2 gnmi codfw) [06:00:51] FIRING: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:2 (Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:01:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:03:24] <_joe_> !incidents [06:03:25] 7257 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3} xe-1/1/1:2 gnmi codfw) [06:03:25] 7256 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [06:10:51] FIRING: [4x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:2 (Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:15:51] FIRING: [4x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:2 (Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:20:51] FIRING: [4x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:2 (Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:25:51] RESOLVED: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [06:25:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:27:45] <_joe_> !ack [06:27:45] no value provided for parameter incident and no default available [06:27:45] Incident id must be an integer [06:27:53] <_joe_> !incidents [06:27:54] 7257 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr2-eqdfw:xe-0/1/0 (CyrusOne, IX1.DFW4.2/4) {#11402_12295-3} xe-1/1/1:2 gnmi codfw) [06:27:54] 7256 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [06:28:05] <_joe_> uhm so that last "page"... whatever [07:46:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:47:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:48:39] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251231T0800) [08:07:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:10:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:10:27] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:16:29] FIRING: [6x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:20:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:25:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:35:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:37:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:42:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:45:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:50:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:56:36] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:01:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:01:36] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:06:36] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:11:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:16:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:16:36] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:21:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:10:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:15:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:20:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:26:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:04:58] 06SRE, 10API Platform: several HTTP 503 and 504 errors - https://phabricator.wikimedia.org/T413622#11487839 (10Peachey88) [11:07:08] PROBLEM - Host install7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.100) [11:07:10] PROBLEM - Host prometheus7002 is DOWN: CRITICAL - Time to live exceeded (10.140.2.5) [11:07:10] PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8) [11:07:10] PROBLEM - Host tcp-proxy7001 is DOWN: CRITICAL - Time to live exceeded (10.140.2.10) [11:07:10] PROBLEM - Host tcp-proxy7002 is DOWN: CRITICAL - Time to live exceeded (10.140.2.11) [11:07:10] PROBLEM - Host hcaptcha-proxy7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.103) [11:07:10] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [11:07:10] PROBLEM - Host asw1-b3-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.130) [11:07:11] PROBLEM - Host asw1-b4-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.131) [11:07:12] PROBLEM - Host hcaptcha-proxy7001 is DOWN: CRITICAL - Time to live exceeded (195.200.68.102) [11:07:28] RECOVERY - Host install7002 is UP: PING OK - Packet loss = 0%, RTA = 136.98 ms [11:07:30] RECOVERY - Host tcp-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 136.98 ms [11:07:30] RECOVERY - Host hcaptcha-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 137.17 ms [11:07:30] RECOVERY - Host hcaptcha-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 137.01 ms [11:07:30] RECOVERY - Host tcp-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 137.03 ms [11:07:30] RECOVERY - Host prometheus7002 is UP: PING OK - Packet loss = 0%, RTA = 137.20 ms [11:07:32] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 137.22 ms [11:07:32] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 137.14 ms [11:07:32] RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 137.55 ms [11:07:32] RECOVERY - Host asw1-b3-magru is UP: PING OK - Packet loss = 0%, RTA = 144.58 ms [11:17:51] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:31:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:36:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:48:39] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [12:01:27] (03CR) 10Agamyasamuel: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1221980 (owner: 10Divyaratann Srivastava) [12:03:19] (03CR) 10CI reject: [V:04-1] T201491: Fix grammar in comments [puppet] - 10https://gerrit.wikimedia.org/r/1221980 (owner: 10Divyaratann Srivastava) [12:10:28] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:16:44] FIRING: [6x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:41:29] FIRING: [8x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:01] 06SRE, 10API Platform: several HTTP 503 and 504 errors - https://phabricator.wikimedia.org/T413622#11487874 (10Aklapper) @doctaxon: Which of the APIs is this about? [12:50:31] 06SRE, 10API Platform: several HTTP 503 and 504 errors - https://phabricator.wikimedia.org/T413622#11487875 (10Aklapper) And what's an example action? This is currently too vague that it could be investigated... [12:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:28] (03PS1) 10Divyaratann Srivastava: Fix typo in success messages for sending labels and descriptions in success messages for sending labels and descriptions [puppet] - 10https://gerrit.wikimedia.org/r/1222271 (https://phabricator.wikimedia.org/T201491) [13:07:40] PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101) [13:07:40] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [13:08:00] PROBLEM - Host prometheus7002 is DOWN: CRITICAL - Time to live exceeded (10.140.2.5) [13:08:00] PROBLEM - Host tcp-proxy7002 is DOWN: CRITICAL - Time to live exceeded (10.140.2.11) [13:08:00] PROBLEM - Host ncredir7004 is DOWN: CRITICAL - Time to live exceeded (10.140.2.8) [13:08:00] PROBLEM - Host tcp-proxy7001 is DOWN: CRITICAL - Time to live exceeded (10.140.2.10) [13:08:00] PROBLEM - Host hcaptcha-proxy7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.103) [13:08:00] PROBLEM - Host asw1-b4-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.131) [13:08:00] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [13:08:01] PROBLEM - Host asw1-b3-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.130) [13:08:01] RECOVERY - Host doh7004 is UP: PING WARNING - Packet loss = 66%, RTA = 271.91 ms [13:08:02] RECOVERY - Host doh7003 is UP: PING WARNING - Packet loss = 66%, RTA = 271.89 ms [13:08:14] RECOVERY - Host ncredir7004 is UP: PING OK - Packet loss = 0%, RTA = 137.22 ms [13:08:14] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 137.01 ms [13:08:14] RECOVERY - Host hcaptcha-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 137.20 ms [13:08:16] RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 143.32 ms [13:08:16] RECOVERY - Host asw1-b3-magru is UP: PING OK - Packet loss = 0%, RTA = 141.94 ms [13:08:32] RECOVERY - Host tcp-proxy7001 is UP: PING OK - Packet loss = 0%, RTA = 137.09 ms [13:08:32] RECOVERY - Host tcp-proxy7002 is UP: PING OK - Packet loss = 0%, RTA = 136.98 ms [13:08:32] RECOVERY - Host prometheus7002 is UP: PING OK - Packet loss = 0%, RTA = 136.87 ms [13:14:36] (03CR) 10Agamyasamuel: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1222271 (https://phabricator.wikimedia.org/T201491) (owner: 10Divyaratann Srivastava) [13:16:26] (03CR) 10CI reject: [V:04-1] Fix typo in success messages for sending labels and descriptions in success messages for sending labels and descriptions [puppet] - 10https://gerrit.wikimedia.org/r/1222271 (https://phabricator.wikimedia.org/T201491) (owner: 10Divyaratann Srivastava) [13:17:44] PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100% [13:22:15] (03Abandoned) 10Divyaratann Srivastava: T201491: Fix grammar in comments [puppet] - 10https://gerrit.wikimedia.org/r/1221980 (owner: 10Divyaratann Srivastava) [13:22:50] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:49:39] (03PS2) 10Divyaratann Srivastava: Fixed threads code and updated logic [puppet] - 10https://gerrit.wikimedia.org/r/1222271 (https://phabricator.wikimedia.org/T201491) [14:38:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [14:38:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [14:38:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:40:31] 06SRE, 10API Platform: several HTTP 503 and 504 errors - https://phabricator.wikimedia.org/T413622#11487930 (10doctaxon) As I stated in my original report, this occurs regardless of which API action is being performed and independent of the time of day. The issue is highly variable. That's precisely why I reco... [14:53:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [14:53:44] Deployment mw-jobrunner.codfw.main in mw-jobrunner at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.codfw.main - ... [14:53:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:26:31] 06SRE, 10API Platform: several HTTP 503 and 504 errors - https://phabricator.wikimedia.org/T413622#11487950 (10taavi) 05Open→03Invalid As Andre says this is unactionable without concrete example requests and responses. There is no such single thing as "the API" nor is it possible to magically identify... [15:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:44:28] (03CR) 10Agamyasamuel: "This is duplicate of the Change-Id: If412360d0b263111859be77430752044033d4cf8" [puppet] - 10https://gerrit.wikimedia.org/r/1222271 (https://phabricator.wikimedia.org/T201491) (owner: 10Divyaratann Srivastava) [15:48:39] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [16:10:28] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:41:44] FIRING: [8x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:23:05] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:24:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:29:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:44:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:01:16] RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:03:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:48:39] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [20:10:28] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:38:30] 06SRE, 10Wikimedia-Mailing-lists: Add an owner for mailing list - https://phabricator.wikimedia.org/T413632#11488112 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Added you as an owner. [20:40:47] 06SRE, 10Wikimedia-Mailing-lists: Create mailing list steward-clerks-l - https://phabricator.wikimedia.org/T413631#11488116 (10Ladsgroup) We don't add -l anymore. It made sense when all email addresses were handled from one domain. Looking: https://meta.wikimedia.org/wiki/Mailing_lists/Standardization it shou... [20:41:44] FIRING: [8x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:05] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:34:03] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:06:01] 06SRE, 10Wikimedia-Mailing-lists: Create mailing list steward-clerks-l - https://phabricator.wikimedia.org/T413631#11488144 (10Xaosflux) [22:06:30] 06SRE, 10Wikimedia-Mailing-lists: Create mailing list steward-clerks-l - https://phabricator.wikimedia.org/T413631#11488145 (10Xaosflux) ok, updated above [23:03:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:18:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:21:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:39:16] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:39:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:40:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 4/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:40:18] PROBLEM - Host ganeti3008 is DOWN: PING CRITICAL - Packet loss = 100% [23:40:24] PROBLEM - Host cp3068 is DOWN: PING CRITICAL - Packet loss = 100% [23:40:42] PROBLEM - Host cp3072 is DOWN: PING CRITICAL - Packet loss = 100% [23:40:44] PROBLEM - Host cp3079 is DOWN: PING CRITICAL - Packet loss = 100% [23:40:44] PROBLEM - Host cp3069 is DOWN: PING CRITICAL - Packet loss = 100% [23:40:48] PROBLEM - SSH on lvs3008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:40:52] RECOVERY - Host cp3072 is UP: PING WARNING - Packet loss = 66%, RTA = 91.83 ms [23:40:54] RECOVERY - Host cp3068 is UP: PING WARNING - Packet loss = 71%, RTA = 88.47 ms [23:40:54] RECOVERY - Host cp3079 is UP: PING WARNING - Packet loss = 50%, RTA = 85.67 ms [23:40:54] RECOVERY - Host cp3069 is UP: PING WARNING - Packet loss = 66%, RTA = 83.39 ms [23:41:06] RECOVERY - Host ganeti3008 is UP: PING OK - Packet loss = 0%, RTA = 78.35 ms [23:41:12] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:41:16] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:41:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:41:40] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:41:42] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:42:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.215 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:42:22] PROBLEM - Host cp3079 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:30] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3070 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:42:38] RECOVERY - Host cp3079 is UP: PING WARNING - Packet loss = 71%, RTA = 85.90 ms [23:42:38] RECOVERY - SSH on lvs3008 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:42:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (2620:0:860:fe0a::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:42:40] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp3078 is OK: HTTP OK: HTTP/1.1 200 OK - 50301 bytes in 7.538 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:42:40] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:43:30] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp3080 is OK: HTTP OK: HTTP/1.0 200 OK - 37007 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:45:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:47:10] RESOLVED: [8x] BFDdown: BFD session down between cr1-eqiad and 208.80.153.215 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:47:39] RESOLVED: [5x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:48:39] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown