[00:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:20:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:21:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:40:03] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1221104 [00:40:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1221104 (owner: TrainBranchBot) [00:44:44] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 75%, RTA = 5579.10 ms [00:45:10] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 0%, RTA = 942.22 ms [00:46:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:52:25] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1221104 (owner: TrainBranchBot) [01:00:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:02] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2003.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [01:10:18] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1221105 [01:10:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1221105 (owner: TrainBranchBot) [01:11:02] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:24:51] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 24m 10s) [01:27:54] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [01:28:08] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 137.02 ms [01:32:28] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1221105 (owner: TrainBranchBot) [01:48:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:03:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:04:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:24:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:25:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:26:43] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:31:12] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2917.26 ms [02:31:36] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [02:35:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:23:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [03:49:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:51:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:01:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:06:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:26:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:27:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:52:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:53:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:03:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn 
[05:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:33:32] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 8443.39 ms [05:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:22] RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [05:53:14] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 100% [05:54:10] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [06:04:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:14:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:21:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:26:43] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[06:36:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:23:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:35:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:20:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:21:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:26:43] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:30:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:32:22] o/ [11:32:50] acked the incidnet [11:32:53] *incident [11:33:48] I acked it in the VO app earlier, not sure if that propagated yet [11:33:50] PROBLEM - Host wikikube-worker1016 is DOWN: PING CRITICAL - Packet loss = 100% [11:34:18] RECOVERY - Host wikikube-worker1016 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [11:34:39] moritzm: o/ [11:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [11:51:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:14:19] !incidents [12:14:20] 7244 (ACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:14:20] 7243 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [12:14:20] 7242 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:14:21] 7241 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:30:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [12:50:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:20:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:23:16] 
FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:33:47] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:10:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:26:43] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - 
https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [16:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:13:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:16:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:21:10] PROBLEM - Ensure traffic_server is running for instance backend on cp7004 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:22:10] RECOVERY - Ensure traffic_server is running for instance backend on cp7004 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:31:05] (PS1) Pppery: Delete the translations/ dir before regenerating [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221130 [16:31:44] (PS2) Pppery: Delete the translations/ dir before regenerating [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221130 [17:11:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:14:41] (PS1) Pppery: Handle all format specifiers [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221131 (https://phabricator.wikimedia.org/T413529) 
[17:18:50] (CR) Thiemo Kreuz (WMDE): Delete the translations/ dir before regenerating (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221130 (owner: Pppery) [17:34:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:35:03] (CR) Pppery: Delete the translations/ dir before regenerating (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221130 (owner: Pppery) [17:39:45] (CR) Thiemo Kreuz (WMDE): Delete the translations/ dir before regenerating (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221130 (owner: Pppery) [17:56:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:57:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:07:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:11:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:26:43] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:07] (PS3) Pppery: Delete the translations/ dir before regenerating [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221130 [19:05:12] (CR) Pppery: Delete the translations/ dir before regenerating (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221130 (owner: Pppery) [19:06:41] (PS1) Pppery: Rebuild library map automatically after generating files [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221162 [19:08:24] (CR) Pppery: "I just realized there's a surprising amount of subtlety here; with this patch the mainloop runs with the library map out of date. Since th" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221130 (owner: Pppery) [19:09:29] (CR) Pppery: Rebuild library map automatically after generating files (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221162 (owner: Pppery) [19:10:24] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [19:10:50] PROBLEM - Host doh7004 is DOWN: PING CRITICAL - Packet loss = 100% [19:10:50] PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100% [19:10:58] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 137.32 ms [19:11:06] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 137.07 ms [19:11:14] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 137.22 ms [19:15:36] PROBLEM - Host ncredir7003 is DOWN: PING CRITICAL - Packet loss = 100% [19:15:40] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [19:15:40] PROBLEM - Host doh7004 is DOWN: 
CRITICAL - Time to live exceeded (195.200.68.101) [19:15:50] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 137.20 ms [19:15:54] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 137.02 ms [19:16:02] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 136.91 ms [19:23:55] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:26:40] PROBLEM - Host doh7003 is DOWN: CRITICAL - Time to live exceeded (195.200.68.98) [19:26:40] PROBLEM - Host doh7004 is DOWN: CRITICAL - Time to live exceeded (195.200.68.101) [19:26:58] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 137.12 ms [19:26:58] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 137.09 ms [19:35:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:36:55] !incidents [19:36:56] 7245 (UNACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [19:36:56] 7244 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [19:36:56] 7243 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [19:36:56] 7242 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:37:37] !ack 7245 [19:37:38] 7245 (ACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) 
[19:40:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [19:56:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:17:36] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 2819.33 ms [20:18:10] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [20:54:28] PROBLEM - Thanos swift https on thanos-fe1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [20:55:26] RECOVERY - Thanos swift https on thanos-fe1007 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 7.990 second response time https://wikitech.wikimedia.org/wiki/Thanos [20:58:18] (03PS9) 10Pppery: Extract strings from US English locale as source strings and apply PLURAL [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217844 
(https://phabricator.wikimedia.org/T412421) [20:58:18] (03PS2) 10Pppery: Handle all format specifiers [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221131 (https://phabricator.wikimedia.org/T413529) [21:01:30] (03PS1) 10Pppery: Add a `bin/translatewiki roundtrip` workflow to validate the string-mangling code [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532) [21:07:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:08:19] o/ [21:12:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:12:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:30:26] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/x0zp5851 on k8s@codfw 
in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:34:14] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:45:42] (03PS3) 10Pppery: Handle all format specifiers [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221131 (https://phabricator.wikimedia.org/T413529) [21:46:12] (03PS2) 10Pppery: Add a `bin/translatewiki roundtrip` workflow to validate the string-mangling code [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532) [22:02:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:03:41] (03PS1) 10Pppery: Set up `arc lint`, make it pass, update README [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) [22:03:58] (03PS2) 10Pppery: Set up `arc lint`, make it pass, update README [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) [22:05:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:07:31] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:10:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:26:43] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:32:38] 06SRE, 06serviceops: Decide whether to exclude {api,rest}-gateway-ro from ATSBackendErrorsHigh - https://phabricator.wikimedia.org/T413544 (10RLazarus) 03NEW [22:33:11] (03PS1) 10RLazarus: team-sre: Exclude {api,rest}-gateway-ro from ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1221195 (https://phabricator.wikimedia.org/T413544) [22:34:39] 06SRE, 06serviceops, 13Patch-For-Review: Decide whether to exclude {api,rest}-gateway-ro from ATSBackendErrorsHigh - https://phabricator.wikimedia.org/T413544#11485690 (10RLazarus) Proposing https://gerrit.wikimedia.org/r/1221195 as an interim solution over the break, and once we're back we can either keep i... 
[22:39:11] (CR) JHathaway: [C:+1] "looks good" [alerts] - https://gerrit.wikimedia.org/r/1221195 (https://phabricator.wikimedia.org/T413544) (owner: RLazarus) [22:39:42] (CR) RLazarus: [C:+2] team-sre: Exclude {api,rest}-gateway-ro from ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1221195 (https://phabricator.wikimedia.org/T413544) (owner: RLazarus) [22:41:25] (Merged) jenkins-bot: team-sre: Exclude {api,rest}-gateway-ro from ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1221195 (https://phabricator.wikimedia.org/T413544) (owner: RLazarus) [22:55:26] (PS3) Pppery: Add a `bin/translatewiki roundtrip` workflow to validate the string-mangling code [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221180 (https://phabricator.wikimedia.org/T413532) [22:58:17] (PS3) Pppery: Set up `arc lint`, make it pass, update README [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221191 (https://phabricator.wikimedia.org/T413531) [23:29:26] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 90%, RTA = 9100.04 ms [23:30:06] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 50%, RTA = 1271.94 ms [23:32:16] RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:33:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:38:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:41:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:46:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [23:52:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:57:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate