[00:08:10] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1196988 [00:08:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1196988 (owner: TrainBranchBot) [00:13:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:13:57] (PS2) Jdlrobson: [labs] Move namespaces to audience definition [mediawiki-config] - https://gerrit.wikimedia.org/r/1194304 (https://phabricator.wikimedia.org/T404152) [00:27:21] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1196988 (owner: TrainBranchBot) [00:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:43:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:48:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:28:10] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:44:22] (PS1) RLazarus: deployment_server: Add --priority to charlie [puppet] - https://gerrit.wikimedia.org/r/1196989 (https://phabricator.wikimedia.org/T406212) [02:44:24] (PS1) RLazarus: deployment_server: Add --dangerously_fast to charlie [puppet] - https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212) [02:48:23] (CR) RLazarus: "Happy to bikeshed on the flag name!" 
[puppet] - https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212) (owner: RLazarus) [03:43:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... [03:43:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [03:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:48:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... [03:48:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [03:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:51:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... 
[03:51:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [03:55:01] PROBLEM - PyBal backends health check on lvs2012 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_443: Servers cp2040.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:56:01] RECOVERY - PyBal backends health check on lvs2012 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:56:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... [03:56:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [03:57:11] !incidents [03:57:13] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [03:57:13] 6882 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [03:59:03] PROBLEM - PyBal backends health check on lvs2012 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp2034.codfw.wmnet are marked down but pooled: uploadlb_443: Servers cp2034.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:00:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... 
[04:00:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [04:01:05] !incidents [04:01:05] 6884 (UNACKED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:01:05] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:01:05] 6882 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:01:10] !ack 6884 [04:01:10] 6884 (ACKED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:01:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:01:49] !incidents [04:01:50] 6884 (ACKED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:01:50] 6885 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [04:01:50] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:01:50] 6882 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:01:53] !ack 6885 [04:01:54] 6885 (ACKED) HaproxyUnavailable cache_upload global sre 
(thanos-rule) [04:02:57] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:05:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: ... [04:05:51] Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [04:06:36] !incidents [04:06:36] 6885 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [04:06:37] 6886 (ACKED) ProbeDown sre (208.80.153.240 ip4 upload-https:443 probes/service http_upload-https_ip4 codfw) [04:06:37] 6884 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:06:37] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:06:37] 6882 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:06:53] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:07:57] RESOLVED: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:09:23] RESOLVED: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:11:01] RECOVERY - PyBal backends health check on lvs2012 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:15:49] !incidents [04:15:49] 6885 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [04:15:50] 6886 (RESOLVED) ProbeDown sre (208.80.153.240 ip4 upload-https:443 probes/service http_upload-https_ip4 codfw) [04:15:50] 6884 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:15:50] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:15:50] 6882 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [04:16:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:43:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale 
[04:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:08:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:38:28] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:06:51] (PS2) Phuedx: MetricsPlatform: Initialize 
$wgMetricsPlatformExperimentStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) [06:07:12] (CR) Phuedx: MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: Phuedx) [06:08:32] (PS3) Phuedx: MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) [06:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:37:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [06:37:57] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:38:53] !incidents [06:38:54] 6887 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [06:38:54] 6888 (UNACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [06:38:54] 6885 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [06:38:54] 6886 (RESOLVED) ProbeDown sre (208.80.153.240 ip4 upload-https:443 probes/service http_upload-https_ip4 codfw) [06:38:54] 6884 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 
Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [06:38:55] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [06:38:55] 6882 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [06:38:57] !ack 6888 [06:38:58] 6888 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [06:39:30] FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [06:42:58] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:43:40] !incidents [06:43:40] 6887 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [06:43:40] 6888 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [06:43:41] 6885 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [06:43:41] 6886 (RESOLVED) ProbeDown sre (208.80.153.240 ip4 upload-https:443 probes/service http_upload-https_ip4 codfw) [06:43:41] 6884 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [06:43:41] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [06:43:41] 6882 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 
Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [06:44:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 7 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [06:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:48:28] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:02:58] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:03:07] !incidents [07:03:07] 6887 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [07:03:07] 6888 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [07:03:08] 6885 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [07:03:08] 6886 (RESOLVED) ProbeDown sre (208.80.153.240 ip4 upload-https:443 probes/service http_upload-https_ip4 codfw) [07:03:08] 6884 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:03:08] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre 
(cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:03:08] 6882 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:03:28] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:05:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:12:57] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:13:49] !incidents [07:13:50] 6887 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [07:13:50] 6888 (RESOLVED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [07:13:50] 6885 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [07:13:50] 6886 (RESOLVED) ProbeDown sre (208.80.153.240 ip4 upload-https:443 probes/service http_upload-https_ip4 codfw) [07:13:50] 6884 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:13:51] 6883 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:13:51] 6882 (RESOLVED) 
TransitPeeringOutboundSaturation network sre (cr1-codfw:9804 Transit: Arelion (IC-308846) {#10905_12273-1} xe-1/0/1:0 gnmi codfw) [07:14:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 3 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [07:17:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:11:07] (CR) SD0001: [C:+1] "Harmless no-op change." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1192528 (https://phabricator.wikimedia.org/T406023) (owner: TheDJ) [08:21:38] (PS1) BCornwall: add enwp.org [dns] - https://gerrit.wikimedia.org/r/1196994 [08:22:01] (PS2) BCornwall: Add enwp.org [dns] - https://gerrit.wikimedia.org/r/1196994 (https://phabricator.wikimedia.org/T332220) [08:22:44] (CR) BCornwall: [V:+2 C:+2] Add enwp.org [dns] - https://gerrit.wikimedia.org/r/1196994 (https://phabricator.wikimedia.org/T332220) (owner: BCornwall) [08:23:45] !log brett@dns1004 START - running authdns-update [08:25:04] !log brett@dns1004 END - running authdns-update [08:26:17] (CR) BCornwall: [C:+2] ncredir: Add enwp.org/c.enwp.org redirection [puppet] - https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: BCornwall) [08:36:26] (CR) Btullis: [V:+1] Change the component from where we install elasticsearch-curator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: Btullis) [08:42:07] (PS1) BCornwall: Add c CNAME to ncredir-parking template [dns] - https://gerrit.wikimedia.org/r/1196995 (https://phabricator.wikimedia.org/T332220) [08:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:44:03] (CR) BCornwall: [C:+2] Add c CNAME to ncredir-parking template [dns] - https://gerrit.wikimedia.org/r/1196995 (https://phabricator.wikimedia.org/T332220) (owner: BCornwall) [08:44:17] !log brett@dns1004 START - running authdns-update [08:45:00] !log brett@dns1004 END - running authdns-update [08:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:57:48] (PS1) Superpes15: [specieswiki] Enable USERLANGUAGE magic word [mediawiki-config] - https://gerrit.wikimedia.org/r/1197004 (https://phabricator.wikimedia.org/T406583) [12:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:44] 
FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:08:28] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:28] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:43:44] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:52:33] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:10:14] help [21:28:11] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:35:07] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:37:11] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:44] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:38:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1197022 [23:38:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1197022 (owner: 10TrainBranchBot) [23:45:21] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:50:14] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1197022 (owner: 10TrainBranchBot) [23:51:48] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown