[00:00:39] RECOVERY - netbox HTTPS on netbox1001 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.787 second response time https://wikitech.wikimedia.org/wiki/Netbox [00:00:47] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:01:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:01:49] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:02:31] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:02:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:04:51] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:07:15] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:07:20] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [00:07:33] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:08:01] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:08:09] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:08:41] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:09:09] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:10:05] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 45.91 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:16:25] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 72.06 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:17:27] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80% [00:17:39] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:18:39] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:19:33] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 40.09 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:19:45] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:20:39] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:22:27] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:22:57] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:23:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:27:21] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [00:28:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:28:49] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:29:07] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 34.96 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:29:19] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:29:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:30:23] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:32:15] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 134.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:33:10] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [00:33:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80% [00:33:39] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [00:34:11] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 84.16 ms [00:34:17] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:34:36] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:34:37] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:34:50] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:34:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:35:13] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:35:50] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.40 ms [00:36:11] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:36:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [00:37:05] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 14.16 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:39:02] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [00:39:15] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:40:29] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:40:33] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [00:40:54] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [00:40:57] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 83.48 ms [00:40:58] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 7.688 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:41:01] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:41:06] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 28%, RTA = 83.54 ms [00:41:07] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15796 bytes in 0.489 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:41:39] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:42:21] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:42:24] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.34 ms [00:43:51] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [00:44:01] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 83.42 ms [00:44:54] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [00:45:55] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:46:29] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:47:18] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 73%, RTA = 83.37 ms [00:48:49] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:50:11] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:50:40] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:50:56] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:51:47] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:52:00] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:52:01] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:53:09] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [00:53:29] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 84.16 ms [00:58:30] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:00:07] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:00:17] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [01:01:21] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 83.43 ms [01:01:34] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:01:36] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 0.493 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:01:40] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms [01:02:07] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:04:22] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:27] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 165 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:06:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:10:02] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:10:08] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 54%, RTA = 83.35 ms [01:10:18] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:12:15] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:13:08] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:13:23] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:14:01] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 9.609 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:14:35] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:14:38] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:14:39] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:16:13] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:17:15] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 2950 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:18:01] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:18:11] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:19:07] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:19:47] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:20:45] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:22:41] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:22:56] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:26:22] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 7.598 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:26:55] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:27:26] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:27:27] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15809 bytes in 0.521 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:27:42] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:28:33] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 395.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:31:18] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:31:22] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:31:39] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:32:37] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:33:15] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:33:59] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:34:23] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:34:59] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 56.33 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:35:30] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:36:35] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 114.3 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:37:13] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3032.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb_443: Servers cp3041.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3043.esams.wmnet, cp3033.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:37:16] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 93%, RTA = 84.25 ms [01:37:19] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:37:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:38:03] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:38:49] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:39:00] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.35 ms [01:39:13] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:39:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:39:41] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:40:32] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:41:51] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [01:41:55] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:41:55] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:41:55] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:24] PROBLEM - LVS HTTP IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:42:24] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:24] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:24] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:24] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:25] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:42:25] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:42:26] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [01:42:35] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:42:35] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:42:35] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:42:35] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:42:36] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:43] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:42:43] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:43] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [01:42:46] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:47] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:42:57] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:42:58] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:42:58] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [01:43:05] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:43:05] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:43:05] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:43:05] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [01:46:46] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:46:46] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:46:46] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:46:46] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:48:58] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:49:17] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 97.38 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:49:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 54 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [01:50:11] PROBLEM - PyBal backends health check on lvs3003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [01:50:27] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:50:31] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:50:31] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:50:33] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:50:35] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:51:05] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:51:06] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:51:06] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs1013 is CRITICAL: 6662 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1013&var-datasource=eqiad+prometheus/ops [01:51:10] !log stopped pybal + disabled puppet on lvs3003 [01:51:27] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 72.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:51:35] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 114 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [01:52:04] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:52:05] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:52:05] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:52:05] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:52:05] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:52:05] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [01:52:27] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:52:44] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [01:52:44] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [01:52:51] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 86%, RTA = 84.17 ms [01:52:51] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:52:51] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:52:51] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:52:56] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [01:52:56] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:52:57] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:53:00] PROBLEM - LVS HTTPS IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:53:05] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:53:05] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:53:05] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:53:16] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [01:53:16] PROBLEM - LVS HTTPS IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:53:16] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:54:29] PROBLEM - pybal on lvs3003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:54:41] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:54:49] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 36.77 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:54:55] PROBLEM - PyBal connections to etcd on lvs3003 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [01:55:05] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:55:11] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:55:19] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:55:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [01:55:19] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:55:21] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:55:21] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:55:21] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response wa [01:56:12] rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [01:56:13] PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/ [01:56:13] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [01:56:13] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api [01:56:13] ile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received h [01:56:13] ikimedia.org/wiki/RESTBase [01:56:14] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:56:14] 04Critical Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% [01:56:14] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:56:14] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:56:15] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:56:15] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:56:16] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:56:16] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:56:17] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:56:19] PROBLEM - NTP peers on dns4002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [01:56:25] 04Critical Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% [01:56:30] PROBLEM - LVS HTTP IPv4 #page on ncredir-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:56:49] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:57:11] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:57:11] PROBLEM - Juniper alarms on mr1-ulsfo is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 198.35.26.194 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [01:57:11] RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 171113 bytes in 8.595 second response time https://phabricator.wikimedia.org/project/view/1118/ [01:57:17] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:57:17] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=ulsfo https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [01:57:29] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:57:45] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs1013 is CRITICAL: 5441 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1013&var-datasource=eqiad+prometheus/ops [01:57:48] PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:57:53] PROBLEM - LVS HTTP IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:57:53] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:57:53] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:57:53] RECOVERY - NTP peers on dns4002 is OK: NTP OK: Offset -0.00051 secs https://wikitech.wikimedia.org/wiki/NTP [01:58:03] RECOVERY - LVS HTTP IPv4 #page on ncredir-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 1.021 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:58:09] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:58:12] RECOVERY - LVS HTTPS IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15773 bytes in 1.301 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:58:13] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:58:14] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [01:58:32] PROBLEM - LVS HTTP IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:58:39] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:58:39] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:58:39] RECOVERY - Juniper alarms on mr1-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [01:59:05] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:59:16] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:59:28] RECOVERY - LVS HTTP IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 3.062 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:59:41] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:59:53] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 23.27 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:00:04] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:00:04] RECOVERY - LVS HTTP IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:00:04] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:00:04] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:00:05] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:00:11] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [02:00:13] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:01:06] RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 1.864 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:01:32] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs1013 is CRITICAL: 9388 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1013&var-datasource=eqiad+prometheus/ops [02:01:32] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 714 days) https://wikitech.wikimedia.org/wiki/Logs [02:01:32] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:01:41] PROBLEM - NTP peers on dns4001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [02:02:19] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [02:03:07] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [02:03:12] 08Warning Alert for device cr2-eqiad.wikimedia.org - Processor usage over 85% [02:03:19] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 165.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:03:19] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [02:03:23] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [02:03:37] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [02:04:14] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80% [02:04:38] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [02:04:56] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [02:07:22] PROBLEM - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is CRITICAL: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook [02:07:22] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:07:22] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:07:22] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 80%, RTA = 84.17 ms [02:07:22] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: [02:07:23] /mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest [02:07:23] g/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed ou [02:09:18] still under DDoS ? [02:09:38] PROBLEM - LVS HTTP IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:09:43] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:09:55] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:09:55] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [02:10:01] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: No response from remote host 208.80.154.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:10:01] rxy, Yes, ops are working on it [02:10:07] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [02:10:14] thanks [02:10:15] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:10:18] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:10:25] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:10:29] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:10:31] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:10:34] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:10:36] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 0.513 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:10:39] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3043.esams.wmnet, cp3042.esams.wmnet, cp3033.esams.wmnet, cp3041.esams.wmnet are marked down but pooled: textlb_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3042.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3032.esams.wmnet, cp3042.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: t [02:10:39] cp3043.esams.wmnet, cp3032.esams.wmnet, cp3033.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:10:41] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:10:45] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:10:47] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:11:01] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:11:14] RECOVERY - LVS HTTP IPv6 #page on ncredir-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 6.739 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:11:16] PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:11:19] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:11:21] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:11:21] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:11:21] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:11:23] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:11:25] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:11:33] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [02:11:33] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:11:33] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:11:35] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [02:11:38] PROBLEM - LVS HTTP IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:11:38] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:11:50] PROBLEM - LVS HTTPS IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:11:50] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:11:55] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 17.91 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:11:59] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:11:59] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:12:04] PROBLEM - LVS HTTP IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:12:07] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:12:20] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% [02:12:35] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:12:37] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [02:12:41] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [02:12:44] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:12:50] RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.641 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:12:51] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [02:12:57] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:12:57] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:13:01] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:13:01] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [02:13:15] RECOVERY - LVS HTTP IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.488 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:13:16] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitec [02:13:16] iki/RESTBase [02:13:17] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 81.29 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:13:24] RECOVERY - LVS HTTPS IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15822 bytes in 1.305 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:13:38] RECOVERY - LVS HTTP IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.489 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:13:47] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:14:03] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:14:28] RECOVERY - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15835 bytes in 1.344 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:14:55] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:14:55] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:14:59] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:15:13] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:15:13] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:15:15] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:15:19] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:15:56] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 3.193 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:16:01] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:16:01] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:16:05] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:16:10] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15798 bytes in 0.503 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:16:14] 04Critical Alert for device mr1-ulsfo.wikimedia.org - Primary outbound port utilisation over 80% [02:16:29] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:16:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 76.06 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:17:21] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=ulsfo https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [02:19:02] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:19:03] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:19:25] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs1013 is OK: (C)3200 ge (W)1600 ge 639.5 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs1013&var-datasource=eqiad+prometheus/ops [02:20:07] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [02:20:13] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:20:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:20:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:21:09] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 84.17 ms [02:21:19] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:21:27] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:21:51] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:22:14] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 3.550 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:22:15] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 3.675 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:22:21] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% [02:22:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:22:35] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:23:07] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:23:15] 04̶C̶r̶i̶t̶i̶c̶a̶l Device mr1-ulsfo.wikimedia.org recovered from Primary outbound port utilisation over 80% [02:23:26] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% [02:23:31] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:23:37] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:23:49] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:23:49] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:24:07] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:24:11] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:24:15] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:24:18] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:24:37] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:26:04] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:26:13] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /ap [02:26:13] bile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest [02:26:13] ections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:26:20] 04Critical Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% [02:26:47] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:26:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:27:17] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:27:23] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:27:32] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:27:34] RECOVERY - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15834 bytes in 1.360 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:28:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:28:47] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 6 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:29:29] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:30:05] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [02:31:39] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 84.20 ms [02:32:34] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:32:47] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:32:51] PROBLEM - BFD status on cr2-eqord is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:33:14] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80% [02:33:26] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [02:33:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:33:37] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [02:33:37] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page [02:33:37] et rev by title from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was receiv [02:33:37] feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:33:39] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_443: Servers cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:33:43] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [02:33:59] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:34:03] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:34:04] RECOVERY - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15835 bytes in 1.375 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:34:23] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:34:25] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:34:29] RECOVERY - BFD status on cr2-eqord is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:35:13] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:35:13] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:35:19] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp3042.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:36:55] RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:38:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:38:38] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:40:15] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 24 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:40:33] RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [02:42:13] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:42:17] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:43:11] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [02:43:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:43:25] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:43:43] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:52:12] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-eqiad.wikimedia.org recovered from Processor usage over 85% [03:03:49] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [03:04:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:05:25] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [03:05:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:43:55] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [03:46:53] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 120 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:47:01] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 56 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [03:59:25] PROBLEM - Host blog.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [04:01:01] RECOVERY - Host blog.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [04:03:39] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 31 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [04:09:03] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [04:12:05] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 56 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [04:16:59] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [04:18:27] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 56 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [04:21:05] RECOVERY - pybal on lvs3003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [04:21:51] RECOVERY - PyBal backends health check on lvs3003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:26:11] RECOVERY - PyBal connections to etcd on lvs3003 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [05:54:49] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 56.97 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:56:25] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.99 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:19:37] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [06:21:07] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 56 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [06:37:01] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [06:40:05] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 56 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [06:51:17] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [06:52:45] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 56 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:15:09] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:16:37] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 55 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:29:23] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:34:03] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 55 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:48:21] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:49:51] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 55 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [08:56:17] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [09:02:35] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 55 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [09:21:43] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [09:23:11] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 55 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:03:16] <_joe_> !log stopping HHVM on mw1317; then clean the bytecode cache [11:08:05] RECOVERY - HHVM rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 75715 bytes in 0.607 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:09:03] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:09:17] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:10:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:56:34] (03Abandoned) 10CDanis: move FR/GB/RU traffic away from esams [dns] - 10https://gerrit.wikimedia.org/r/534866 (owner: 10CDanis) [13:58:10] Hey. I had to reset my phone and I lost my 2FA scratch codes. and now I can't get into my account, I was told to contact you here and wait, I have commited idenity set up but it is from before I tunred on 2FA. Thanks [13:59:32] Looks like the blogpost about the DoS is now making it to hn [14:00:11] hn? [14:00:27] hackernews [14:00:37] ah [14:02:03] https://news.ycombinator.com/item?id=20902399 [14:02:03] https://news.ycombinator.com/item?id=20902399 [14:02:26] Currently #3 [14:11:38] <_joe_> a precious thread, as usual on news.ycombinator [14:14:50] Some comments are nice [14:15:11] "Just want to mention, WMF has a very small but elite team of engineers. Amazed they maintain an Alexa top 5 site with many orders of magnitude less engineering staff than Facebook or Reddit. I think they must count ~100 engineers?" [14:20:33] LakesideMiners: you need to make a task on phabricator requesting the removal and then email ca@wikimedia.org from the email account registered to your wiki account referencing the phab ticket # [14:20:49] okay [14:25:50] done. thanks for pointing me there. [15:17:39] cdanis: Regarding https://phabricator.wikimedia.org/T232250 . Still timing out [15:18:42] multichill: hey! [15:18:49] traceroute and source IP if possible :) [15:20:36] (03PS1) 10Ayounsi: Change FNM notify email to noc@ [puppet] - 10https://gerrit.wikimedia.org/r/534918 [15:20:39] (03PS1) 10CDanis: excessive LVS RX alert: page [puppet] - 10https://gerrit.wikimedia.org/r/534919 [15:22:16] (03CR) 10Filippo Giunchedi: [C: 03+1] excessive LVS RX alert: page [puppet] - 10https://gerrit.wikimedia.org/r/534919 (owner: 10CDanis) [15:22:37] (03CR) 10Ayounsi: [C: 03+2] Change FNM notify email to noc@ [puppet] - 10https://gerrit.wikimedia.org/r/534918 (owner: 10Ayounsi) [15:24:05] (03CR) 10CDanis: [C: 03+2] excessive LVS RX alert: page [puppet] - 10https://gerrit.wikimedia.org/r/534919 (owner: 10CDanis) [15:24:38] (03PS2) 10Ayounsi: Change FNM notify email to noc@ [puppet] - 10https://gerrit.wikimedia.org/r/534918 [15:25:01] paravoid: Shared in private message. I seem to be coming in through cloudflare [15:25:37] For ipv4, for ipv6 I'm using abovenet/zayo [15:47:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:48:46] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:02:56] (03PS1) 10Filippo Giunchedi: graphite: fix mediawiki alerts dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/534921 [16:07:53] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: fix mediawiki alerts dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/534921 (owner: 10Filippo Giunchedi) [16:16:54] I’m not sure what the current status of the DoS is… is it expected that some people still have connectivity issues, or should they follow https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue ? [16:17:31] lucaswerkmeister: yeah, please follow that [16:17:37] ok thanks :) [16:17:50] lucaswerkmeister: please subscribe myself to the task you create :) [16:17:51] thanks [16:18:03] (it’s not actually me that has the issue, asking for a friend ^^) [16:18:07] (but will do) [16:20:55] cheers [16:26:17] <_joe_> even better [16:26:26] <_joe_> tag the task #operations [16:26:29] <_joe_> so we all see it [16:32:00] (#operations is automagically added when following https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue :) [18:24:39] hugops for folks who are doing / did DDoS mitigation! [18:24:58] yuvipanda!!! <3 [18:25:17] Aye! Good work! [18:34:42] oh hey yuvipanda, nice to see you [19:14:33] <_joe_> ah damn i missed yuvi [20:18:33] (03PS24) 10Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) [20:26:15] (03PS8) 10Zoranzoki21: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) [20:32:34] (03PS4) 10Zoranzoki21: Change configuration of AbuseFilter extension for enwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533747 (https://phabricator.wikimedia.org/T231750) [20:32:43] (03CR) 10Zoranzoki21: "Status of this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533747 (https://phabricator.wikimedia.org/T231750) (owner: 10Zoranzoki21) [20:32:53] (03PS4) 10Zoranzoki21: Set noindex for user and user_talk on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) [22:29:13] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 23774 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [22:46:35] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [22:48:11] 08Warning Alert for device cr1-eqiad.wikimedia.org - Memory over 85% [23:43:25] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [23:44:59] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [23:56:09] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps