[00:00:20] PROBLEM - Maps - OSM synchronization lag - codfw on einsteinium is CRITICAL: CRITICAL - maps-osm-sync-lag-codf is 172814 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [00:01:00] PROBLEM - Maps - OSM synchronization lag - eqiad on einsteinium is CRITICAL: CRITICAL - maps-osm-sync-lag-eqiad is 172854 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [02:42:50] RECOVERY - Maps - OSM synchronization lag - codfw on einsteinium is OK: OK - maps-osm-sync-lag-codf is 9762 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [02:44:20] RECOVERY - Maps - OSM synchronization lag - eqiad on einsteinium is OK: OK - maps-osm-sync-lag-eqiad is 9853 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [03:29:11] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 735.16 seconds [04:03:30] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 614.01 seconds [04:35:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.17 seconds [04:49:51] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.96 seconds [05:32:20] PROBLEM - MariaDB Slave Lag: s4 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 504.77 seconds [05:32:23] PROBLEM - MariaDB Slave Lag: s4 on db1064 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 506.55 seconds [05:33:37] <_joe_> uhm [05:33:57] <_joe_> 2 servers in s4 [05:37:38] <_joe_> well, one, db1102 is dependent on db1064 [05:37:52] <_joe_> and it's recovering anyways [05:38:30] RECOVERY - MariaDB Slave Lag: s4 on db1102 is OK: OK slave_sql_lag Replication lag: 5.55 seconds [05:38:33] RECOVERY - MariaDB Slave Lag: s4 on db1064 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [05:38:50] <_joe_> ok, back to bed I guess [06:04:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:04:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [06:14:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:14:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:02:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:04:00] PROBLEM - Text HTTP 5xx reqs/min on 
graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:25:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:26:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:35:30] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.78 ms [07:42:11] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [07:58:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:58:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:06:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:06:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:43:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:43:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:51:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:52:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:05:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:06:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:06:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:14:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:16:21] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:16:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:24:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:26:30] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:26:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:27:21] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [10:35:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:36:30] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [10:36:31] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:36:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:52:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:53:13] _joe_: does not seem to get better [10:53:30] (nobody in ops clinic?) [10:53:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:53:45] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2807 bytes in 1.699 second response time [10:54:05] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2807 bytes in 0.541 second response time [10:54:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [10:54:54] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2789 bytes in 1.951 second response time [10:54:56] <_joe_> hey [10:55:03] what's up? [10:55:06] <_joe_> getting to the pc now, 1 min [10:55:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:55:11] * volans checking [10:55:12] <_joe_> volans: no idea, a ton of 5xx [10:55:52] hey, got paged [10:56:04] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17452 bytes in 1.832 second response time [10:56:08] akosiaris_: yeah we don't know, started looking now [10:56:10] I got a page for ulsfo [10:56:15] high rate of errors before that [10:56:18] <_joe_> everything was apparently down [10:56:24] PROBLEM - LVS HTTPS IPv4 on text-lb.codfw.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2720 bytes in 0.222 second response time [10:56:28] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2736 bytes in 0.510 second response time [10:56:32] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2772 bytes in 0.542 second response time [10:56:45] hey, I'm here too [10:56:55] <_joe_> can someone go to oxygen and try to understand what's up? 
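The "X% of data above the critical threshold" alerts above come from a graphite-backed check: it samples the recent 5xx-per-minute series and goes CRITICAL when a given fraction of datapoints exceeds 1000/min (recovering once less than 1% is above 250/min). A minimal Python sketch of that logic, assuming a hypothetical metric path and only the standard graphite render API:

```python
# Minimal sketch of the "X% of data above the critical threshold" logic behind
# the Esams/Text HTTP 5xx alerts above. The metric path and thresholds are
# illustrative assumptions, not the exact check_graphite configuration.
import json
import urllib.request

GRAPHITE = "https://graphite.wikimedia.org/render"
METRIC = "reqstats.5xx"          # hypothetical series name
CRITICAL = 1000.0                # 5xx reqs/min, as shown in the alert text
CRIT_FRACTION = 0.10             # alert when >=10% of recent points exceed it

def check_5xx(window="-10min"):
    url = f"{GRAPHITE}/?target={METRIC}&from={window}&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not points:
        return "UNKNOWN: no data"
    above = sum(1 for v in points if v > CRITICAL)
    frac = above / len(points)
    if frac >= CRIT_FRACTION:
        return f"CRITICAL: {frac:.2%} of data above {CRITICAL}"
    return f"OK: less than {CRIT_FRACTION:.0%} above the threshold"

if __name__ == "__main__":
    print(check_5xx())
```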
[10:57:05] <_joe_> I don't see errors navigating the sites myself [10:57:06] was about to cry but seems icinga was faster [10:57:08] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2719 bytes in 0.510 second response time [10:57:08] 20K errors/min [10:57:11] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17447 bytes in 1.310 second response time [10:57:27] LBs on grafana seems ok: https://grafana.wikimedia.org/dashboard/db/load-balancers?orgId=1&from=now-1h&to=now [10:57:57] what's up? [10:58:04] Request from IP via cp1068 cp1068, Varnish XID 975995491 [10:58:04] Error: 503, Backend fetch failed at Sat, 10 Mar 2018 10:57:24 GMT in case someone wants it [10:58:15] bblack: 5xx on text and misc, no upload [10:58:17] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17399 bytes in 0.507 second response time [10:58:19] <_joe_> bblack: various sites dow [10:58:19] same thing here [10:58:19] I get them as well [10:58:24] https://de.wikipedia.org/w/index.php?title=Aerifizieren&action=history for example [10:58:26] <_joe_> *down [10:58:34] I was on enwiki [10:58:42] <_joe_> I'm checking fatalmonitor [10:58:44] it appears and disappears randomly [10:58:47] RECOVERY - LVS HTTPS IPv4 on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17397 bytes in 0.299 second response time [10:58:55] <_joe_> but the fact it's so widespread seems to point to varnish [10:58:59] looks like text 5xx rates were flapping for some hours before these pages though [10:58:59] https://en.wikipedia.org/wiki/Special:Contributions/-revi just down [10:59:00] hey, just got a lot of pages. I'm around [10:59:05] https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-aggregate-client-status-code?orgId=1&from=now-30m&to=now&var-site=eqiad&var-site=codfw&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-cache_type=varnish-misc&var-status_type=5 [10:59:20] those are the 5xx on text + misc [10:59:28] misc are lower numbers so seems zero [10:59:30] <_joe_> yeah fatalmonitor doesn't show errors at the mediawiki level [10:59:32] but has the same pattern [10:59:32] so since around 10 minutes [10:59:48] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17398 bytes in 0.520 second response time [10:59:57] I see errors on text, on search [11:00:00] *about [11:00:01] also text-lb.eqiad didn't page (yet) [11:00:04] no pattern [11:00:14] (yet that I can see) [11:00:23] <_joe_> ok, so it's not mediawiki [11:00:29] <_joe_> I'm pretty sure about that [11:00:42] I think it slowed down now [11:00:55] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=1520659341314&to=1520679637659 [11:00:58] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17451 bytes in 0.506 second response time [11:01:02] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17450 bytes in 0.505 second response time [11:01:08] no, we're just in the lull in the pattern of increasing spikes [11:02:21] <_joe_> I see some 503s, but not many right now on oxygen [11:02:34] before it was cp3030 [11:02:57] lots of connections there [11:02:57] <_joe_> it's all over the place [11:03:05] I think now it is all [11:03:23]
https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches [11:03:26] it's probably not site-specific when it's hitting all anyways [11:04:14] <_joe_> yes [11:04:16] just NOT eqiad, interestingly [11:04:16] jynus: I was about to link the same, yeah [11:04:18] most likely the problem truly exists either in eqiad caches, or behind varnish [11:04:24] yeah [11:04:25] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [11:04:26] <_joe_> it's gone now [11:04:33] it's not gone, it will most likely come back [11:04:33] <_joe_> no I don't think it's mediawiki tbh [11:04:34] should we do a preventive restart? [11:04:39] <_joe_> it's mostly 503s [11:04:41] on 1 cache [11:04:44] *of [11:04:49] cp1068 [11:04:56] (a) apps can cause 503s and (b) MW isn't the only app [11:05:26] (c) the spikes are happening with a frequency and increasing severity, and from the eqiad pov move to a new CP machine each time: [11:05:29] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&from=1520675003008&to=1520679820862 [11:05:47] <_joe_> bblack: ok, I don't think it's mediawiki throwing fatals [11:06:22] we're in the ~20 minute low spot in the pattern, there's probably another spike coming [11:06:29] <_joe_> it started again [11:06:42] ? [11:06:46] <_joe_> cp1055 [11:06:53] for what I see, the user impact is some, but not as large as others in the past [11:07:12] <_joe_> all requests throwing 5xx atm are going through cp1055 [11:07:29] in the most recent spike [11:07:33] <_joe_> seen with tail -f 5xx.json | grep -v upload.wikimedia.org | jq .x_cache [11:07:37] it was 1068 [11:07:41] <_joe_> ok [11:07:48] 1055 was the previous spike [11:07:53] _joe_: I think uploads weren't too affected [11:07:55] before that it was 1053 [11:08:00] each spike gets bigger than the last [11:08:06] <_joe_> jynus: hence the removal [11:08:16] oh, sorry, didn't see the -v [11:08:21] <_joe_> bblack: any idea what's happening?
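_joe_'s `tail -f 5xx.json | grep -v upload.wikimedia.org | jq .x_cache` is how each spike got pinned to a single cp host. A rough Python equivalent that tallies the cp hosts seen in the X-Cache chains of recent 5xx entries; the `x_cache` and `uri_host` field names are assumptions based on that jq/grep usage:

```python
# Tally which cache host dominates the X-Cache chains of recent 5xx entries.
# Field names (x_cache, uri_host) are assumptions based on the jq usage above.
import json
import sys
from collections import Counter

def tally_x_cache(stream, limit=20000):
    hosts = Counter()
    for i, line in enumerate(stream):
        if i >= limit:
            break
        try:
            entry = json.loads(line)
        except ValueError:
            continue
        if "upload.wikimedia.org" in entry.get("uri_host", ""):
            continue  # mirrors the `grep -v upload.wikimedia.org`
        # X-Cache looks like "cp1068 miss, cp3030 pass, ..."; count each cp host
        for token in entry.get("x_cache", "").split():
            if token.startswith("cp"):
                hosts[token.rstrip(",")] += 1
    return hosts

if __name__ == "__main__":
    for host, count in tally_x_cache(sys.stdin).most_common(5):
        print(f"{count:8d}  {host}")
```

Fed with something like `tail -n 20000 5xx.json | python3 tally_x_cache.py` (hypothetical filename), the dominant host of the current spike (cp1053, cp1055, cp1068 in the conversation above) would show up at the top.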
[11:10:13] spikes 1 per hour or less [11:10:23] in the last 6 hours, with different error rate [11:10:55] no idea yet [11:11:43] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:19:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:19:53] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [11:20:08] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1520678919286&to=1520679820862&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All [11:20:29] ^ there's the last wave of 5xx from esams pov, but no correlating ramp up to ~12K backend connections in the graph below it [11:20:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:21:30] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [11:21:56] ^ this shows the eqiad pov, where we can see the actual backend conn pileups in the graph below it, but the limits seem awful low and they never plateau [11:22:18] (~2K conns and no flatness during the peak of it) [11:22:48] but they plateau in the per-host one on cp1068 at 1k [11:23:53] bblack: during cp1068's peak, expiry mailbox lag hit 150K [11:24:44] also spikes during cp1055's peak [11:25:29] yes [11:25:50] https://grafana.wikimedia.org/dashboard/db/varnish-mailbox-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [11:26:00] but we've come to understand that our "expiry mailbox lag" invention is something that correlates with almost any problem from anywhere, it's like a gauge for "varnish isn't happy" [11:32:38] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2786 bytes in 0.542 second response time [11:33:35] cp1068 a second time [11:33:38] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17449 bytes in 0.503 second response time [11:33:39] <_joe_> yes [11:33:57] 1055 had follow-on patterns as well [11:34:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:34:13] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [11:35:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:35:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:40:57] 10Operations, 10UniversalLanguageSelector, 10I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#4040125 (10Aklapper) > Afterwards, fonts appeared as they used to appear before June 2016 @mehtab.ahmed: Is that good? is that bad? And what does that mean for this task? [11:51:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:55:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:56:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:57:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:07:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:08:02] Error: 503, Backend fetch failed at Sat, 10 Mar 2018 13:07:45 GMT [13:08:28] ongoing problem, Hauskatze, see topic :p [13:08:38] ack, thx [13:10:13] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:10:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:11:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:12:48] another spike, this one through cp1053 [13:13:04] it's already over now, the icinga 5xx alerts tend to lag behind reality [13:18:50] disk i/o rate doesn't have any anomaly until ~13:10:32, obj expiration rate dumps off at 13:05:32 [13:19:48] it's semi-interesting that the *frontend* instance shows a marked dropoff in object expiration rate at 13:02:32, which is way ahead of other things so far [13:20:38] backend req rate: 13:07:32 before anything significant [13:21:05] backend connection count shows signs at 13:05:32 [13:21:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:21:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:22:14] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:22:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:26:57] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2772 bytes in 1.887 second response time [13:27:48] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2787 bytes in 0.540 second response time [13:28:28] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2804 bytes in 1.874 second response time [13:29:07] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17449 bytes in 1.502 second response time [13:29:23] !log cp1068/cp1055: restart varnish-be [13:29:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:37] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17450 bytes in 1.537 second response time [13:29:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:29:58] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17451 bytes in 0.498 second response 
time [13:30:44] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:30:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:31:24] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [13:32:58] (03PS1) 10Ema: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 [13:33:28] (03CR) 10jerkins-bot: [V: 04-1] varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [13:34:36] so this is ongoing indeed. got paged again [13:35:23] it's being handled, you can go back to whatever you were doing unless you're bored [13:35:50] 😊 [13:36:03] ack, thx [13:36:04] thanks! [13:38:49] (03PS2) 10Ema: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 [13:40:26] cp3046 had mailbox spike [13:40:29] might need a restart [13:41:33] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [13:41:37] that's cache_upload, unrelated to whatever's going on in the text world [13:41:54] (but also, thin spikes tend not to mean long-term problems. ramps do tend to, though) [13:42:09] see how these eqiad text ones look more rampy than spiky? [13:42:11] https://grafana.wikimedia.org/dashboard/db/varnish-mailbox-lag?orgId=1&from=1520670072323&to=1520681224598&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [13:42:43] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:42:53] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:42:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:43:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:43:58] Upgrading? 
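The "expiry mailbox lag" being watched above appears to be derived from two varnishstat counters: objects handed to the expiry thread minus objects it has processed. A sketch of reading it on a cache host before deciding on a varnish-be restart; the exact derivation and the 100K threshold are assumptions for illustration, not the actual policy:

```python
# Compute "expiry mailbox lag" from varnishstat counters on a cache host:
# MAIN.exp_mailed (objects mailed to the expiry thread) minus
# MAIN.exp_received (objects the expiry thread has processed).
import json
import subprocess

def mailbox_lag():
    out = subprocess.check_output(["varnishstat", "-j"])
    stats = json.loads(out)
    mailed = stats["MAIN.exp_mailed"]["value"]
    received = stats["MAIN.exp_received"]["value"]
    return mailed - received

if __name__ == "__main__":
    lag = mailbox_lag()
    print(f"expiry mailbox lag: {lag}")
    if lag > 100_000:
        print("sustained lag this high is the kind of ramp that preceded the varnish-be restarts above")
```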
[13:44:00] "Request from 88.97.96.89 via cp1068 cp1068, Varnish XID 196378886 [13:44:01] Error: 503, Backend fetch failed at Sat, 10 Mar 2018 13:25:42 GMT" [13:44:52] ShakespeareFan02 that's known, see topic [13:44:53] no see topic [13:54:26] Thanks [13:55:43] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms [14:12:44] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:28:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:36:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:46:51] (03CR) 10Urbanecm: "Thanks, fixed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [14:46:53] (03PS5) 10Urbanecm: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) [14:47:02] (03PS6) 10Urbanecm: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) [14:47:51] (03PS4) 10Urbanecm: Add romd.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/412898 (https://phabricator.wikimedia.org/T187184) [14:47:53] (03PS2) 10Urbanecm: Add romd.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/412896 (https://phabricator.wikimedia.org/T187184) [14:56:01] !log cp1053: restart varnish-be [14:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:52] (03PS4) 10MarcoAurelio: Disable abusefilter from collecting private data on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) [16:17:40] (03CR) 10MarcoAurelio: Disable abusefilter from collecting private data on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [16:27:17] (03CR) 10Huji: [C: 031] Disable abusefilter from collecting private data on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [16:43:13] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:13] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:13] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:13] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:13] PROBLEM - 
aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:13] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:23] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [16:43:44] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet are marked down but pooled [16:43:44] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:44:04] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [16:45:30] elukey ^^ [16:46:13] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [16:46:42] huh [16:46:46] nothing obvious in syslog on a random aqs node [16:47:13] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [16:47:13] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet]) [16:47:14] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [16:47:53] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [16:48:03] or druid [16:48:04] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1053 bytes in 0.002 second response time [16:48:04] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [16:48:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [16:48:50] and I don't see any processes having restarted around that point in time either [16:49:04] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [16:49:08] must be something up in the complexity layers above OS-level stuff. 
I donno cassandra or java stuff or whatever [16:49:13] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [16:49:16] or nodejs as the case may be [16:49:50] * bblack puts on his dumb sysadmin hat and points at other people [16:51:38] I'm not gone, just reading wikitech aqs page :-/ [16:53:49] got nothin out of atop on random node (specifically aqs1008) [16:54:34] the aqs endpoint alerts seem to be clearing on their own [16:54:47] somewhat [16:54:54] druid not so much, the pybal thing [16:57:08] there is absolutely no point in me digging through these debug logs for cassandra, they are firehoses [16:57:09] meh [16:57:13] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [16:57:13] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [16:57:14] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [16:57:33] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [16:58:03] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [16:58:19] (03PS1) 10Jcrespo: Add support to global query execution limit [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/418593 [16:58:49] oh look it fixed itself, yay [16:59:03] well good [16:59:15] because me looking at these also firehose gc logs wasn't getting anywhere either [16:59:44] there's probably (I hope?) some dashboard somewhere with the right stuff but no idea where [17:01:13] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [17:02:04] https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-gc?refresh=1m&orgId=1 found this, not so enlightening [17:25:07] (03PS1) 10Jcrespo: Add script for dumping and recovering database sections [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/418595 [17:38:18] hello!
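For reference, the "aqs endpoints health" checks that timed out at 16:43 probe the edits/per-page route shown in the alert text. A sketch of reproducing that probe against a single AQS node; the host, port and sample parameters are assumptions for illustration, only the route template comes from the alerts:

```python
# Rough reproduction of the "aqs endpoints health" probe that was timing out.
# Host, port and the sample page/date values are assumptions; only the route
# template comes from the alert text above.
import urllib.request

AQS = "http://aqs1004.eqiad.wmnet:7232"   # assumed internal host:port
ROUTE = ("/analytics.wikimedia.org/v1/edits/per-page/"
         "en.wikipedia.org/0/all-editor-types/daily/20180101/20180301")

def probe(timeout=10):
    try:
        with urllib.request.urlopen(AQS + ROUTE, timeout=timeout) as resp:
            return f"OK: HTTP {resp.status}"
    except Exception as exc:   # timeouts surface here, like in the icinga check
        return f"CRITICAL: {exc}"

if __name__ == "__main__":
    print(probe())
```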
[17:38:22] hey hey [17:38:29] hi [17:38:29] just seen the alarms sorry I wasn't online before :( [17:38:34] well the aqs whatever it was, is ok now [17:38:40] same here, i just came on IRC for something [17:38:49] but if you wanted to look just to tell us what the problem was, :-D or not [17:38:55] I mean since it is all ok [17:40:17] so AQS sits in front of Druid to power an API that returns metrics to wikistats 2, as starter :) [17:40:37] so the AQS endpoints health should be a side effect of Druid misbehaving probably [17:40:53] the AQS failures were only for /analytics.wikimedia.org/v1/edits/per-page [17:41:56] second step is checking the Druid dashboard [17:41:57] https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All&from=now-3h&to=now [17:42:18] (there are two clusters, "public" is the one that is behind AQS, "analytics" is the one behind pivot) [17:43:47] I can see a jump in req/s around the time of the first AQS alert [17:44:06] and also the broker response cache jumping up as well, as a consequence [17:46:01] druid was load tested by Dan in my team and we didn't get into this case, but I think that big queries to druid that are not yet cached might not be super efficient [17:46:06] causing things like these [17:47:59] the query count per second, I only see a little blip there around 16:40 or whatever, the big leap is later like a few minutes ago [17:48:02] if I'm looking at the right graph [17:49:03] well we jumped from 0 to 27/s, not a blip if the queries were big and not cached [17:49:15] I am talking about https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=43&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All&from=now-3h&to=now [17:49:42] big jump was at ~16:41 that matches timing wise [17:50:22] I am seeing timeouts in logs, yeah I am pretty sure this is what happened [17:51:31] huh [17:51:33] ah snap sorry there are multiple query counts [17:51:51] ok now i have to ask why the one at 17.33 didn't do anything to us [17:52:03] I think due to caching [17:52:14] huh [17:52:23] so druid runs multiple daemons, two of them are called "Broker" and "historical" [17:52:59] the broker is the one that answers to clients queries, and queries the "historical" daemons on all druid hosts for the data contained in the druid segments [17:53:23] ok [17:53:56] the broker might need to get data from multiple of them, aggregate/etc.. and then return it to the client [17:54:24] the response is then cached so subsequent queries don't hit the historicals [17:54:38] they might need to load data from HDFS if needed [17:55:55] ok, got it [17:55:57] and only yesterday Joseph was able to compute the february's data for mw-history using the hadoop cluster ( we had some issues) [17:56:19] but these spikes are for the historical one? [17:56:51] sorry for these very basic questions but it's the first time I look at aqs [17:57:21] nono I am trying to reason out loud about what happened to see if it makes sense, a brain bounce is always good :) [17:58:05] ok!
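The edits lookups that AQS forwards to the public Druid broker (port 8082, as in the PyBal alerts) are ordinary Druid native queries. A sketch of what such a query might look like; the broker host and datasource name are assumptions, while the query shape is the standard Druid "timeseries" form:

```python
# Sketch of the kind of native query AQS could send to the public Druid broker.
# Broker host and datasource name are assumptions for illustration.
import json
import urllib.request

BROKER = "http://druid1004.eqiad.wmnet:8082/druid/v2/"
query = {
    "queryType": "timeseries",
    "dataSource": "mediawiki_history_reduced",   # assumed datasource name
    "granularity": "day",
    "intervals": ["2018-01-01/2018-03-01"],
    "filter": {"type": "selector", "dimension": "project", "value": "en.wikipedia"},
    "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "events"}],
}

req = urllib.request.Request(
    BROKER,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    # An uncached query like this fans out to the historicals; repeating it
    # should be answered from the broker's result cache.
    print(json.load(resp)[:3])
```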
[17:58:07] so from what I know one query to the broker can trigger multiple queries to the historicals, that may need to load data from HDFS and load them into page cache [17:58:35] once this happens (and mw-history fits in memory iirc) then it is super fast [17:58:40] but once they are there then no problems [17:58:41] ok [17:58:45] yeah exactly [17:58:55] so the first queries hit hard the historicals [17:58:59] ok [17:59:04] and we should see that on this graph [17:59:13] then when all the cache layers were in place even a big spike didn't cause any flinch to druid [17:59:21] that I don't understand [17:59:32] unless this shows somehow the 'front end' requests only [18:00:03] the graph, I mean. I'm not clear which requests it actually shows [18:01:30] in the prometheus dashboard we should have metrics for the broker, historical and coordinator (not important in this case) [18:01:46] it should be indicated with labels on the grafana rows [18:02:01] at 16:40 I can see the historicals being hit [18:04:32] (I am reading historical logs on druid1004 - /var/log/druid/historical.log) [18:05:32] ok [18:12:20] apergos: this is another good broker metric [18:12:21] https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=16&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All&from=now-3h&to=now [18:12:23] ok now that I see we're looking at the broker requests, it's making sense [18:13:13] the spike indicates a bunch of requests coming in for something not yet cached? [18:13:39] ah right and cache misses line up right with that [18:13:47] yep exactly [18:13:54] causing overload and timeouts [18:13:57] ok! I am believing this story [18:14:03] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:05] soooo.... the next time we see this [18:14:29] no no no, really not interested in looking at the jobrunners too [18:14:37] (03CR) 10Ema: "As discussed on IRC, a potential issue here would be excessive logging by varnish-fe-slowreqs caused by slow clients. However, 5 minutes o" [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [18:14:45] I am going to open a task (probably tomorrow) to track this, I'll follow up with my team [18:15:03] PROBLEM - Nginx local proxy to apache on mw1305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:11] ok, so we sort of cross our fingers and hope for quick recovery (assuming the graphs tell the same story) if it happens again, yes? [18:15:26] and look for you if you are around! [18:15:38] (03PS1) 10Dzahn: temp use bast-test.eqiad.wmnet for wmf4727 [dns] - 10https://gerrit.wikimedia.org/r/418598 (https://phabricator.wikimedia.org/T186623) [18:15:53] RECOVERY - Nginx local proxy to apache on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.006 second response time [18:15:53] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [18:16:13] meh just about to hop over there and it's back [18:16:13] apergos: feel free to call me if you need me, but for the moment the API only powers the wikistats 2 beta website [18:16:20] ok!
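The explanation above (the broker fans out to the historicals on a cache miss, then serves identical queries from its result cache) is why the cold 16:40 burst caused timeouts while the 17:33 one did not. A toy model of that behaviour, with made-up numbers purely for illustration:

```python
# Toy illustration of the broker/historical behaviour described above: the
# first time a query arrives the broker fans out to the historicals (slow,
# possibly pulling segments into page cache); identical queries afterwards
# are served from the broker's result cache.
import time

result_cache = {}

def query_historicals(q):
    time.sleep(0.5)                 # pretend fan-out + segment scan is expensive
    return f"result({q})"

def broker(q):
    if q not in result_cache:       # cache miss -> hit the historicals
        result_cache[q] = query_historicals(q)
    return result_cache[q]

if __name__ == "__main__":
    for label in ("first burst (cold)", "second burst (warm)"):
        start = time.time()
        for _ in range(20):
            broker("edits per-page enwiki 2018-01/2018-03")
        print(f"{label}: {time.time() - start:.2f}s for 20 identical queries")
```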
[18:16:25] super :) [18:16:38] thank you very much for the walkthrough [18:16:49] the HHVM alert is self-fixing because of the automatic restart script [18:17:18] the jobrunner alert just happened to be one of the others on mw1305 [18:17:42] yeah I figured it was hhvm [18:17:54] Mar 10 18:15:13 mw1305 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV [18:18:01] not super great though :D [18:18:04] no [18:18:21] very low in the list of super great things [18:18:25] someday we'll be on php7 with a whole new host of problems [18:18:52] yes, but also common. and yes :p [18:18:55] I forgot about that auto restart though [18:19:54] isn't it systemd restarting hhvm after it dies? [18:20:13] the restart script should only restart hhvm when eligible [18:20:19] via cron [18:22:19] I have no idea unless I go look [18:22:23] * apergos pokes around in puppet [18:24:41] yeah it has Restart=always, indeed [18:52:40] (03CR) 10Dzahn: [C: 032] temp use bast-test.eqiad.wmnet for wmf4727 [dns] - 10https://gerrit.wikimedia.org/r/418598 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [18:55:44] (03PS1) 10Dzahn: DHCP: test wmf4727 as bast-test.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/418600 (https://phabricator.wikimedia.org/T186623) [19:03:04] (03PS2) 10Dzahn: DHCP: test wmf4727 as bast-test.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/418600 (https://phabricator.wikimedia.org/T186623) [19:03:35] (03CR) 10Dzahn: [C: 032] "as a test because the switch port isn't configured for public VLAN" [puppet] - 10https://gerrit.wikimedia.org/r/418600 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [19:12:19] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4040520 (10Dzahn) when using wmf4727 and testing stretch install: ``` No kernel modules were found. This probably is due to a mismatch │ │ between the kernel used by this version of the i...
[19:55:24] (03PS1) 10Ema: varnishslowlog: add fetch overhead introduced by varnish [puppet] - 10https://gerrit.wikimedia.org/r/418603 [20:11:57] ibugs> (PS1) Dzahn: temp use bast-test.eqiad.wmnet for wmf4727 [dns] - https://gerrit.wikimedia.org/r/418598 (https://phabricator.wikimedia.org/T186623) [20:12:00] 19:15:53 < icinga-wm> RECOVERY - Nginx local proxy to apache on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.006 second response time [20:12:03] 19:15:53 < icinga-wm> RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [20:12:06] 19:15:58 -!- govg [~govg@unaffiliated/govg] has quit [Ping timeout: 264 seconds] [20:42:39] (03CR) 10Hashar: [C: 032] Fix typo in word compatibility [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418128 (owner: 10Umherirrender) [20:44:18] (03Merged) 10jenkins-bot: Fix typo in word compatibility [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418128 (owner: 10Umherirrender) [20:44:36] (03CR) 10jenkins-bot: Fix typo in word compatibility [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418128 (owner: 10Umherirrender) [21:43:06] (03PS2) 10Ema: varnishslowlog: add fetch overhead introduced by varnish [puppet] - 10https://gerrit.wikimedia.org/r/418603 [22:37:33] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [22:37:34] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [22:37:43] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:37:43] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:37:53] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [22:38:03] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:38:03] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:38:03] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:38:13] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:38:13] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out 
before a response was received [22:38:33] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1004.eqiad.wmnet]) [22:39:44] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [22:39:44] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [22:43:39] (03PS1) 10Ahmed123: Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 [22:56:05] (03PS2) 10Ahmed123: Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 [23:42:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team: Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456#4040780 (10bd808) 05Open>03stalled
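The 22:37 recurrence again shows the druid100x brokers "marked down but pooled" on the druid-public-broker_8082 service. A quick sketch for checking which brokers actually answer, polling each one the way an HTTP health probe would; the port comes from the alert text, while the /status route is the standard Druid endpoint and should be treated as an assumption if the local build differs:

```python
# Poll each public Druid broker named in the PyBal alerts above and report
# whether it answers over HTTP. Port 8082 comes from the alert text; the
# /status route is assumed to match the standard Druid API.
import urllib.request

BROKERS = ["druid1004.eqiad.wmnet", "druid1005.eqiad.wmnet", "druid1006.eqiad.wmnet"]

def broker_state(host, timeout=5):
    try:
        with urllib.request.urlopen(f"http://{host}:8082/status", timeout=timeout):
            return "up"
    except Exception as exc:
        return f"down ({exc})"

if __name__ == "__main__":
    for host in BROKERS:
        print(f"{host}: {broker_state(host)}")
```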