[00:00:20] PROBLEM - Maps - OSM synchronization lag - codfw on einsteinium is CRITICAL: CRITICAL - maps-osm-sync-lag-codf is 172814 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [00:01:00] PROBLEM - Maps - OSM synchronization lag - eqiad on einsteinium is CRITICAL: CRITICAL - maps-osm-sync-lag-eqiad is 172854 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [02:42:50] RECOVERY - Maps - OSM synchronization lag - codfw on einsteinium is OK: OK - maps-osm-sync-lag-codf is 9762 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [02:44:20] RECOVERY - Maps - OSM synchronization lag - eqiad on einsteinium is OK: OK - maps-osm-sync-lag-eqiad is 9853 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [03:29:11] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 735.16 seconds [04:03:30] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 614.01 seconds [04:35:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.17 seconds [04:49:51] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.96 seconds [05:32:20] PROBLEM - MariaDB Slave Lag: s4 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 504.77 seconds [05:32:23] PROBLEM - MariaDB Slave Lag: s4 on db1064 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 506.55 seconds [05:33:37] <_joe_> uhm [05:33:57] <_joe_> 2 servers in s4 [05:37:38] <_joe_> well, one, db1102 is dependent on db1064 [05:37:52] <_joe_> and it's recovering anyways [05:38:30] RECOVERY - MariaDB Slave Lag: s4 on db1102 is OK: OK slave_sql_lag Replication lag: 5.55 seconds [05:38:33] RECOVERY - MariaDB Slave Lag: s4 on db1064 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [05:38:50] <_joe_> ok, back to bed I guess [06:04:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:04:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [06:14:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:14:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:02:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:04:00] PROBLEM - Text HTTP 5xx reqs/min on 
graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:25:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:26:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:35:30] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.78 ms [07:42:11] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [07:58:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:58:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:06:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:06:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:43:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:43:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:51:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:52:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:05:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:06:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:06:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:14:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:16:21] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:16:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:24:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:26:30] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:26:30] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:27:21] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [10:35:31] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:36:30] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [10:36:31] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:36:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:52:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:53:13] _joe_: does not seem to get better [10:53:30] (nobody in ops clinic?) [10:53:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:53:45] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2807 bytes in 1.699 second response time [10:54:05] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2807 bytes in 0.541 second response time [10:54:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [10:54:54] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2789 bytes in 1.951 second response time [10:54:56] <_joe_> hey [10:55:03] what's up? [10:55:06] <_joe_> getting to the pc now, 1 min [10:55:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:55:11] * volans checking [10:55:12] <_joe_> volans: no idea, a ton of 5xx [10:55:52] hey, got paged [10:56:04] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17452 bytes in 1.832 second response time [10:56:08] akosiaris_: yeah we don't know, started looking now [10:56:10] I got a page for ulsfo [10:56:15] high rate of errors before that [10:56:18] <_joe_> everything was apparently down [10:56:24] PROBLEM - LVS HTTPS IPv4 on text-lb.codfw.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2720 bytes in 0.222 second response time [10:56:28] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2736 bytes in 0.510 second response time [10:56:32] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2772 bytes in 0.542 second response time [10:56:45] hey, I'm here too [10:56:55] <_joe_> can someone go to oxygen and try to understand what's up? 
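The "X% of data above the critical threshold" alerts above come from a graphite-backed check: it samples the recent 5xx-per-minute series and goes CRITICAL when a given fraction of datapoints exceeds 1000/min (recovering once less than 1% is above 250/min). A minimal Python sketch of that logic, assuming a hypothetical metric path and only the standard graphite render API:

```python
# Minimal sketch of the "X% of data above the critical threshold" logic behind
# the Esams/Text HTTP 5xx alerts above. The metric path and thresholds are
# illustrative assumptions, not the exact check_graphite configuration.
import json
import urllib.request

GRAPHITE = "https://graphite.wikimedia.org/render"
METRIC = "reqstats.5xx"          # hypothetical series name
CRITICAL = 1000.0                # 5xx reqs/min, as shown in the alert text
CRIT_FRACTION = 0.10             # alert when >=10% of recent points exceed it

def check_5xx(window="-10min"):
    url = f"{GRAPHITE}/?target={METRIC}&from={window}&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not points:
        return "UNKNOWN: no data"
    above = sum(1 for v in points if v > CRITICAL)
    frac = above / len(points)
    if frac >= CRIT_FRACTION:
        return f"CRITICAL: {frac:.2%} of data above {CRITICAL}"
    return f"OK: less than {CRIT_FRACTION:.0%} above the threshold"

if __name__ == "__main__":
    print(check_5xx())
```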
[10:57:05] <_joe_> I don't see errors navigating the sites myself [10:57:06] was about to cry but seems icinga was faster [10:57:08] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2719 bytes in 0.510 second response time [10:57:08] 20K errors/min [10:57:11] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17447 bytes in 1.310 second response time [10:57:27] LBs on grafana seems ok: https://grafana.wikimedia.org/dashboard/db/load-balancers?orgId=1&from=now-1h&to=now [10:57:57] what's up? [10:58:04] Request from IP via cp1068 cp1068, Varnish XID 975995491 [10:58:04] Error: 503, Backend fetch failed at Sat, 10 Mar 2018 10:57:24 GMT in case someone wants it [10:58:15] bblack: 5xx on text and misc, no upload [10:58:17] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17399 bytes in 0.507 second response time [10:58:19] <_joe_> bblack: various sites dow [10:58:19] same thing here [10:58:19] I get them as well [10:58:24] https://de.wikipedia.org/w/index.php?title=Aerifizieren&action=history for example [10:58:26] <_joe_> *down [10:58:34] I was on enwiki [10:58:42] <_joe_> I'm checking fatalmonitor [10:58:44] it appears and disappears randomly [10:58:47] RECOVERY - LVS HTTPS IPv4 on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17397 bytes in 0.299 second response time [10:58:55] <_joe_> but the fact it's so widespread seems to point to varnish [10:58:59] looks like text 5xx rates were flapping for some hours before these pages though [10:58:59] https://en.wikipedia.org/wiki/Special:Contributions/-revi just down [10:59:00] hey, just got a lot of pages. I'm around [10:59:05] https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-aggregate-client-status-code?orgId=1&from=now-30m&to=now&var-site=eqiad&var-site=codfw&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-cache_type=varnish-misc&var-status_type=5 [10:59:20] those are the 5xx on text + misc [10:59:28] misc are lower numbers so seems zero [10:59:30] <_joe_> yeah fatalmonitor doesn't show errors at the mediawiki level [10:59:32] but has the same pattern [10:59:32] so since around 10 minutes [10:59:48] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17398 bytes in 0.520 second response time [10:59:57] I see errors on text, on search [11:00:00] *about [11:00:01] also text-lb.eqiad didn't page (yet) [11:00:04] no pattern [11:00:14] (yet that I can see) [11:00:23] <_joe_> ok, so it's not mediawiki [11:00:29] <_joe_> I'm pretty sure about that [11:00:42] I think it slowed down now [11:00:55] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=1520659341314&to=1520679637659 [11:00:58] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17451 bytes in 0.506 second response time [11:01:02] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17450 bytes in 0.505 second response time [11:01:08] no, we're just in the lull in the pattern of increasing spikes [11:02:21] <_joe_> I see some 503s, but not many right now on oxygen [11:02:34] before it was cp3030 [11:02:57] lots of connections there [11:02:57] <_joe_> it's all over the place [11:03:05] I think now it is all [11:03:23]
https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches [11:03:26] it's probably not site-specific when it's hitting all anyways [11:04:14] <_joe_> yes [11:04:16] just NOT eqiad, interestingly [11:04:16] jynus: I was about to link the same, yeah [11:04:18] most likely the problem truly exists either in eqiad caches, or behind varnish [11:04:24] yeah [11:04:25] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [11:04:26] <_joe_> it's gone now [11:04:33] it's not gone, it will most likely come back [11:04:33] <_joe_> no I don't think it's mediawiki tbh [11:04:34] should we do a preventive restart? [11:04:39] <_joe_> it's mostly 503s [11:04:41] on 1 cache [11:04:44] *of [11:04:49] cp1068 [11:04:56] (a) apps can cause 503s and (b) MW isn't the only app [11:05:26] (c) the spikes are happening with a frequency and increasing severity, and from the eqiad pov move to a new CP machine each time: [11:05:29] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&from=1520675003008&to=1520679820862 [11:05:47] <_joe_> bblack: ok, I don't think it's mediawiki throwing fatals [11:06:22] we're in the ~20 minute low spot in the pattern, there's probably another spike coming [11:06:29] <_joe_> it started again [11:06:42] ? [11:06:46] <_joe_> cp1055 [11:06:53] for what I see, the user impact is some, but not as large as others in the past [11:07:12] <_joe_> all requests throwing 5xx atm are going through cp1055 [11:07:29] in the most recent spike [11:07:33] <_joe_> seen with tail -f 5xx.json | grep -v upload.wikimedia.org | jq .x_cache [11:07:37] it was 1068 [11:07:41] <_joe_> ok [11:07:48] 1055 was the previous spike [11:07:53] _joe_: I think uploads weren't too affected [11:07:55] before that it was 1053 [11:08:00] each spike gets bigger than the last [11:08:06] <_joe_> jynus: hence the removal [11:08:16] oh, sorry, didn't see the -v [11:08:21] <_joe_> bblack: any idea what's happening?
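_joe_'s `tail -f 5xx.json | grep -v upload.wikimedia.org | jq .x_cache` is how each spike got pinned to a single cp host. A rough Python equivalent that tallies the cp hosts seen in the X-Cache chains of recent 5xx entries; the `x_cache` and `uri_host` field names are assumptions based on that jq/grep usage:

```python
# Tally which cache host dominates the X-Cache chains of recent 5xx entries.
# Field names (x_cache, uri_host) are assumptions based on the jq usage above.
import json
import sys
from collections import Counter

def tally_x_cache(stream, limit=20000):
    hosts = Counter()
    for i, line in enumerate(stream):
        if i >= limit:
            break
        try:
            entry = json.loads(line)
        except ValueError:
            continue
        if "upload.wikimedia.org" in entry.get("uri_host", ""):
            continue  # mirrors the `grep -v upload.wikimedia.org`
        # X-Cache looks like "cp1068 miss, cp3030 pass, ..."; count each cp host
        for token in entry.get("x_cache", "").split():
            if token.startswith("cp"):
                hosts[token.rstrip(",")] += 1
    return hosts

if __name__ == "__main__":
    for host, count in tally_x_cache(sys.stdin).most_common(5):
        print(f"{count:8d}  {host}")
```

Fed with something like `tail -n 20000 5xx.json | python3 tally_x_cache.py` (hypothetical filename), the dominant host of the current spike (cp1053, cp1055, cp1068 in the conversation above) would show up at the top.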
[11:10:13] spikes 1 per hour or less [11:10:23] in the last 6 hours, with different error rate [11:10:55] no idea yet [11:11:43] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:19:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:19:53] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [11:20:08] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1520678919286&to=1520679820862&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All [11:20:29] ^ there's the last wave of 5xx from esams pov, but no correlating ramp up to ~12K backend connections in the graph below it [11:20:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:21:30] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [11:21:56] ^ this shows the eqiad pov, where we can see the actual backend conn pileups in the graph below it, but the limits seem awful low and they never plateau [11:22:18] (~2K conns and no flatness during the peak of it) [11:22:48] but they plateau in the per-host one on cp1068 at 1k [11:23:53] bblack: during cp1068's peak, expiry mailbox lag hit 150K [11:24:44] also spikes during cp1055's peak [11:25:29] yes [11:25:50] https://grafana.wikimedia.org/dashboard/db/varnish-mailbox-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [11:26:00] but we've come to understand that our "expiry mailbox lag" invention is something that correlates with almost any problem from anywhere, it's like a gauge for "varnish isn't happy" [11:32:38] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2786 bytes in 0.542 second response time [11:33:35] cp1068 a second time [11:33:38] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17449 bytes in 0.503 second response time [11:33:39] <_joe_> yes [11:33:57] 1055 had follow-on patterns as well [11:34:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:34:13] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [11:35:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:35:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:40:57] 10Operations, 10UniversalLanguageSelector, 10I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#4040125 (10Aklapper) > Afterwards, fonts appeared as they used to appear before June 2016 @mehtab.ahmed: Is that good? is that bad? And what does that mean for this task? [11:51:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:55:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:56:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:57:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:07:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:08:02] Error: 503, Backend fetch failed at Sat, 10 Mar 2018 13:07:45 GMT [13:08:28] ongoing problem, Hauskatze, see topic :p [13:08:38] ack, thx [13:10:13] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:10:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:11:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:12:48] another spike, this one through cp1053 [13:13:04] it's already over now, the icinga 5xx alerts tend to lag behind reality [13:18:50] disk i/o rate doesn't have any anomaly until ~13:10:32, obj expiration rate dumps off at 13:05:32 [13:19:48] it's semi-interesting that the *frontend* instance shows a marked dropoff in object expiration rate at 13:02:32, which is way ahead of other things so far [13:20:38] backend req rate: 13:07:32 before anything significant [13:21:05] backend connection count shows signs at 13:05:32 [13:21:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:21:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:22:14] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:22:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:26:57] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2772 bytes in 1.887 second response time [13:27:48] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2787 bytes in 0.540 second response time [13:28:28] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2804 bytes in 1.874 second response time [13:29:07] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17449 bytes in 1.502 second response time [13:29:23] !log cp1068/cp1055: restart varnish-be [13:29:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:37] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17450 bytes in 1.537 second response time [13:29:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:29:58] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17451 bytes in 0.498 second response 
time [13:30:44] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:30:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:31:24] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [13:32:58] (03PS1) 10Ema: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 [13:33:28] (03CR) 10jerkins-bot: [V: 04-1] varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [13:34:36] so this is ongoing indeed. got paged again [13:35:23] it's being handled, you can go back to whatever you were doing unless you're bored [13:35:50] 😊 [13:36:03] ack, thx [13:36:04] thanks! [13:38:49] (03PS2) 10Ema: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 [13:40:26] cp3046 had mailbox spike [13:40:29] might need a restart [13:41:33] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [13:41:37] that's cache_upload, unrelated to whatever's going on in the text world [13:41:54] (but also, thin spikes tend not to mean long-term problems. ramps do tend to, though) [13:42:09] see how these eqiad text ones look more rampy than spiky? [13:42:11] https://grafana.wikimedia.org/dashboard/db/varnish-mailbox-lag?orgId=1&from=1520670072323&to=1520681224598&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [13:42:43] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:42:53] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [13:42:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:43:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:43:58] Upgrading? 
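The "expiry mailbox lag" being watched above appears to be derived from two varnishstat counters: objects handed to the expiry thread minus objects it has processed. A sketch of reading it on a cache host before deciding on a varnish-be restart; the exact derivation and the 100K threshold are assumptions for illustration, not the actual policy:

```python
# Compute "expiry mailbox lag" from varnishstat counters on a cache host:
# MAIN.exp_mailed (objects mailed to the expiry thread) minus
# MAIN.exp_received (objects the expiry thread has processed).
import json
import subprocess

def mailbox_lag():
    out = subprocess.check_output(["varnishstat", "-j"])
    stats = json.loads(out)
    mailed = stats["MAIN.exp_mailed"]["value"]
    received = stats["MAIN.exp_received"]["value"]
    return mailed - received

if __name__ == "__main__":
    lag = mailbox_lag()
    print(f"expiry mailbox lag: {lag}")
    if lag > 100_000:
        print("sustained lag this high is the kind of ramp that preceded the varnish-be restarts above")
```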
[13:44:00] "Request from 88.97.96.89 via cp1068 cp1068, Varnish XID 196378886 [13:44:01] Error: 503, Backend fetch failed at Sat, 10 Mar 2018 13:25:42 GMT" [13:44:52] ShakespeareFan02 that's known, see topic [13:44:53] no see topic [13:54:26] Thanks [13:55:43] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms [14:12:44] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:28:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:36:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:46:51] (03CR) 10Urbanecm: "Thanks, fixed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [14:46:53] (03PS5) 10Urbanecm: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) [14:47:02] (03PS6) 10Urbanecm: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) [14:47:51] (03PS4) 10Urbanecm: Add romd.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/412898 (https://phabricator.wikimedia.org/T187184) [14:47:53] (03PS2) 10Urbanecm: Add romd.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/412896 (https://phabricator.wikimedia.org/T187184) [14:56:01] !log cp1053: restart varnish-be [14:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:52] (03PS4) 10MarcoAurelio: Disable abusefilter from collecting private data on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) [16:17:40] (03CR) 10MarcoAurelio: Disable abusefilter from collecting private data on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [16:27:17] (03CR) 10Huji: [C: 031] Disable abusefilter from collecting private data on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [16:43:13] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:13] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:13] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:13] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:13] PROBLEM - 
aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:13] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:43:23] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [16:43:44] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet are marked down but pooled [16:43:44] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [16:44:04] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [16:45:30] elukey ^^ [16:46:13] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [16:46:42] huh [16:46:46] nothing obvious in syslog on a random aqs node [16:47:13] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [16:47:13] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet]) [16:47:14] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [16:47:53] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [16:48:03] or druid [16:48:04] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1053 bytes in 0.002 second response time [16:48:04] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [16:48:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [16:48:50] and I don't see any processes having restarted around that point in time either [16:49:04] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [16:49:08] must be something up in the complexity layers above OS-level stuff. 
I donno cassandra or java stuff or whatever [16:49:13] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [16:49:16] or nodejs as the case may be [16:49:50] * bblack puts on his dumb sysadmin hat and points at other people [16:51:38] I'm not gone, just reading wikitech aqs page :-/ [16:53:49] got nothin out of atop on random node (specifically aqs1008) [16:54:34] the aqs endpoint alerts seem to be clearing on their own [16:54:47] somewhat [16:54:54] druid not so much, the pybal thing [16:57:08] there is absolutely no point in me digging through these debug logs for cassandra, they are firehoses [16:57:09] meh [16:57:13] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [16:57:13] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [16:57:14] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [16:57:33] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [16:58:03] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [16:58:19] (03PS1) 10Jcrespo: Add support to global query execution limit [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/418593 [16:58:49] oh look it fixed itself, yay [16:59:03] well good [16:59:15] because me looking at these also firehose gc logs wasn't getting anywhere either [16:59:44] there's probably (I hope?) some dashboard somewhere with the right stuff but no idea where [17:01:13] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [17:02:04] https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-gc?refresh=1m&orgId=1 found this, not so enlightening [17:25:07] (03PS1) 10Jcrespo: Add script for dumping and recovering database sections [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/418595 [17:38:18] hello!
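For reference, the "aqs endpoints health" checks that timed out at 16:43 probe the edits/per-page route shown in the alert text. A sketch of reproducing that probe against a single AQS node; the host, port and sample parameters are assumptions for illustration, only the route template comes from the alerts:

```python
# Rough reproduction of the "aqs endpoints health" probe that was timing out.
# Host, port and the sample page/date values are assumptions; only the route
# template comes from the alert text above.
import urllib.request

AQS = "http://aqs1004.eqiad.wmnet:7232"   # assumed internal host:port
ROUTE = ("/analytics.wikimedia.org/v1/edits/per-page/"
         "en.wikipedia.org/0/all-editor-types/daily/20180101/20180301")

def probe(timeout=10):
    try:
        with urllib.request.urlopen(AQS + ROUTE, timeout=timeout) as resp:
            return f"OK: HTTP {resp.status}"
    except Exception as exc:   # timeouts surface here, like in the icinga check
        return f"CRITICAL: {exc}"

if __name__ == "__main__":
    print(probe())
```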
[17:38:22] hey hey [17:38:29] hi [17:38:29] just seen the alarms sorry I wasn't online before :( [17:38:34] well the aqs whatever it was, is ok now [17:38:40] same here, i just came on IRC for something [17:38:49] but if you wanted to look just to tell us what the problem was, :-D or not [17:38:55] I mean since it is all ok [17:40:17] so AQS sits in front of Druid to power an API that returns metrics to wikistats 2, as starter :) [17:40:37] so the AQS endpoints health should be a side effect of Druid misbehaving probably [17:40:53] the AQS failures were only for /analytics.wikimedia.org/v1/edits/per-page [17:41:56] second step is checking the Druid dashboard [17:41:57] https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All&from=now-3h&to=now [17:42:18] (there are two clusters, "public" is the one that is behind AQS, "analytics" is the one behind pivot) [17:43:47] I can see a jump in req/s around the time of the first AQS alert [17:44:06] and also the broker response cache jumping up as well, as a consequence [17:46:01] druid was load tested by Dan in my team and we didn't get into this case, but I think that big queries to druid that are not yet cached might not be super efficient [17:46:06] causing things like these [17:47:59] the query count per second, I only see a little blip there around 16:40 or whatever, the big leap is later like a few minutes ago [17:48:02] if I'm looking at the right graph [17:49:03] well we jumped from 0 to 27/s, not a blip if the queries were big and not cached [17:49:15] I am talking about https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=43&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All&from=now-3h&to=now [17:49:42] big jump was at ~16:41 that matches timing wise [17:50:22] I am seeing timeouts in logs, yeah I am pretty sure this is what happened [17:51:31] huh [17:51:33] ah snap sorry there are multiple query counts [17:51:51] ok now i have to ask why the one at 17.33 didn't do anything to us [17:52:03] I think due to caching [17:52:14] huh [17:52:23] so druid runs multiple daemons, two of them are called "Broker" and "historical" [17:52:59] the broker is the one that answers to clients queries, and queries the "historical" daemons on all druid hosts for the data contained in the druid segments [17:53:23] ok [17:53:56] the broker might need to get data from multiple of them, aggregate/etc.. and then return it to the client [17:54:24] the response is then cached so subsequent queries don't hit the historicals [17:54:38] they might need to load data from HDFS if needed [17:55:55] ok, got it [17:55:57] and only yesterday Joseph was able to compute the february's data for mw-history using the hadoop cluster ( we had some issues) [17:56:19] but these spikes are for the historical one? [17:56:51] sorry for these very basic questions but it's the first time I look at aqs [17:57:21] nono I am trying to reason out loud about what happened to see if it makes sense, a brain bounce is always good :) [17:58:05] ok!
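The edits lookups that AQS forwards to the public Druid broker (port 8082, as in the PyBal alerts) are ordinary Druid native queries. A sketch of what such a query might look like; the broker host and datasource name are assumptions, while the query shape is the standard Druid "timeseries" form:

```python
# Sketch of the kind of native query AQS could send to the public Druid broker.
# Broker host and datasource name are assumptions for illustration.
import json
import urllib.request

BROKER = "http://druid1004.eqiad.wmnet:8082/druid/v2/"
query = {
    "queryType": "timeseries",
    "dataSource": "mediawiki_history_reduced",   # assumed datasource name
    "granularity": "day",
    "intervals": ["2018-01-01/2018-03-01"],
    "filter": {"type": "selector", "dimension": "project", "value": "en.wikipedia"},
    "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "events"}],
}

req = urllib.request.Request(
    BROKER,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    # An uncached query like this fans out to the historicals; repeating it
    # should be answered from the broker's result cache.
    print(json.load(resp)[:3])
```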
[17:58:07] so from what I know one query to the broker can trigger multiple queries to the historicals, that may need to load data from HDFS and load them into page cache [17:58:35] once this happens (and mw-history fits in memory iirc) then it is super fast [17:58:40] but once they are there then no problems [17:58:41] ok [17:58:45] yeah exactly [17:58:55] so the first queries hit hard the historicals [17:58:59] ok [17:59:04] and we should see that on this graph [17:59:13] then when all the cache layers were in place even a big spike didn't cause any flinch to druid [17:59:21] that I don't understand [17:59:32] unless this shows somehow the 'front end' requests only [18:00:03] the graph, I mean. I'm not clear which requests it actually shows [18:01:30] in the prometheus dashboard we should have metrics for the broker, historical and coordinator (not important in this case) [18:01:46] it should be indicated with labels on the grafana rows [18:02:01] at 16:40 I can see the historicals being hit [18:04:32] (I am reading historical logs on druid1004 - /var/log/druid/historical.log) [18:05:32] ok [18:12:20] apergos: this is another good broker metric [18:12:21] https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=16&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All&from=now-3h&to=now [18:12:23] ok now that I see we're looking at the broker requests, it's making sense [18:13:13] the spike indicates a bunch of requests coming in for something not yet cached? [18:13:39] ah right and cache misses line up right with that [18:13:47] yep exactly [18:13:54] causing overload and timeouts [18:13:57] ok! I am believing this story [18:14:03] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:05] soooo.... the next time we see this [18:14:29] no no no, really not interested in looking at the jobrunners too [18:14:37] (03CR) 10Ema: "As discussed on IRC, a potential issue here would be excessive logging by varnish-fe-slowreqs caused by slow clients. However, 5 minutes o" [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [18:14:45] I am going to open a task (probably tomorrow) to track this, I'll follow up with my team [18:15:03] PROBLEM - Nginx local proxy to apache on mw1305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:11] ok, so we sort of cross our fingers and hope for quick recovery (assuming the graphs tell the same story) if it happens again, yes? [18:15:26] and look for you if you are around! [18:15:38] (03PS1) 10Dzahn: temp use bast-test.eqiad.wmnet for wmf4727 [dns] - 10https://gerrit.wikimedia.org/r/418598 (https://phabricator.wikimedia.org/T186623) [18:15:53] RECOVERY - Nginx local proxy to apache on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.006 second response time [18:15:53] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [18:16:13] meh just about to hop over there and it's back [18:16:13] apergos: feel free to call me if you need me, but for the moment the API only powers the wikistats 2 beta website [18:16:20] ok!
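The explanation above (the broker fans out to the historicals on a cache miss, then serves identical queries from its result cache) is why the cold 16:40 burst caused timeouts while the 17:33 one did not. A toy model of that behaviour, with made-up numbers purely for illustration:

```python
# Toy illustration of the broker/historical behaviour described above: the
# first time a query arrives the broker fans out to the historicals (slow,
# possibly pulling segments into page cache); identical queries afterwards
# are served from the broker's result cache.
import time

result_cache = {}

def query_historicals(q):
    time.sleep(0.5)                 # pretend fan-out + segment scan is expensive
    return f"result({q})"

def broker(q):
    if q not in result_cache:       # cache miss -> hit the historicals
        result_cache[q] = query_historicals(q)
    return result_cache[q]

if __name__ == "__main__":
    for label in ("first burst (cold)", "second burst (warm)"):
        start = time.time()
        for _ in range(20):
            broker("edits per-page enwiki 2018-01/2018-03")
        print(f"{label}: {time.time() - start:.2f}s for 20 identical queries")
```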
[18:16:25] super :) [18:16:38] thank you very much for the walkthrough [18:16:49] the HHVM alert is self-fixing because of the automatic restart script [18:17:18] the jobrunner alert just happened to be one of the others on mw1305 [18:17:42] yeah I figured it was hhvm [18:17:54] Mar 10 18:15:13 mw1305 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV [18:18:01] not super great though :D [18:18:04] no [18:18:21] very low in the list of super great things [18:18:25] someday we'll be on php7 with a whole new host of problems [18:18:52] yes, but also common. and yes :p [18:18:55] I forgot about that auto restart though [18:19:54] isn't it systemd restarting hhvm after it dies? [18:20:13] the restart script should only restart hhvm when eligible [18:20:19] via cron [18:22:19] I have no idea unless I go look [18:22:23] * apergos pokes around in puppet [18:24:41] yeah it has Restart=always, indeed [18:52:40] (03CR) 10Dzahn: [C: 032] temp use bast-test.eqiad.wmnet for wmf4727 [dns] - 10https://gerrit.wikimedia.org/r/418598 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [18:55:44] (03PS1) 10Dzahn: DHCP: test wmf4727 as bast-test.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/418600 (https://phabricator.wikimedia.org/T186623) [19:03:04] (03PS2) 10Dzahn: DHCP: test wmf4727 as bast-test.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/418600 (https://phabricator.wikimedia.org/T186623) [19:03:35] (03CR) 10Dzahn: [C: 032] "as a test because the switch port isn't configured for public VLAN" [puppet] - 10https://gerrit.wikimedia.org/r/418600 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [19:12:19] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4040520 (10Dzahn) when using wmf4727 and testing stretch install: ``` No kernel modules were found. This probably is due to a mismatch │ │ between the kernel used by this version of the i...
[19:55:24] (03PS1) 10Ema: varnishslowlog: add fetch overhead introduced by varnish [puppet] - 10https://gerrit.wikimedia.org/r/418603 [20:11:57] ibugs> (PS1) Dzahn: temp use bast-test.eqiad.wmnet for wmf4727 [dns] - https://gerrit.wikimedia.org/r/418598 (https://phabricator.wikimedia.org/T186623) [20:12:00] 19:15:53 < icinga-wm> RECOVERY - Nginx local proxy to apache on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.006 second response time [20:12:03] 19:15:53 < icinga-wm> RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [20:12:06] 19:15:58 -!- govg [~govg@unaffiliated/govg] has quit [Ping timeout: 264 seconds] [20:42:39] (03CR) 10Hashar: [C: 032] Fix typo in word compatibility [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418128 (owner: 10Umherirrender) [20:44:18] (03Merged) 10jenkins-bot: Fix typo in word compatibility [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418128 (owner: 10Umherirrender) [20:44:36] (03CR) 10jenkins-bot: Fix typo in word compatibility [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418128 (owner: 10Umherirrender) [21:43:06] (03PS2) 10Ema: varnishslowlog: add fetch overhead introduced by varnish [puppet] - 10https://gerrit.wikimedia.org/r/418603 [22:37:33] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [22:37:34] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [22:37:43] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:37:43] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:37:53] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [22:38:03] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:38:03] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:38:03] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:38:13] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [22:38:13] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out 
before a response was received [22:38:33] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1004.eqiad.wmnet]) [22:39:44] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [22:39:44] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [22:43:39] (03PS1) 10Ahmed123: Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 [22:56:05] (03PS2) 10Ahmed123: Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 [23:42:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team: Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456#4040780 (10bd808) 05Open>03stalled
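The 22:37 recurrence again shows the druid100x brokers "marked down but pooled" on the druid-public-broker_8082 service. A quick sketch for checking which brokers actually answer, polling each one the way an HTTP health probe would; the port comes from the alert text, while the /status route is the standard Druid endpoint and should be treated as an assumption if the local build differs:

```python
# Poll each public Druid broker named in the PyBal alerts above and report
# whether it answers over HTTP. Port 8082 comes from the alert text; the
# /status route is assumed to match the standard Druid API.
import urllib.request

BROKERS = ["druid1004.eqiad.wmnet", "druid1005.eqiad.wmnet", "druid1006.eqiad.wmnet"]

def broker_state(host, timeout=5):
    try:
        with urllib.request.urlopen(f"http://{host}:8082/status", timeout=timeout):
            return "up"
    except Exception as exc:
        return f"down ({exc})"

if __name__ == "__main__":
    for host in BROKERS:
        print(f"{host}: {broker_state(host)}")
```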