[00:00:02] <Krenair>	 bblack would know of course
[00:00:39] <ori>	 it's correct, except there's an effort to eliminate the lvs from between varnish and apache and rely on varnish's own load-balancing capabilities
[00:02:44] <Toordog>	 and you probably use memcache on the backend to avoid too much database read?
[00:02:58] <Krenair>	 yep
[00:03:05] <YuviPanda>	 Toordog: we have a lot of layers of caching
[00:03:23] <YuviPanda>	 memcached, then there is 'parser cache', we also use redis for sessions (and job queues)
[00:04:06] <Toordog>	 I was wondering what was the use of redis
[00:04:23] <Krenair>	 parser cache sits in mysql doesn't it?
[00:04:24] <YuviPanda>	 Toordog: we use it for session storage because it can replicate cross DC
[00:04:30] <Toordog>	 ok
[00:04:31] <YuviPanda>	 Krenair: it's a separate mysql cluster, yeah
[00:04:34] <Krenair>	 yeah
[00:04:47] <Krenair>	 redis also gets used for sending the newer (non-irc) rc feeds
[00:05:39] <Toordog>	 what is squid for?  on the architecture drawing, it shows that varnish is only used for mobile and upload, while squid would be used for normal content?
[00:05:42] <Krenair>	 and something to do with file backends, and something to do with the profiler. according to grep on the mediawiki config
[00:05:55] <Krenair>	 squid was replaced with varnish, mostly
[00:06:09] <Toordog>	 ah ok, outdate drawing then :)
[00:06:14] <Toordog>	 that make more sense too
[00:08:09] <Krenair>	 There's still a bunch of things that refer to varnish as 'squid'
[00:08:39] <YuviPanda>	 yeah
[00:08:44] <Toordog>	 the schema make it hard to see which cluster run what, what is the front end and backend *other than obvious reason like a database*.
[00:08:53] <YuviPanda>	 what diagram is this?
[00:09:01] <Toordog>	 the one on the link you sent me about LVS
[00:09:25] <YuviPanda>	 ah
[00:09:37] <YuviPanda>	 that's just the english wikipedia article on LVS
[00:09:43] <YuviPanda>	 nothing to do with wikimedia's architecture
[00:10:00] <Toordog>	 aww, i though it was the wikimedia ofne 
[00:10:01] <Krenair>	 which shows wikimedia's network as an example, see the diagram
[00:10:01] <Toordog>	 hahaha
[00:10:03] <YuviPanda>	 oh lol
[00:10:04] <YuviPanda>	 it does
[00:10:12] <YuviPanda>	 yeah that's very outdated
[00:10:33] <Toordog>	 you have a more recent schema?
[00:10:48] <YuviPanda>	 nope
[00:11:02] <YuviPanda>	 someone should do one at some point.... >_>
[00:11:16] <Toordog>	 if i get hired i'll do it ;)
[00:11:26] <YuviPanda>	 heh
[00:12:02] <Krenair>	 YuviPanda, I suspect that has been suggested multiple times over the years, someone has drawn a diagram, and then it's left to rot until someone else starts again from scratch :/
[00:12:28] <YuviPanda>	 yup
[00:12:34] <YuviPanda>	 needs an extensible diagram
[00:13:09] <Krenair>	 this one is an SVG by Ryan Lane, two years old
[00:16:16] <Krenair>	 Oh, look, second entry on wikitech's Special:ListFiles:
[00:16:17] <Krenair>	 https://wikitech.wikimedia.org/wiki/File:Wikimedia-cluster.svg
[00:18:08] <Toordog>	 nice you are using logstash :)
[00:25:01] <YuviPanda>	 nice
[00:25:05] <YuviPanda>	 hat's pretty accurate
[00:41:58] <icinga-wm>	 PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.119601328904
[00:52:29] <wikibugs>	 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (10Tgr) 3NEW
[00:52:55] <wikibugs>	 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634598 (10Tgr)
[01:06:08] <icinga-wm>	 RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.00333333333333
[01:16:20] <icinga-wm>	 PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[01:18:09] <icinga-wm>	 RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms
[02:34:18] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 10m 13s)
[02:34:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:39] <icinga-wm>	 PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 21 connecting: (unnamed) not-conn: cp2015_v6, cp3017_v6, cp4011_v6
[02:37:39] <icinga-wm>	 RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[02:40:44] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-13 02:40:43+00:00
[02:40:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:49] <icinga-wm>	 PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:23:03] <grrrit-wm>	 (03CR) 10Tim Starling: "Yes, it was Jimmy's idea in the first place, but Jimmy is not blocking alteration of that setting. According to Greg Maxwell he was in fav" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237960 (owner: 10Kaldari)
[03:25:48] <icinga-wm>	 RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:35:49] <icinga-wm>	 PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:09] <icinga-wm>	 PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:48] <icinga-wm>	 PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:49:09] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[04:02:01] <icinga-wm>	 RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:02:19] <icinga-wm>	 RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:02:58] <icinga-wm>	 RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:06:09] <icinga-wm>	 PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:10:09] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2628.38 Read Requests/Sec=2375.59 Write Requests/Sec=48.31 KBytes Read/Sec=9518.28 KBytes_Written/Sec=193.26
[04:11:38] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[04:16:09] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.41 Read Requests/Sec=0.00 Write Requests/Sec=0.30 KBytes Read/Sec=0.00 KBytes_Written/Sec=1.20
[04:35:49] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[04:45:49] <icinga-wm>	 PROBLEM - puppet last run on mw2031 is CRITICAL: CRITICAL: puppet fail
[04:45:58] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[04:51:58] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[04:55:59] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[05:13:59] <icinga-wm>	 RECOVERY - puppet last run on mw2031 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[05:38:28] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[06:02:52] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Sep 13 06:02:52 UTC 2015 (duration 2m 51s)
[06:03:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:15:28] <grrrit-wm>	 (03PS33) 10Gergő Tisza: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles)
[06:25:56] <wikibugs>	 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, 5ContentTranslation-Release6: Apertium Failed to load resource: net::ERR_SPDY_PROTOCOL_ERROR - https://phabricator.wikimedia.org/T112403#1634695 (10Amire80) p:5Triage>3Normal
[06:25:59] <icinga-wm>	 PROBLEM - Disk space on elastic1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%)
[06:27:59] <icinga-wm>	 RECOVERY - Disk space on elastic1001 is OK: DISK OK
[06:31:00] <icinga-wm>	 PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:28] <icinga-wm>	 PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] <icinga-wm>	 PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:20] <icinga-wm>	 PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:20] <icinga-wm>	 PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] <icinga-wm>	 PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:29] <icinga-wm>	 PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:39] <icinga-wm>	 PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:49] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:38] <wikibugs>	 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 6 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1634703 (10Bawolff) Just for cross-reference, a regression was found with rsvg - svgs > 10 MB (there's about 9000 such files) don't render. We'd ne...
[06:55:48] <icinga-wm>	 RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:09] <icinga-wm>	 RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:56:39] <icinga-wm>	 RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:39] <icinga-wm>	 RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:40] <icinga-wm>	 RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:56:48] <icinga-wm>	 RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:56:59] <icinga-wm>	 RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:10] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:19] <icinga-wm>	 RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:29] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[07:00:59] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail
[07:09:39] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[07:27:09] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[07:56:00] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:32:30] <grrrit-wm>	 (03CR) 10Ricordisamoa: [C: 031] Configure $wgExtraSignatureNamespaces for it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237331 (https://phabricator.wikimedia.org/T7645) (owner: 10Nemo bis)
[10:18:02] <wikibugs>	 6operations, 10netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1635062 (10faidon)
[10:19:47] <wikibugs>	 6operations, 10netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1634165 (10faidon) OK. It's impossible to find out what happened without more data unfortunately :(  I'm going to resolve this for now but feel free to reopen (or open a new task) if somethi...
[10:21:48] <wikibugs>	 6operations, 10netops: Wikimedia sites not reachable through CenturyLink ISP - https://phabricator.wikimedia.org/T112396#1635069 (10faidon) 5Open>3Resolved a:3faidon
[12:20:31] <wikibugs>	 6operations, 7Pybal: jessie pybals get restarted every day by logrotate, resetting BGP sessions - https://phabricator.wikimedia.org/T112457#1635133 (10faidon) 3NEW
[12:26:47] <wikibugs>	 6operations, 7Pybal: jessie pybals get restarted every day by logrotate, resetting BGP sessions - https://phabricator.wikimedia.org/T112457#1635140 (10faidon) The easy fix here is to use `reload` rather than `force-reload` from logrotate. It's handled identically by the init script but isn't being interpreted...
[12:29:58] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Use "reload" instead of "force-reload" from logrotate [debs/pybal] - 10https://gerrit.wikimedia.org/r/237986 (https://phabricator.wikimedia.org/T112457) 
[13:19:19] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] Use "reload" instead of "force-reload" from logrotate [debs/pybal] - 10https://gerrit.wikimedia.org/r/237986 (https://phabricator.wikimedia.org/T112457) (owner: 10Faidon Liambotis)
[14:21:18] <icinga-wm>	 PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[14:21:35] <sjoerddebruin>	 Stuff broken?
[14:22:00] <ShakespeareFan00>	 "Request: GET http://en.wikipedia.org/wiki/User_talk:Orion_2012, from 10.20.0.105 via cp1052 cp1052 ([10.64.32.104]:3128), Varnish XID 78792582
[14:22:01] <ShakespeareFan00>	 Forwarded for: 80.176.129.180, 10.20.0.176, 10.20.0.176, 10.20.0.105
[14:22:02] <ShakespeareFan00>	 Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:21:35 GMT "
[14:22:10] <ShakespeareFan00>	 Planned outage guys?
[14:22:10] <Vito>	 errore 503 being served by esams
[14:22:18] <Vito>	 *error
[14:22:33] <ShakespeareFan00>	 Who had calamari for lunch ;) XD
[14:22:48] <Nemo_bis>	 Better?
[14:22:50] <Vito>	 [16:21:18] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 <-- sounds serious
[14:23:00] <sjoerddebruin>	 Working here again.
[14:23:16] <Vito>	 same here
[14:23:55] <wikibugs>	 6operations: 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635273 (10Multichill) 3NEW
[14:26:03] <wikibugs>	 6operations: 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635281 (10Vituzzu) The same at it.wiki.
[14:27:54] <mafk>	 Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:27:28 GMT
[14:27:58] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1247 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:28:09] <sjoerddebruin>	 hmpf
[14:28:15] <ShakespeareFan00>	 Up but intermittent here - (UK 
[14:28:40] <Elfix>	 dead here too
[14:28:41] <ShakespeareFan00>	 Someone make a big change?
[14:28:50] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1170 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [115.2]
[14:29:10] <ShakespeareFan00>	 I've noted this sort of things sometimes happens when a template change propogates across a number of pages
[14:29:30] <wikibugs>	 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635285 (10Multichill) p:5Triage>3Unbreak!
[14:29:49] <icinga-wm>	 PROBLEM - HHVM rendering on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:29:49] <icinga-wm>	 PROBLEM - Apache HTTP on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:29:50] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1043 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:29:50] <icinga-wm>	 PROBLEM - Apache HTTP on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:29:53] <Elfix>	 ShakespeareFan00: not sure it would cause a complete blackout
[14:29:59] <multichill>	 Hi guys, https://phabricator.wikimedia.org/T112463
[14:30:13] <wikibugs>	 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635290 (10MarcoAurelio) Same at eswiki:  Request: GET `[removed]`, from 10.20.0.176 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 2216592190 Forwarded for: `***.***.***.***`, 10.20.0.109, 10.20.0.109, 10.20.0.176...
[14:30:29] <icinga-wm>	 PROBLEM - HHVM queue size on mw1068 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:30:29] <multichill>	 Intermittent 503's on multiple sites (Wikipedia, Commons, Wikidata, and probably more)
[14:30:39] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [500.0]
[14:30:49] <icinga-wm>	 PROBLEM - HHVM queue size on mw1247 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0]
[14:31:09] <icinga-wm>	 PROBLEM - HHVM queue size on mw1253 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0]
[14:31:48] <icinga-wm>	 RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 65419 bytes in 0.082 second response time
[14:31:48] <icinga-wm>	 RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time
[14:31:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.034 second response time
[14:31:59] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1043 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:32:18] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1247 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:32:18] <paravoid>	 hey
[14:32:19] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1071 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:32:29] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1237 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:32:29] <icinga-wm>	 PROBLEM - HHVM queue size on mw1071 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:32:29] <icinga-wm>	 RECOVERY - HHVM queue size on mw1068 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:32:29] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1176 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:32:39] <wikibugs>	 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635293 (10Lluis_tgn) Same for ca.wiki, en.wiki, phabricator... Now is OK.  ```  Request: GET http://ca.wikipedia.org/wiki/Shakira, from 10.20.0.112 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 700001853 Forwarde...
[14:32:44] <paravoid>	 http://ganglia.wikimedia.org/latest/?r=hour&tab=ch&hreg[]=^db1055
[14:32:50] <icinga-wm>	 RECOVERY - HHVM queue size on mw1247 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:32:59] <icinga-wm>	 PROBLEM - HHVM queue size on mw1258 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:33:09] <icinga-wm>	 PROBLEM - HHVM queue size on mw1029 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0]
[14:33:09] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1170 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:33:38] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1243 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:33:40] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1036 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:33:49] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1242 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:33:49] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1034 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4]
[14:33:59] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1064 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4]
[14:34:03] <paravoid>	 s1 is in trouble
[14:34:06] <paravoid>	 lots of queries
[14:34:09] <ShakespeareFan00>	 Heavey load?
[14:34:09] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1070 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [86.4]
[14:34:16] <ShakespeareFan00>	 I wonder why
[14:34:18] <paravoid>	 yeah but it passed now so I don't know what it was
[14:34:19] <icinga-wm>	 PROBLEM - HHVM queue size on mw1100 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [80.0]
[14:34:28] <icinga-wm>	 PROBLEM - HHVM queue size on mw1055 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0]
[14:34:28] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1216 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [115.2]
[14:34:28] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1055 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [86.4]
[14:34:29] <icinga-wm>	 PROBLEM - HHVM queue size on mw1110 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:34:29] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1176 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:34:30] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1029 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:34:40] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1241 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:34:59] <icinga-wm>	 PROBLEM - HHVM queue size on mw1025 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0]
[14:34:59] <icinga-wm>	 PROBLEM - HHVM queue size on mw1051 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0]
[14:34:59] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1046 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [86.4]
[14:35:09] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1214 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:35:21] <icinga-wm>	 RECOVERY - HHVM queue size on mw1253 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:35:21] <icinga-wm>	 PROBLEM - HHVM queue size on mw1074 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:35:22] <icinga-wm>	 PROBLEM - HHVM queue size on mw1241 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0]
[14:35:22] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1081 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:35:29] <paravoid>	 stupid checks
[14:35:34] <mafk>	 I'm cloning mediawiki core from gerrit hope that's not the issue
[14:35:40] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1110 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:35:40] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1243 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:35:43] <paravoid>	 no, that's not the issue mafk  :)
[14:35:48] <icinga-wm>	 PROBLEM - HHVM queue size on mw1050 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:35:49] <icinga-wm>	 PROBLEM - HHVM queue size on mw1242 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:35:49] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1102 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [86.4]
[14:35:49] <icinga-wm>	 PROBLEM - HHVM queue size on mw1036 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:35:49] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1034 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:35:58] <icinga-wm>	 PROBLEM - HHVM queue size on mw1081 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0]
[14:35:58] <icinga-wm>	 PROBLEM - HHVM queue size on mw1022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:35:58] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1245 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [115.2]
[14:35:58] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1250 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:35:59] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1064 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:36:09] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1051 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:36:09] <icinga-wm>	 PROBLEM - HHVM queue size on mw1070 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0]
[14:36:18] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1027 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:36:19] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1104 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:36:29] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1091 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:36:29] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1216 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:36:29] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1166 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:36:32] <mafk>	 we had a centralnotice for this
[14:36:40] <icinga-wm>	 PROBLEM - HHVM queue size on mw1188 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0]
[14:36:43] <ShakespeareFan00>	 Centralnotice?
[14:37:00] <icinga-wm>	 PROBLEM - HHVM queue size on mw1024 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0]
[14:37:08] <icinga-wm>	 RECOVERY - HHVM queue size on mw1025 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:37:08] <icinga-wm>	 RECOVERY - HHVM queue size on mw1258 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:37:09] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1214 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:37:10] <icinga-wm>	 RECOVERY - HHVM queue size on mw1029 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:37:19] <icinga-wm>	 RECOVERY - HHVM queue size on mw1074 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:37:19] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1081 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:37:40] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1110 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:37:48] <icinga-wm>	 RECOVERY - HHVM queue size on mw1050 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:37:49] <icinga-wm>	 RECOVERY - HHVM queue size on mw1242 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:37:58] <icinga-wm>	 RECOVERY - HHVM queue size on mw1036 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:37:58] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1242 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:37:58] <icinga-wm>	 RECOVERY - HHVM queue size on mw1022 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:37:59] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1250 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:38:10] <icinga-wm>	 RECOVERY - HHVM queue size on mw1070 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:38:19] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1070 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:38:28] <icinga-wm>	 RECOVERY - HHVM queue size on mw1100 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:38:29] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1071 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:38:29] <icinga-wm>	 RECOVERY - HHVM queue size on mw1055 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:38:30] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1091 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:38:30] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1166 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:38:30] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1055 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:38:38] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1237 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:38:38] <icinga-wm>	 RECOVERY - HHVM queue size on mw1110 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:38:39] <icinga-wm>	 RECOVERY - HHVM queue size on mw1071 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:38:39] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1029 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:38:48] <icinga-wm>	 RECOVERY - HHVM queue size on mw1188 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:38:49] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1241 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:39:08] <icinga-wm>	 RECOVERY - HHVM queue size on mw1024 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:39:09] <icinga-wm>	 RECOVERY - HHVM queue size on mw1051 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:39:09] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1046 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:39:19] <sjoerddebruin>	 Yikes
[14:39:39] <Luke081515|away>	 dewiki is down at the moment
[14:39:48] <Bsadowski1>	 Yep
[14:39:51] <sjoerddebruin>	 Everything is down, Luke081515|away.
[14:39:52] <Bsadowski1>	 (I guess)
[14:39:53] <wikibugs>	 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635297 (10faidon) Seems to have been a cascading failure: application servers were backed up waiting for databases which resulted into a full-outage. The underlying issue seems to have been a database overload across a...
[14:39:56] <Luke081515|away>	 oh, ok
[14:39:56] <paravoid>	 down again?
[14:39:59] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1036 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:39:59] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1102 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:40:00] <icinga-wm>	 RECOVERY - HHVM queue size on mw1081 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:40:00] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1245 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:40:18] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1051 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:40:20] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1027 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:40:28] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1104 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:41:03] <ShakespeareFan00>	 So is it a bug, hardware or a 'slam'?
[14:41:08] <icinga-wm>	 PROBLEM - HHVM rendering on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:09] <icinga-wm>	 PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:10] <icinga-wm>	 PROBLEM - HHVM rendering on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:10] <icinga-wm>	 PROBLEM - HHVM rendering on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:11] <icinga-wm>	 PROBLEM - HHVM rendering on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:11] <icinga-wm>	 PROBLEM - HHVM rendering on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:12] <icinga-wm>	 PROBLEM - HHVM rendering on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:12] <icinga-wm>	 PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:13] <icinga-wm>	 PROBLEM - HHVM rendering on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:14] <paravoid>	 oh great
[14:41:31] <icinga-wm>	 PROBLEM - HHVM rendering on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:31] <icinga-wm>	 RECOVERY - HHVM queue size on mw1241 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:41:39] <Luke081515|away>	 o.O
[14:41:41] <mafk>	 be calm, everything will be solved
[14:41:52] <Luke081515|away>	 at the moment, dewiki works again
[14:42:02] <mafk>	 vandals for now can't edit
[14:42:03] <ShakespeareFan00>	 mafk: The high load looks unusual
[14:42:32] <mafk>	 ShakespeareFan00: I don't know, I hardly understand, but I trust our ops team.
[14:42:58] <icinga-wm>	 RECOVERY - HHVM rendering on mw1081 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.141 second response time
[14:42:59] <icinga-wm>	 RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time
[14:42:59] <icinga-wm>	 RECOVERY - HHVM rendering on mw1067 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.142 second response time
[14:42:59] <icinga-wm>	 RECOVERY - HHVM rendering on mw1065 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.125 second response time
[14:42:59] <icinga-wm>	 RECOVERY - HHVM rendering on mw1101 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.172 second response time
[14:43:00] <icinga-wm>	 RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.151 second response time
[14:43:00] <icinga-wm>	 RECOVERY - HHVM rendering on mw1108 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.147 second response time
[14:43:00] <icinga-wm>	 RECOVERY - HHVM rendering on mw1086 is OK: HTTP OK: HTTP/1.1 200 OK - 65406 bytes in 0.139 second response time
[14:43:00] <icinga-wm>	 RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 65406 bytes in 0.133 second response time
[14:43:01] <icinga-wm>	 RECOVERY - HHVM rendering on mw1150 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.162 second response time
[14:43:01] <icinga-wm>	 RECOVERY - HHVM rendering on mw1044 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.124 second response time
[14:43:02] <icinga-wm>	 RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.242 second response time
[14:43:02] <icinga-wm>	 RECOVERY - HHVM rendering on mw1075 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.251 second response time
[14:43:03] <icinga-wm>	 RECOVERY - HHVM rendering on mw1054 is OK: HTTP OK: HTTP/1.1 200 OK - 65407 bytes in 0.131 second response time
[14:43:08] <Luke081515>	 argh
[14:43:13] <Bsadowski1>	 See, everything is fine
[14:43:18] <icinga-wm>	 RECOVERY - Apache HTTP on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.027 second response time
[14:43:20] <icinga-wm>	 RECOVERY - HHVM rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 65406 bytes in 0.103 second response time
[14:43:21] <paravoid>	 we know there's a problem, I'm investigating
[14:44:12] <mafk>	 Bsadowski1: be more careful next time with the switches :P
[14:44:49] <Bsadowski1>	 :P
[14:44:59] <Bsadowski1>	 But I'm at home
[14:45:18] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1244 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [115.2]
[14:45:29] <icinga-wm>	 PROBLEM - HHVM queue size on mw1253 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0]
[14:46:29] <icinga-wm>	 PROBLEM - HHVM queue size on mw1093 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:46:39] <icinga-wm>	 PROBLEM - HHVM queue size on mw1063 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:46:39] <icinga-wm>	 PROBLEM - HHVM queue size on mw1239 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0]
[14:46:49] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1073 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:46:50] <icinga-wm>	 PROBLEM - HHVM queue size on mw1042 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:46:50] <icinga-wm>	 PROBLEM - HHVM queue size on mw1149 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0]
[14:46:59] <icinga-wm>	 PROBLEM - HHVM queue size on mw1088 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:46:59] <icinga-wm>	 PROBLEM - HHVM queue size on mw1039 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0]
[14:47:18] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1083 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [86.4]
[14:47:29] <icinga-wm>	 PROBLEM - HHVM queue size on mw1085 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:47:29] <icinga-wm>	 PROBLEM - HHVM queue size on mw1082 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0]
[14:47:29] <icinga-wm>	 PROBLEM - HHVM queue size on mw1214 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:47:39] <icinga-wm>	 PROBLEM - HHVM queue size on mw1072 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0]
[14:47:58] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1243 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [115.2]
[14:47:58] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1054 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:47:59] <icinga-wm>	 PROBLEM - HHVM queue size on mw1113 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:47:59] <icinga-wm>	 PROBLEM - HHVM queue size on mw1103 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [80.0]
[14:48:09] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1246 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2]
[14:48:18] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1256 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [115.2]
[14:48:18] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1106 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:48:18] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1064 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:48:28] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1107 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4]
[14:48:28] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1093 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [86.4]
[14:48:28] <icinga-wm>	 PROBLEM - HHVM queue size on mw1168 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0]
[14:48:29] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1098 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:48:39] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1065 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:48:39] <icinga-wm>	 PROBLEM - HHVM queue size on mw1073 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [80.0]
[14:48:49] <icinga-wm>	 PROBLEM - HHVM queue size on mw1171 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:48:59] <icinga-wm>	 PROBLEM - HHVM queue size on mw1092 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [80.0]
[14:49:00] <icinga-wm>	 PROBLEM - HHVM queue size on mw1188 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0]
[14:49:00] <icinga-wm>	 PROBLEM - HHVM queue size on mw1151 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0]
[14:49:18] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1083 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:49:18] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1244 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:49:29] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1039 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4]
[14:49:30] <icinga-wm>	 PROBLEM - HHVM queue size on mw1018 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [80.0]
[14:49:30] <icinga-wm>	 RECOVERY - HHVM queue size on mw1085 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:49:30] <icinga-wm>	 RECOVERY - HHVM queue size on mw1082 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:49:38] <icinga-wm>	 RECOVERY - HHVM queue size on mw1253 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:49:38] <icinga-wm>	 RECOVERY - HHVM queue size on mw1214 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:49:40] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1108 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [86.4]
[14:49:40] <icinga-wm>	 RECOVERY - HHVM queue size on mw1072 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:49:59] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1243 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:49:59] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1054 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:49:59] <icinga-wm>	 RECOVERY - HHVM queue size on mw1113 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:49:59] <icinga-wm>	 RECOVERY - HHVM queue size on mw1103 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:10] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1246 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:50:18] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1106 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:50:19] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1256 is OK: OK: Less than 30.00% above the threshold [76.8]
[14:50:28] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1107 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:50:31] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1093 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:50:31] <icinga-wm>	 RECOVERY - HHVM queue size on mw1168 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:31] <icinga-wm>	 RECOVERY - HHVM queue size on mw1093 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:31] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1098 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:50:39] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1065 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:50:40] <icinga-wm>	 RECOVERY - HHVM queue size on mw1063 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:40] <icinga-wm>	 RECOVERY - HHVM queue size on mw1073 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:40] <icinga-wm>	 RECOVERY - HHVM queue size on mw1239 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:50] <icinga-wm>	 RECOVERY - HHVM queue size on mw1171 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:58] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1073 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:50:59] <icinga-wm>	 RECOVERY - HHVM queue size on mw1042 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:59] <icinga-wm>	 RECOVERY - HHVM queue size on mw1149 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:59] <icinga-wm>	 RECOVERY - HHVM queue size on mw1188 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:50:59] <icinga-wm>	 RECOVERY - HHVM queue size on mw1092 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:51:00] <icinga-wm>	 RECOVERY - HHVM queue size on mw1151 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:51:08] <icinga-wm>	 RECOVERY - HHVM queue size on mw1039 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:51:08] <icinga-wm>	 RECOVERY - HHVM queue size on mw1088 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:51:29] <icinga-wm>	 RECOVERY - HHVM queue size on mw1018 is OK: OK: Less than 30.00% above the threshold [10.0]
[14:51:41] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1108 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:52:19] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1064 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:53:29] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1039 is OK: OK: Less than 30.00% above the threshold [57.6]
[14:53:34] <sjoerddebruin>	 "mwoauthdatastore-callback-not-found"
[14:57:09] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:09:35] <ShakespeareFan00>	 Things seemingly a bit more stable?
[15:09:43] <wikibugs>	 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635329 (10faidon) 5Open>3Resolved a:3faidon The cause was determined to be an attack (in three bursts) on our servers, in a successful attempt to overload them. I won't say more — we have a policy on not document...
[15:12:21] <mafk>	 paravoid: any account you wish to have globally locked or IPs globally blocked?
[15:12:40] <paravoid>	 no, thanks
[15:12:53] <mafk>	 ok, please ping if you change your mind :)
[15:13:01] <mafk>	 and thanks for resolving this
[15:13:55] <ShakespeareFan00>	 mafk:  So what happened?  I appreciate there may not be a lot you can says for operational reasons
[15:14:11] <mafk>	 ShakespeareFan00: I know as much as you.
[15:14:20] <mafk>	 --> afk
[15:15:23] <wikibugs>	 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1635337 (10Aklapper) Well, if it's 2.40.10 it'll also fix T112421 and T97758
[15:16:24] <wikibugs>	 6operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, 5ContentTranslation-Release6: Apertium Failed to load resource: net::ERR_SPDY_PROTOCOL_ERROR - https://phabricator.wikimedia.org/T112403#1635339 (10Oscar) 5Open>3Resolved a:3Oscar It was the webshield...
[15:17:41] <ShakespeareFan00>	 ^ Avast seems to have issues
[15:18:07] <ShakespeareFan00>	 I've logged a bug in Mozilla's Bugzilla concerning it's apparent incompatability with Firefox
[15:18:16] <wikibugs>	 6operations: Intermittent 503's on multiple sites - https://phabricator.wikimedia.org/T112463#1635345 (10Luke081515) @faidon Thanks for quick resolving
[15:18:52] <ShakespeareFan00>	 The view I got in #firefox about Avast was it was also causing problems for Firefox support people as well
[15:20:50] <ShakespeareFan00>	 Might be worth looking into that because an anti virus shouldn't break normal use ;)
[15:20:53] * ShakespeareFan00 out
[15:34:04] <wikibugs>	 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1635353 (10Ricordisamoa) >>! In T112421#1635337, @Aklapper wrote: > Well, if it's 2.40.10 it'll also fix T112421 and T97758  The first one is this one.
[15:46:59] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Bump disk_allocation_ratio up a bit more. [puppet] - 10https://gerrit.wikimedia.org/r/237995 (https://phabricator.wikimedia.org/T111988) 
[15:48:02] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Bump disk_allocation_ratio up a bit more. [puppet] - 10https://gerrit.wikimedia.org/r/237995 (https://phabricator.wikimedia.org/T111988) (owner: 10Andrew Bogott)
[15:56:31] <multichill>	 paravoid: DDOS season started again about two weeks ago :-(
[16:00:07] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: mediawiki: kill HHVM graphite checks [puppet] - 10https://gerrit.wikimedia.org/r/237998 
[16:00:25] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: mediawiki: kill HHVM graphite checks [puppet] - 10https://gerrit.wikimedia.org/r/237998 
[16:12:29] <icinga-wm>	 PROBLEM - puppet last run on mw2086 is CRITICAL: CRITICAL: puppet fail
[16:26:19] <grrrit-wm>	 (03CR) 10Brian Wolff: [C: 031] Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: 10Dereckson)
[16:38:48] <icinga-wm>	 RECOVERY - puppet last run on mw2086 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[16:50:35] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1635456 (10Jalexander) What's the planned timeline for the migration/change? Still planned for this coming week?
[16:52:41] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1635457 (10JohnLewis) >>! In T110949#1635456, @Jalexander wrote: > What's the planned timeline for the migration/change? Still planned for this coming week?  To my knowledge, we've still not resch...
[17:06:17] <wikibugs>	 6operations, 7Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1635469 (10jcrespo) 3NEW a:3jcrespo
[17:06:26] <wikibugs>	 6operations, 7Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1635477 (10jcrespo) p:5Triage>3High
[17:11:10] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 86 data above and 8 below the confidence bounds
[17:17:29] <icinga-wm>	 PROBLEM - puppet last run on mw2010 is CRITICAL: CRITICAL: puppet fail
[17:45:58] <icinga-wm>	 RECOVERY - puppet last run on mw2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:48:30] <wikibugs>	 6operations, 7Database: Upgrade db1055 mysql version and configuration, and reduce its pool weight - https://phabricator.wikimedia.org/T112478#1635580 (10jcrespo) 3NEW a:3jcrespo
[17:52:04] <wikibugs>	 6operations, 7Database: Check, test and tune pool-of-connections and max_connections configuration - https://phabricator.wikimedia.org/T112479#1635595 (10jcrespo) 3NEW a:3jcrespo
[18:08:11] <Krenair>	 ori, please disable https://phabricator.wikimedia.org/p/Jmiguel2902/
[18:09:15] <Krenair>	 user is blocked as a vandalism-only account on mediawiki.org too
[18:12:20] <wikibugs>	 6operations, 6Release-Engineering-Team: tin disk space at 5% - https://phabricator.wikimedia.org/T112391#1635641 (10jcrespo) Original report of <6% disk space alert is gone. I am retiring operations from the projects (sadly, we do not have a "Done" column), and let tin's users full control of it (close the tic...
[18:21:23] <wikibugs>	 6operations, 7Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1635662 (10jcrespo) In particular, not only should we monitor the current connections, but the peak since the last check, as otherwise it may be undetected for...
[18:47:18] <icinga-wm>	 RECOVERY - Disk space on labstore1002 is OK: DISK OK
[18:49:33] <ori>	 Krenair: k, done
[18:49:46] <Krenair>	 thanks
[19:21:49] <icinga-wm>	 PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail
[19:48:09] <icinga-wm>	 RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[21:03:43] <wikibugs>	 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1635839 (10Aklapper) @RobH: So... what's left to do here? (Task is still open but last comment says "just reopen this task".)
[21:08:18] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[21:08:33] <wikibugs>	 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1635845 (10JohnLewis)
[21:08:35] <wikibugs>	 6operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1635843 (10JohnLewis) 5Open>3Resolved @aklapper nothing. seems he just forgot to resolve the task.
[21:10:19] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10772 bytes in 0.334 second response time
[21:20:35] <grrrit-wm>	 (03PS1) 10Ori.livneh: HHVM: enable stats collection for MySQL usage on canary servers [puppet] - 10https://gerrit.wikimedia.org/r/238073 
[21:21:34] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM: enable stats collection for MySQL usage on canary servers [puppet] - 10https://gerrit.wikimedia.org/r/238073 (owner: 10Ori.livneh)
[21:23:51] <wikibugs>	 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1635868 (10Tgr)
[21:24:06] <wikibugs>	 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (10Tgr)
[21:54:05] <YuviPanda>	 !bash <anomie> I spotted another one of these: Criminal Minds, "The Hunt", at timestamp 26:10 on Netflix. Apparently the NewPP limit report is part of the login process for some underground girl-auctioning site.
[22:44:06] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 031] "Good catch." [debs/pybal] - 10https://gerrit.wikimedia.org/r/237986 (https://phabricator.wikimedia.org/T112457) (owner: 10Faidon Liambotis)