[00:51:36] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 323 seconds
[00:52:04] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 354 seconds
[00:53:04] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:53:43] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[03:09:58] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: puppet fail
[03:28:38] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[04:14:08] PROBLEM - RAID on nickel is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:23:31] RECOVERY - RAID on nickel is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0
[04:53:08] PROBLEM - puppet last run on mw1087 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:10:27] RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:28:18] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:28] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail
[06:28:37] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: puppet fail
[06:28:38] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail
[06:28:47] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:52] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:34:48] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.076 second response time
[06:45:47] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:45:50] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:48] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:49:38] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:50:28] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: puppet fail
[07:08:07] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:09:04] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:57:55] (PS1) Ori.livneh: Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - https://gerrit.wikimedia.org/r/172102
[08:14:28] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:31:49] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[12:05:30] (CR) Hashar: [C: -1] "Mind adding a tox.ini as well? From there one could easily add Jenkins jobs to run the tests ( see https://www.mediawiki.org/wiki/Continu" [debs/pybal] - https://gerrit.wikimedia.org/r/172018 (owner: Ori.livneh)
[14:28:54] (PS1) Nemo bis: Task recommendations experiment is over [mediawiki-config] - https://gerrit.wikimedia.org/r/172110
[14:31:57] (CR) MZMcBride: Task recommendations experiment is over (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172110 (owner: Nemo bis)
[14:41:06] (CR) Nemo bis: Task recommendations experiment is over (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172110 (owner: Nemo bis)
[14:46:50] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:47:51] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.904 second response time
[15:12:30] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:13:30] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.556 second response time
[15:13:54] (CR) MZMcBride: Task recommendations experiment is over (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172110 (owner: Nemo bis)
[15:15:14] (CR) MZMcBride: "I should add... thank you for submitting this! Actively pruning and de-cluttering the configuration files is important work. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/172110 (owner: Nemo bis)
[15:32:00] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:01] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.417 second response time
[15:37:11] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:43:31] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.615 second response time
[15:44:59] (PS1) Matanya: (bug 73197) enable Patrolled edits on Hebrew Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/172112
[15:50:53] (CR) Glaisher: (bug 73197) enable Patrolled edits on Hebrew Wiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172112 (owner: Matanya)
[15:53:35] (PS2) Matanya: (bug 73197) enable Patrolled edits on Hebrew Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/172112
[15:53:51] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:53:59] (CR) Matanya: (bug 73197) enable Patrolled edits on Hebrew Wiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172112 (owner: Matanya)
[15:55:50] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.427 second response time
[16:05:15] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:08:20] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.526 second response time
[16:20:51] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:25:50] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.200 second response time
[16:40:12] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:43:10] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time
[16:51:42] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:40] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.906 second response time
[16:53:49] 256 requests currently being processed, 0 idle workers
[16:53:51] wow
[16:56:23] _joe_: Around?
[17:04:11] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:05:02] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.873 second response time
[17:08:11] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:09:10] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.462 second response time
[17:11:19] !log mw1192 stuck with almost no idle workers as most workers are in the "Gracefully finishing" state. Attempted to gracefully restart it, but that (to no surprise) didn't help.
[17:11:24] Logged the message, Master
[17:11:50] Someones needs to apply some un-graceful force to it :D
[17:17:33] hoo: it is like this since almost 4 hours
[17:17:41] ...
[17:17:54] I can't depool it, so I better keep my hands of the dirty tools :P
[17:18:29] they're all probably waiting for a file handle or udp2log or something
[17:23:14] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:24:01] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[17:37:13] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:39:10] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.055 second response time
[17:56:31] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:57:08] could someone please depool that thing ... :P
[17:59:43] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.964 second response time
[18:09:21] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 108.24, 101.10, 97.79
[18:14:25] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 101.39, 100.56, 98.46
[18:17:32] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:17:33] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 99.03, 100.44, 98.83
[18:18:21] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.774 second response time
[18:34:41] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:37:51] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.609 second response time
[18:47:21] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 113.88, 102.90, 99.51
[19:03:41] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 111.93, 101.29, 99.49
[19:03:53] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:05:41] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.345 second response time
[19:08:54] (PS1) Brian Wolff: Increase max file size of url downloader proxy to 1010mb [puppet] - https://gerrit.wikimedia.org/r/172120 (https://bugzilla.wikimedia.org/73200)
[19:09:52] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 103.42, 100.10, 99.25
[19:10:56] <_joe_> hoo|away: I am now
[19:12:40] <_joe_> !log restarted apache on mw1192, this time an hard restart
[19:12:44] Logged the message, Master
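The "256 requests currently being processed, 0 idle workers" paste at 16:53 and the "Gracefully finishing" diagnosis in the !log both come from Apache's mod_status scoreboard on mw1192. As a rough sketch of how that state could be checked programmatically, assuming mod_status is enabled and its machine-readable ?auto output is reachable at /server-status (the hostname and the Python 2 / urllib2 style are illustrative assumptions, not taken from the log):

    # Sketch: report busy vs. idle Apache workers from mod_status's ?auto output.
    # Assumes http://<host>/server-status?auto is reachable from where this runs.
    import urllib2

    def worker_counts(host):
        raw = urllib2.urlopen('http://%s/server-status?auto' % host, timeout=10).read()
        # ?auto output is "Key: value" per line, including BusyWorkers and IdleWorkers.
        fields = dict(line.split(': ', 1) for line in raw.splitlines() if ': ' in line)
        return int(fields['BusyWorkers']), int(fields['IdleWorkers'])

    busy, idle = worker_counts('mw1192.eqiad.wmnet')   # hostname is an assumption
    print('busy workers: %d, idle workers: %d' % (busy, idle))

With every worker busy or stuck finishing gracefully, the Icinga "Apache HTTP" check above flaps between socket timeouts and slow recoveries, which is exactly the pattern in the log.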
[19:15:03] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 100.83, 100.30, 99.41
[19:16:01] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 862.599976
[19:17:01] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 102.13, 101.02, 99.78
[19:18:20] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 248.333328
[19:19:21] _joe_: Thanks, that's all
[19:19:31] Just out of curiosity... did you depool it first?
[19:20:10] <_joe_> it was already temporarily depooled by pybal, so no need
[19:20:24] <_joe_> In general when doing an hard restart, I do depool servers
[19:20:53] Yeah, thought so
[19:21:00] <_joe_> (note that we still served quite a few errors)
[19:21:26] <_joe_> hoo: the plan is to make it so that the depooling is automatic whenever we have to do an hard restart
[19:21:52] I think I picked up something about that
[19:22:07] could be done on the upstart level
[19:22:23] <_joe_> (because, well, hhvm sucks at restarting gracefully)
[19:27:30] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:31:00] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 469.966675
[19:32:41] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 101.46, 100.01, 99.85
[19:35:55] I happened to notice that modules/url_downloader/templates/squid.conf.erb has acls blocking esams and eqiad but not ulsfo. Is that a problem, or does it not matter?
[19:36:30] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 634.033325
[19:39:30] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:43:23] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:46:24] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 100.78, 100.19, 99.65
[19:53:26] labmon1001 DISK WARNING - free space: /srv 95356 MB (4% inode=97%):
[19:53:31] YuviPanda: ^
[20:52:01] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:09:21] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[21:51:43] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail
[21:56:37] <_joe_> !log depooling mw1189 from the api pool, see https://phabricator.wikimedia.org/T1194
[21:56:41] Logged the message, Master
[22:09:53] (CR) Ori.livneh: "I don't mind, but I wish you didn't -1 for that." [debs/pybal] - https://gerrit.wikimedia.org/r/172018 (owner: Ori.livneh)
[22:11:22] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[22:34:12] (PS2) Ori.livneh: Add unit tests for `pybal.util.LogFile` [debs/pybal] - https://gerrit.wikimedia.org/r/172089
[22:34:14] (PS2) Ori.livneh: Add .travis.yml file to enable automated tests on Travis CI [debs/pybal] - https://gerrit.wikimedia.org/r/172019
[22:34:16] (PS2) Ori.livneh: Add tests for pybal.util.ConfigDict [debs/pybal] - https://gerrit.wikimedia.org/r/172018
[22:34:18] (PS2) Ori.livneh: Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - https://gerrit.wikimedia.org/r/172102
[22:41:51] (PS1) John F. Lewis: map ipv6 on magnesium [puppet] - https://gerrit.wikimedia.org/r/172179
[22:42:29] (PS2) John F. Lewis: map ipv6 on magnesium [puppet] - https://gerrit.wikimedia.org/r/172179
[22:44:11] (CR) Ori.livneh: "@hashar: Amended to add tox.ini. Zuul config done in I2390d14ab." [debs/pybal] - https://gerrit.wikimedia.org/r/172018 (owner: Ori.livneh)
[22:56:10] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 352739 msg: ocg_render_job_queue 3519 msg (>=3000 critical)
[22:56:40] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 352792 msg: ocg_render_job_queue 3303 msg (>=3000 critical)
[22:57:01] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 352838 msg: ocg_render_job_queue 3117 msg (>=3000 critical)
[23:03:00] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 353510 msg: ocg_render_job_queue 156 msg
[23:03:02] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 353529 msg: ocg_render_job_queue 72 msg
[23:03:22] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 353540 msg: ocg_render_job_queue 0 msg
[23:44:27] !log Changed the email for a global account. Bug 73014.
[23:44:33] Logged the message, Master
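On the depool-before-hard-restart idea discussed at 19:19-19:22: pybal already depools a backend once its health check stops answering (which is how mw1192 ended up temporarily depooled), so one way to do it "on the upstart level" is to deliberately fail the health check before stopping the service and restore it after it comes back up. The stanza below is only a sketch of that idea; the job file, flag path, and sleep interval are hypothetical, not the actual WMF configuration:

    # /etc/init/hhvm.conf (hypothetical excerpt, Upstart job syntax)
    pre-stop script
        # Make the local health-check endpoint fail so pybal marks the host
        # down, then give its monitor time to notice before the service stops.
        touch /var/run/hhvm/depooled     # hypothetical flag read by the health-check URL
        sleep 15
    end script

    post-start script
        # Repool once the service is answering again.
        rm -f /var/run/hhvm/depooled
    end script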
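For the tox.ini Hashar asked for on change 172018 (and which Ori amended in at 22:44), something along these lines would be enough for Jenkins or Travis to invoke the new unit tests; the envlist, dependencies, and test command here are guesses for illustration, not the contents of the actual patch:

    # tox.ini (hypothetical minimal version for the pybal test changes above)
    [tox]
    envlist = py27

    [testenv]
    deps =
        twisted
        mock
    commands = python -m unittest discover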