[00:51:36] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 323 seconds
[00:52:04] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 354 seconds
[00:53:04] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:53:43] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[03:09:58] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: puppet fail
[03:28:38] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[04:14:08] PROBLEM - RAID on nickel is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:23:31] RECOVERY - RAID on nickel is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0
[04:53:08] PROBLEM - puppet last run on mw1087 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:10:27] RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:28:18] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:28] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail
[06:28:37] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: puppet fail
[06:28:38] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail
[06:28:47] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:52] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:34:48] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.076 second response time
[06:45:47] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:45:50] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:48] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:49:38] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:50:28] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: puppet fail
[07:08:07] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:09:04] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:57:55] (PS1) Ori.livneh: Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - https://gerrit.wikimedia.org/r/172102
[08:14:28] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:31:49] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[12:05:30] (CR) Hashar: [C: -1] "Mind adding a tox.ini as well? From there one could easily add Jenkins jobs to run the tests ( see https://www.mediawiki.org/wiki/Continu" [debs/pybal] - https://gerrit.wikimedia.org/r/172018 (owner: Ori.livneh)
[14:28:54] (PS1) Nemo bis: Task recommendations experiment is over [mediawiki-config] - https://gerrit.wikimedia.org/r/172110
[14:31:57] (CR) MZMcBride: Task recommendations experiment is over (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172110 (owner: Nemo bis)
[14:41:06] (CR) Nemo bis: Task recommendations experiment is over (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172110 (owner: Nemo bis)
[14:46:50] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:47:51] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.904 second response time
[15:12:30] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:13:30] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.556 second response time
[15:13:54] (CR) MZMcBride: Task recommendations experiment is over (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172110 (owner: Nemo bis)
[15:15:14] (CR) MZMcBride: "I should add... thank you for submitting this! Actively pruning and de-cluttering the configuration files is important work. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/172110 (owner: Nemo bis)
[15:32:00] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:01] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.417 second response time
[15:37:11] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:43:31] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.615 second response time
[15:44:59] (PS1) Matanya: (bug 73197) enable Patrolled edits on Hebrew Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/172112
[15:50:53] (CR) Glaisher: (bug 73197) enable Patrolled edits on Hebrew Wiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172112 (owner: Matanya)
[15:53:35] (PS2) Matanya: (bug 73197) enable Patrolled edits on Hebrew Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/172112
[15:53:51] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:53:59] (CR) Matanya: (bug 73197) enable Patrolled edits on Hebrew Wiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172112 (owner: Matanya)
[15:55:50] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.427 second response time
[16:05:15] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:08:20] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.526 second response time
[16:20:51] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:25:50] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.200 second response time
[16:40:12] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:43:10] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time
[16:51:42] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:40] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.906 second response time
[16:53:49] 256 requests currently being processed, 0 idle workers
[16:53:51] wow
[16:56:23] _joe_: Around?
[17:04:11] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:05:02] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.873 second response time
[17:08:11] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:09:10] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.462 second response time
[17:11:19] !log mw1192 stuck with almost no idle workers as most workers are in the "Gracefully finishing" state. Attempted to gracefully restart it, but that (to no surprise) didn't help.
[17:11:24] Logged the message, Master
[17:11:50] Someones needs to apply some un-graceful force to it :D
[17:17:33] hoo: it is like this since almost 4 hours
[17:17:41] ...
[17:17:54] I can't depool it, so I better keep my hands of the dirty tools :P
[17:18:29] they're all probably waiting for a file handle or udp2log or something
[17:23:14] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:24:01] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[17:37:13] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:39:10] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.055 second response time
[17:56:31] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:57:08] could someone please depool that thing ... :P
[17:59:43] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.964 second response time
[18:09:21] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 108.24, 101.10, 97.79
[18:14:25] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 101.39, 100.56, 98.46
[18:17:32] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:17:33] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 99.03, 100.44, 98.83
[18:18:21] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.774 second response time
[18:34:41] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:37:51] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.609 second response time
[18:47:21] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 113.88, 102.90, 99.51
[19:03:41] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 111.93, 101.29, 99.49
[19:03:53] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:05:41] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.345 second response time
[19:08:54] (PS1) Brian Wolff: Increase max file size of url downloader proxy to 1010mb [puppet] - https://gerrit.wikimedia.org/r/172120 (https://bugzilla.wikimedia.org/73200)
[19:09:52] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 103.42, 100.10, 99.25
[19:10:56] <_joe_> hoo|away: I am now
[19:12:40] <_joe_> !log restarted apache on mw1192, this time an hard restart
[19:12:44] Logged the message, Master
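The "256 requests currently being processed, 0 idle workers" paste at 16:53 and the "Gracefully finishing" diagnosis in the !log both come from Apache's mod_status scoreboard on mw1192. As a rough sketch of how that state could be checked programmatically, assuming mod_status is enabled and its machine-readable ?auto output is reachable at /server-status (the hostname and the Python 2 / urllib2 style are illustrative assumptions, not taken from the log):

    # Sketch: report busy vs. idle Apache workers from mod_status's ?auto output.
    # Assumes http://<host>/server-status?auto is reachable from where this runs.
    import urllib2

    def worker_counts(host):
        raw = urllib2.urlopen('http://%s/server-status?auto' % host, timeout=10).read()
        # ?auto output is "Key: value" per line, including BusyWorkers and IdleWorkers.
        fields = dict(line.split(': ', 1) for line in raw.splitlines() if ': ' in line)
        return int(fields['BusyWorkers']), int(fields['IdleWorkers'])

    busy, idle = worker_counts('mw1192.eqiad.wmnet')   # hostname is an assumption
    print('busy workers: %d, idle workers: %d' % (busy, idle))

With every worker busy or stuck finishing gracefully, the Icinga "Apache HTTP" check above flaps between socket timeouts and slow recoveries, which is exactly the pattern in the log.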
[19:15:03] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 100.83, 100.30, 99.41
[19:16:01] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 862.599976
[19:17:01] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 102.13, 101.02, 99.78
[19:18:20] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 248.333328
[19:19:21] _joe_: Thanks, that's all
[19:19:31] Just out of curiosity... did you depool it first?
[19:20:10] <_joe_> it was already temporarily depooled by pybal, so no need
[19:20:24] <_joe_> In general when doing an hard restart, I do depool servers
[19:20:53] Yeah, thought so
[19:21:00] <_joe_> (note that we still served quite a few errors)
[19:21:26] <_joe_> hoo: the plan is to make it so that the depooling is automatic whenever we have to do an hard restart
[19:21:52] I think I picked up something about that
[19:22:07] could be done on the upstart level
[19:22:23] <_joe_> (because, well, hhvm sucks at restarting gracefully)
[19:27:30] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:31:00] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 469.966675
[19:32:41] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 101.46, 100.01, 99.85
[19:35:55] I happened to notice that modules/url_downloader/templates/squid.conf.erb has acls blocking esams and eqiad but not ulsfo. Is that a problem, or does it not matter?
[19:36:30] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 634.033325
[19:39:30] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:43:23] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[19:46:24] PROBLEM - very high load average likely xfs on ms-be2011 is CRITICAL: CRITICAL - load average: 100.78, 100.19, 99.65
[19:53:26] labmon1001 DISK WARNING - free space: /srv 95356 MB (4% inode=97%):
[19:53:31] YuviPanda: ^
[20:52:01] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:09:21] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[21:51:43] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail
[21:56:37] <_joe_> !log depooling mw1189 from the api pool, see https://phabricator.wikimedia.org/T1194
[21:56:41] Logged the message, Master
[22:09:53] (CR) Ori.livneh: "I don't mind, but I wish you didn't -1 for that." [debs/pybal] - https://gerrit.wikimedia.org/r/172018 (owner: Ori.livneh)
[22:11:22] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[22:34:12] (PS2) Ori.livneh: Add unit tests for `pybal.util.LogFile` [debs/pybal] - https://gerrit.wikimedia.org/r/172089
[22:34:14] (PS2) Ori.livneh: Add .travis.yml file to enable automated tests on Travis CI [debs/pybal] - https://gerrit.wikimedia.org/r/172019
[22:34:16] (PS2) Ori.livneh: Add tests for pybal.util.ConfigDict [debs/pybal] - https://gerrit.wikimedia.org/r/172018
[22:34:18] (PS2) Ori.livneh: Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - https://gerrit.wikimedia.org/r/172102
[22:41:51] (PS1) John F. Lewis: map ipv6 on magnesium [puppet] - https://gerrit.wikimedia.org/r/172179
[22:42:29] (PS2) John F. Lewis: map ipv6 on magnesium [puppet] - https://gerrit.wikimedia.org/r/172179
[22:44:11] (CR) Ori.livneh: "@hashar: Amended to add tox.ini. Zuul config done in I2390d14ab." [debs/pybal] - https://gerrit.wikimedia.org/r/172018 (owner: Ori.livneh)
[22:56:10] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 352739 msg: ocg_render_job_queue 3519 msg (>=3000 critical)
[22:56:40] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 352792 msg: ocg_render_job_queue 3303 msg (>=3000 critical)
[22:57:01] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 352838 msg: ocg_render_job_queue 3117 msg (>=3000 critical)
[23:03:00] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 353510 msg: ocg_render_job_queue 156 msg
[23:03:02] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 353529 msg: ocg_render_job_queue 72 msg
[23:03:22] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 353540 msg: ocg_render_job_queue 0 msg
[23:44:27] !log Changed the email for a global account. Bug 73014.
[23:44:33] Logged the message, Master
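On the depool-before-hard-restart idea discussed at 19:19-19:22: pybal already depools a backend once its health check stops answering (which is how mw1192 ended up temporarily depooled), so one way to do it "on the upstart level" is to deliberately fail the health check before stopping the service and restore it after it comes back up. The stanza below is only a sketch of that idea; the job file, flag path, and sleep interval are hypothetical, not the actual WMF configuration:

    # /etc/init/hhvm.conf (hypothetical excerpt, Upstart job syntax)
    pre-stop script
        # Make the local health-check endpoint fail so pybal marks the host
        # down, then give its monitor time to notice before the service stops.
        touch /var/run/hhvm/depooled     # hypothetical flag read by the health-check URL
        sleep 15
    end script

    post-start script
        # Repool once the service is answering again.
        rm -f /var/run/hhvm/depooled
    end script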
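For the tox.ini Hashar asked for on change 172018 (and which Ori amended in at 22:44), something along these lines would be enough for Jenkins or Travis to invoke the new unit tests; the envlist, dependencies, and test command here are guesses for illustration, not the contents of the actual patch:

    # tox.ini (hypothetical minimal version for the pybal test changes above)
    [tox]
    envlist = py27

    [testenv]
    deps =
        twisted
        mock
    commands = python -m unittest discover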