[00:02:53] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[00:03:47] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[00:11:35] PROBLEM - Puppet freshness on srv245 is CRITICAL: Puppet has not run in the last 10 hours
[00:32:35] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[00:34:23] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[00:35:17] RECOVERY - MySQL disk space on neon is OK: DISK OK
[00:43:45] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[00:46:05] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[01:10:32] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:57:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 196 seconds
[01:58:25] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 239 seconds
[01:58:26] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 218 seconds
[01:58:52] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 232 seconds
[02:02:10] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds
[02:02:28] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 1 seconds
[02:12:49] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 189 seconds
[02:13:34] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 197 seconds
[02:15:22] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (315333), Total (318470)
[02:17:28] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (306538), Total (313663)
[02:25:07] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[02:29:30] !log LocalisationUpdate completed (1.21wmf10) at Sun Feb 24 02:29:29 UTC 2013
[02:29:34] Logged the message, Master
[02:34:25] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds
[02:35:10] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[02:54:31] !log LocalisationUpdate completed (1.21wmf9) at Sun Feb 24 02:54:30 UTC 2013
[02:54:33] Logged the message, Master
[03:50:31] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 184 seconds
[03:56:13] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[03:58:46] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours
[03:59:49] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours
[04:21:43] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours
[04:27:43] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours
[04:38:49] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours
[04:40:11] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 189 seconds
[04:40:37] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 187 seconds
[04:41:58] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[04:42:25] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
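The check_job_queue alerts above trip whenever any single wiki has more than 9,999 queued jobs. A minimal sketch of that kind of threshold check is below; the 9,999 cutoff and the message wording come from the alert text, while the function name and the idea of passing per-wiki counts in directly are illustrative assumptions, not the actual plugin running on neon/spence.

```python
# Illustrative sketch of a job-queue threshold check in the spirit of the
# check_job_queue alerts above; not the actual Nagios plugin. The per-wiki
# counts would normally come from MediaWiki; here they are passed in directly.

def check_job_queues(queue_sizes, threshold=9999):
    """Return (is_critical, message) for a dict of wiki -> job count."""
    offenders = {wiki: n for wiki, n in queue_sizes.items() if n > threshold}
    total = sum(queue_sizes.values())
    if offenders:
        detail = ", ".join(f"{wiki} ({n})" for wiki, n in sorted(offenders.items()))
        return True, (f"JOBQUEUE CRITICAL - the following wikis have more than "
                      f"{threshold:,} jobs: {detail}, Total ({total})")
    return False, "JOBQUEUE OK - all job queues below 10,000"

# Example mirroring the 02:15 alert (second wiki count is made up to fill the total):
print(check_job_queues({"commonswiki": 315333, "enwiki": 3137})[1])
```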
[04:47:49] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours
[04:47:49] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours
[04:55:28] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[04:55:55] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[05:10:46] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:49] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 207 seconds
[05:23:58] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 181 seconds
[05:24:07] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 219 seconds
[05:25:46] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds
[05:32:17] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:35:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 189 seconds
[05:35:45] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.033 second response time on port 8123
[05:35:55] 9 seconds...
[05:36:02] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 195 seconds
[05:36:09] that's higher than past recoveries
[05:36:29] and is just below the imperical threshold for the intermittent search problems bug
[05:38:02] errrr, empirical
[05:38:03] *
[05:38:08] * jeremyb_ is sleepy?
[05:39:38] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[05:39:50] hrm
[05:40:14] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds
[05:40:14] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds
[05:41:08] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.025 second response time on port 8123
[05:49:41] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:52:03] apergos: ^
[05:52:44] (twice in 15 mins and also the search1016 flap too)
[05:56:44] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.023 second response time on port 8123
[06:01:05] RECOVERY - MySQL disk space on neon is OK: DISK OK
[06:03:02] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[06:09:38] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 6 seconds
[06:10:50] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[06:11:45] pool4 (probably search1016) definitely needs a kick
[06:14:17] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[06:15:22] wooo
[06:21:47] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123
[06:22:41] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[06:24:12] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[06:28:55] !log restarted search1016 lucene
[06:28:57] Logged the message, Master
[06:30:49] good morning
[06:31:06] mostly not here of course (sunday am)
[06:32:25] yah
[06:34:21] * jeremyb_ runs away
[06:36:10] (all 3 of my test queries on commons that were returning empty every time and slowly now work fairly fast and not empty.)
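The "LVS Lucene" and "Lucene" checks above report plain TCP response times to port 8123, and the diagnosis here hinges on how long the connection took (9 seconds vs. 0.027). Below is a rough sketch of such a probe; the host and port are taken from the log, while the timeout and the 5-second warning threshold are assumptions for illustration, not the real check configuration.

```python
# Rough sketch of a TCP probe in the spirit of the "LVS Lucene" checks above:
# time how long a connection to the search port takes and flag slow or failed
# attempts. Thresholds are illustrative assumptions, not the real check_tcp setup.
import socket
import time

def probe_tcp(host, port=8123, timeout=10.0, warn_after=5.0):
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        return "CRITICAL", f"Connection failed: {exc}"
    state = "WARNING" if elapsed > warn_after else "OK"
    return state, f"TCP {state} - {elapsed:.3f} second response time on port {port}"

if __name__ == "__main__":
    print(*probe_tcp("search-pool4.svc.eqiad.wmnet"))
```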
[06:40:32] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:51:29] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.25 ms
[06:53:17] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[06:55:14] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds
[06:55:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds
[07:09:38] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[07:11:53] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[07:14:10] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[07:14:44] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[07:17:27] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[07:18:29] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[07:49:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:55:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.066 seconds
[07:56:53] PROBLEM - Puppet freshness on mw64 is CRITICAL: Puppet has not run in the last 10 hours
[07:56:53] PROBLEM - Puppet freshness on mw1039 is CRITICAL: Puppet has not run in the last 10 hours
[08:02:26] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds
[08:03:56] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 239 seconds
[08:24:29] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (487561), Total (488689)
[08:25:05] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (486030), Total (487924)
[08:26:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.886 seconds
[08:39:56] PROBLEM - Puppet freshness on db1009 is CRITICAL: Puppet has not run in the last 10 hours
[08:41:53] PROBLEM - Puppet freshness on mw1134 is CRITICAL: Puppet has not run in the last 10 hours
[09:18:28] RECOVERY - Puppet freshness on srv245 is OK: puppet ran at Sun Feb 24 09:18:13 UTC 2013
[09:23:25] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[09:24:01] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[09:54:37] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[09:54:37] RECOVERY - MySQL disk space on neon is OK: DISK OK
[10:08:43] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds
[10:09:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 218 seconds
[10:34:16] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[10:36:40] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[10:37:07] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[11:11:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:22:55] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds
[11:23:23] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds
[11:23:49] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:40] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.31 ms
[11:36:07] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[11:36:16] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[12:26:31] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[12:29:49] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 193 seconds
[12:30:24] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 221 seconds
[12:37:34] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[12:38:01] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds
[13:26:13] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:26:23] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:30:07] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[13:32:04] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds
[14:00:34] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours
[14:01:28] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours
[14:18:08] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[14:21:25] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[14:29:45] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours
[14:30:21] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 197 seconds
[14:30:48] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 207 seconds
[14:34:24] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[14:34:44] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50467
[14:35:45] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 22 seconds
[14:39:57] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours
[14:48:57] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours
[14:48:57] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours
[15:08:27] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:08:54] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:03] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:15] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:09] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[15:17:45] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[15:19:15] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.055 seconds
[15:20:36] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.50 ms
[15:26:36] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[15:28:42] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
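The many "Puppet freshness" alerts above all apply the same rule: flag a host whose last Puppet run is more than 10 hours old. A minimal sketch of that staleness comparison follows; the 10-hour threshold and message text come from the alerts, while how the last-run timestamp is obtained is assumed for illustration (the real check is fed by Puppet report data).

```python
# Minimal sketch of a "Puppet freshness"-style check: compare the age of the
# last Puppet run against a 10-hour threshold. The threshold is the one quoted
# in the alerts above; where the timestamp comes from is an assumption.
import time

TEN_HOURS = 10 * 3600

def puppet_freshness(last_run_epoch, max_age=TEN_HOURS, now=None):
    """Return (status, message) given the epoch time of the last Puppet run."""
    now = time.time() if now is None else now
    age = now - last_run_epoch
    if age > max_age:
        return "CRITICAL", f"Puppet has not run in the last {max_age // 3600} hours"
    return "OK", f"puppet ran {age / 3600:.1f} hours ago"

# Example: a host whose last run was 11 hours ago trips the alert.
print(puppet_freshness(time.time() - 11 * 3600))
```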
[15:43:53] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (614061), Total (620125)
[15:53:11] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds
[15:53:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds
[16:06:24] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[16:06:59] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[16:07:35] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[16:07:44] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[16:13:29] New patchset: Krinkle; "checkoutMediaWiki: Remove redundant argument to git-submodule." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50608
[16:13:32] New review: Krinkle; "(1 comment)" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/14864
[16:13:46] New review: Krinkle; "(1 comment)" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/14864
[16:34:35] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:36:41] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:36:50] RECOVERY - MySQL disk space on neon is OK: DISK OK
[16:37:35] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[16:38:56] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:51:05] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.94 ms
[16:52:36] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:54:24] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[16:56:30] New patchset: Krinkle; "Remove dead Aliases "for 1.17 wikis" from bits conf." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50609
[16:57:00] New review: Krinkle; "These directories no longer exist in live-1.5." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50609
[17:00:32] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[17:01:17] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[17:08:38] RECOVERY - Puppet freshness on mw1134 is OK: puppet ran at Sun Feb 24 17:08:21 UTC 2013
[17:27:37] New patchset: Krinkle; "Clean up old multiversion readme." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50611
[17:30:14] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[17:31:22] New patchset: Krinkle; "Clean up common/README." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50612
[17:31:44] RECOVERY - MySQL disk space on neon is OK: DISK OK
[17:33:10] New patchset: Krinkle; "Clean up common/README." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50612
[17:33:32] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Sun Feb 24 17:33:28 UTC 2013
[17:37:08] RECOVERY - Puppet freshness on mw64 is OK: puppet ran at Sun Feb 24 17:36:53 UTC 2013
[17:37:35] PROBLEM - SSH on lvs1002 is CRITICAL: Server answer:
[17:38:02] RECOVERY - Puppet freshness on mw1039 is OK: puppet ran at Sun Feb 24 17:37:56 UTC 2013
[17:41:11] RECOVERY - SSH on lvs1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[17:51:51] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:53:29] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:02:23] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours
[18:02:23] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 10 hours
[18:05:53] * jeremyb_ sees some unresolved worrisome alerts. e.g. varnish bits. but idk where bits is served from these days
[18:06:03] and lvs1002/cp3003 were both bouncing
[18:07:20] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 9.057 seconds
[18:07:56] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:25:38] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:26:14] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:30:26] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours
[18:31:56] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:32:23] PROBLEM - Puppet freshness on mw1059 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:26] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:26] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:26] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours
[18:33:27] PROBLEM - Puppet freshness on mc1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:27] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:27] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours
[18:35:23] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours
[18:35:23] PROBLEM - Puppet freshness on mw1157 is CRITICAL: Puppet has not run in the last 10 hours
[18:35:23] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours
[18:37:20] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours
[18:37:38] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:40:47] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 184 seconds
[18:40:56] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 189 seconds
[18:41:05] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 9.125 seconds
[18:41:14] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:47:05] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:50:59] PROBLEM - SSH on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:51:53] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 8 seconds
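The flapping "Varnish HTTP bits" checks on niobium, arsenic and strontium above are HTTP probes: fetch a page from the bits cache, expect a 200 within a timeout, and report body size and elapsed time. A hedged sketch of that kind of probe using only the standard library is below; the URL and timeout are illustrative placeholders, not the real Nagios service definition.

```python
# Hedged sketch of an HTTP probe in the spirit of the "Varnish HTTP bits"
# checks above: expect HTTP 200 within a timeout and report body size and
# elapsed time. URL and timeout are illustrative placeholders.
import time
import urllib.request

def probe_http(url, timeout=10.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            elapsed = time.monotonic() - start
            if resp.status != 200:
                return "CRITICAL", f"HTTP {resp.status}"
            return "OK", f"HTTP OK HTTP/1.1 200 OK - {len(body)} bytes in {elapsed:.3f} seconds"
    except OSError as exc:
        return "CRITICAL", f"No data received from host ({exc})"

if __name__ == "__main__":
    print(*probe_http("http://127.0.0.1/"))  # placeholder URL
```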
[18:52:30] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[19:02:49] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[19:04:10] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[19:08:05] wooo, wikidata's job queue rollercoaster :)
[19:10:19] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 1 seconds
[19:10:28] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[19:15:52] PROBLEM - NTP on strontium is CRITICAL: NTP CRITICAL: No response from NTP server
[19:28:23] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant find record in page_restrictions on query. Default da
[19:35:39] hrm, no Tim-away/binasher/domas
[19:35:47] (for the repl error above)
[19:48:43] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (702571), plwiki (47506), Total (751563)
[19:48:54] jeremyb_, probably not deadly urgent as it's a pmtpa slave
[19:49:28] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (699362), plwiki (45112), Total (745335)
[19:49:44] MaxSem: yeah, i never know what's important there. at least toolserver is more often slaving off an arbitrary slave than off an intermediate master?
[19:50:32] and last time we had a replication error like that it was 3 boxen at once not just the one
[19:50:40] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 202 seconds
[19:51:13] (i doubt it's an emergency but i think it's at least worth looking into sometime today)
[19:51:16] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 220 seconds
[20:20:17] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[20:20:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds
[20:20:44] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 2 seconds
[20:20:53] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 198 seconds
[20:23:53] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[20:24:29] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[20:35:53] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[21:08:44] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds
[21:12:56] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:14:35] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[21:18:57] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[21:26:36] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[21:28:24] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[21:52:30] New patchset: Legoktm; "(bug 45079) Add P: as alias for Property namespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50709
[21:57:12] RECOVERY - MySQL disk space on neon is OK: DISK OK
[21:57:12] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
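The recurring replication-delay alerts and the broken SQL thread on db35 above (Slave_IO_Running: Yes, Slave_SQL_Running: No, "Cant find record in page_restrictions") would usually be triaged by reading SHOW SLAVE STATUS on the affected replica. A hedged sketch of that first, read-only step follows, using PyMySQL; the host and credentials are placeholders, and this is a generic diagnostic, not the team's actual runbook or any attempt at a fix.

```python
# Hedged sketch of first-pass triage for an alert like the db35 one above:
# read SHOW SLAVE STATUS and report thread states, lag and the last error.
# Host and credentials are placeholders; this only inspects, it changes nothing.
import pymysql

def slave_status(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()
    if row is None:
        return "not configured as a replica"
    return (f"IO={row['Slave_IO_Running']} SQL={row['Slave_SQL_Running']} "
            f"lag={row['Seconds_Behind_Master']} last_error={row['Last_Error'] or 'none'}")

if __name__ == "__main__":
    print(slave_status("db35.example", "monitor", "secret"))  # placeholder credentials
```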
[22:10:06] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[22:22:05] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.53 ms
[22:28:05] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[22:34:50] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[22:35:45] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[22:57:38] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 210 seconds
[22:58:06] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 218 seconds
[23:06:20] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[23:07:14] RECOVERY - MySQL disk space on neon is OK: DISK OK
[23:51:49] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: HTTP CRITICAL - No data received from host
[23:51:58] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds