[00:38:24] RECOVERY - Lucene on search6 is OK: TCP OK - 9.005 second response time on port 8123
[00:58:14] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.007 second response time on port 8123
[01:55:27] PROBLEM - DPKG on search11 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[01:59:46] PROBLEM - DPKG on search7 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[03:57:57] RECOVERY - DPKG on search7 is OK: All packages OK
[04:01:17] RECOVERY - DPKG on search11 is OK: All packages OK
[04:37:47] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[05:44:46] RECOVERY - Disk space on search6 is OK: DISK OK
[06:38:17] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:02:37] PROBLEM - Disk space on search6 is CRITICAL: DISK CRITICAL - free space: /a 5206 MB (3% inode=99%):
[07:35:47] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[07:52:55] LVS ^^^
[08:01:08] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out
[08:04:08] PROBLEM - Lucene on search4 is CRITICAL: Connection timed out
[08:04:59] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out
[08:05:59] PROBLEM - Lucene on search1 is CRITICAL: Connection timed out
[08:13:58] PROBLEM - Disk space on search6 is CRITICAL: DISK CRITICAL - free space: /a 5121 MB (3% inode=99%):
[08:35:38] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 8.994 second response time on port 8123
[09:08:19] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[09:35:09] PROBLEM - Disk space on search6 is CRITICAL: DISK CRITICAL - free space: /a 5056 MB (3% inode=99%):
[09:38:41] !log removed log.1 from /a/search/logs on search6, it was 35gb
[09:38:42] Logged the message, Master
[09:41:59] RECOVERY - Lucene on search3 is OK: TCP OK - 0.002 second response time on port 8123
[09:45:24] RECOVERY - Disk space on search6 is OK: DISK OK
[09:48:04] RECOVERY - Lucene on search4 is OK: TCP OK - 0.001 second response time on port 8123
[09:48:04] RECOVERY - Lucene on search9 is OK: TCP OK - 0.003 second response time on port 8123
[09:49:17] !log restarted lucene search on srch 10, 11, then later on 3,4,9,1
[09:49:19] Logged the message, Master
[09:49:44] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[09:49:45] RECOVERY - Lucene on search1 is OK: TCP OK - 0.000 second response time on port 8123
[09:51:44] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 453075 MB (3% inode=99%):
[09:52:44] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 448042 MB (3% inode=99%):
[10:00:34] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:31:14] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:08:23] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[13:27:53] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[15:48:58] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[16:01:38] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 7.887 seconds response time. www.wikipedia.org returns 208.80.152.201
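
(Editor's note: the Lucene and LVS alerts above come from a simple TCP service check against port 8123 that reports response time on success and CRITICAL on a timeout. The sketch below illustrates that kind of check; it is not the actual Nagios plugin, and the host list and 10-second timeout are illustrative assumptions.)

```python
#!/usr/bin/env python
# Minimal sketch of a TCP port check like the Lucene/LVS alerts above:
# connect to port 8123, report response time, or report CRITICAL on timeout.
# Host names and the 10-second timeout are assumptions for illustration.
import socket
import sys
import time

HOSTS = ["search1", "search3", "search4", "search6", "search9"]  # hypothetical
PORT = 8123
TIMEOUT = 10.0  # seconds; assumed threshold

def check_tcp(host, port):
    start = time.time()
    try:
        sock = socket.create_connection((host, port), timeout=TIMEOUT)
        sock.close()
        return 0, "TCP OK - %.3f second response time on port %d" % (time.time() - start, port)
    except (socket.timeout, OSError):
        return 2, "CRITICAL: Connection timed out"

if __name__ == "__main__":
    worst = 0
    for host in HOSTS:
        code, message = check_tcp(host, PORT)
        print("%s: %s" % (host, message))
        worst = max(worst, code)
    sys.exit(worst)  # Nagios-style exit codes: 0 = OK, 2 = CRITICAL
```
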
[16:53:01] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[20:41:18] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[20:42:03] Ryan_Lane, want to give ^ a kick?
[20:42:21] what happened that that?
[20:44:01] guessing the service died
[20:45:09] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[20:49:33] hmm
[20:49:42] I restarted the process, but it isn't reporting OK
[20:50:08] though I'm getting responses from it
[20:50:11] maybe nagios is fucked up
[20:51:01] Well, that wouldn't be a first
[20:58:44] indeed
[21:06:46] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 9.243 seconds response time. www.wikipedia.org returns 208.80.152.201
[21:22:23] i'm late to this party
[21:24:57] PROBLEM - Recursive DNS on 208.80.152.131 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[21:30:09] !log restarted pdns on ns2 about an hour ago
[21:30:13] Logged the message, Master
[21:31:18] !log restarted pdns-recursor on dobson
[21:31:19] Logged the message, Master
[22:20:47] PROBLEM - Recursive DNS on 91.198.174.6 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[22:27:27] RECOVERY - Recursive DNS on 208.80.152.131 is OK: DNS OK: 7.516 seconds response time. www.wikipedia.org returns 208.80.152.201
[22:32:58] RECOVERY - Recursive DNS on 91.198.174.6 is OK: DNS OK: 8.879 seconds response time. www.wikipedia.org returns 91.198.174.225
[23:03:06] PROBLEM - check_job_queue on spence is CRITICAL: (Service Check Timed Out)
[23:03:47] today's a bad day for nagios... at least it's not a flood
[23:19:19] PROBLEM - Recursive DNS on 91.198.174.6 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[23:21:32] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[23:27:10] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[23:37:42] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[23:38:03] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[23:46:19] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[23:52:02] PROBLEM - Recursive DNS on 208.80.152.131 is CRITICAL: CRITICAL - Plugin timed out while executing system call
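
(Editor's note: the Auth DNS and Recursive DNS alerts above boil down to querying a specific server for www.wikipedia.org, comparing the answer to an expected address, and flagging CRITICAL when the query times out. A minimal sketch of that kind of check follows; it assumes the third-party dnspython package and a 10-second query lifetime, and the server/expected-IP pairs simply mirror the recoveries logged above. It is not the plugin Nagios actually ran.)

```python
#!/usr/bin/env python
# Minimal sketch of a DNS check like the Recursive DNS alerts above:
# query a specific server for www.wikipedia.org, compare to the expected
# address, and treat a timeout as CRITICAL. Requires dnspython; the
# 10-second lifetime is an assumed value.
import sys
import time

import dns.exception
import dns.resolver

CHECKS = [
    # (server to query, name, expected answer) -- taken from the log above
    ("208.80.152.131", "www.wikipedia.org", "208.80.152.201"),
    ("91.198.174.6", "www.wikipedia.org", "91.198.174.225"),
]

def check_dns(server, name, expected):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 10.0  # assumed timeout
    start = time.time()
    try:
        answer = resolver.resolve(name, "A")
    except dns.exception.Timeout:
        return 2, "CRITICAL - Plugin timed out"
    addresses = [rr.address for rr in answer]
    if expected in addresses:
        return 0, "DNS OK: %.3f seconds response time. %s returns %s" % (
            time.time() - start, name, expected)
    return 2, "DNS CRITICAL: %s returned %s, expected %s" % (name, addresses, expected)

if __name__ == "__main__":
    worst = 0
    for server, name, expected in CHECKS:
        code, message = check_dns(server, name, expected)
        print("%s: %s" % (server, message))
        worst = max(worst, code)
    sys.exit(worst)  # Nagios-style exit codes: 0 = OK, 2 = CRITICAL
```
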