[00:03:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [00:03:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [00:03:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [00:08:02] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 00:07:55 UTC 2013 [00:08:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:09:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 00:09:02 UTC 2013 [00:09:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:10:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 00:10:05 UTC 2013 [00:11:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:52] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 00:11:49 UTC 2013 [00:12:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:12:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 00:12:34 UTC 2013 [00:13:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:13:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 00:13:09 UTC 2013 [00:14:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:17:02] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 00:17:00 UTC 2013 [00:17:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:18:52] PROBLEM - Puppet freshness on db1017 is CRITICAL: No successful Puppet run in the last 10 hours [00:38:17] New review: MZMcBride; "Nice." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64493 [01:32:04] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.001851201057 secs [01:32:34] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.0008449554443 secs [02:01:20] !log LocalisationUpdate completed (1.22wmf4) at Sun May 19 02:01:19 UTC 2013 [02:01:30] Logged the message, Master [02:01:54] !log LocalisationUpdate completed (1.22wmf3) at Sun May 19 02:01:54 UTC 2013 [02:02:04] Logged the message, Master [02:06:55] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun May 19 02:06:55 UTC 2013 [02:07:04] Logged the message, Master [03:27:00] PROBLEM - mysqld processes on db1054 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [03:28:00] RECOVERY - mysqld processes on db1054 is OK: PROCS OK: 3 processes with command name mysqld [03:32:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [04:02:19] New patchset: Tim Landscheidt; "Add Ganglia statistics to grid engine." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64511 [04:03:09] New review: Tim Landscheidt; "Needs further review by Coren after the Amsterdam hackathon." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/64511 [04:08:03] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 04:07:57 UTC 2013 [04:08:53] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:09:13] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 04:09:06 UTC 2013 [04:10:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:15:04] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [04:15:04] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [04:40:54] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [04:57:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:58:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [05:44:27] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:56:27] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [06:04:47] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [06:08:25] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:09:15] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [06:11:55] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [06:28:55] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 06:28:45 UTC 2013 [06:29:05] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:29:55] RECOVERY - Puppet freshness on mc15 is OK: puppet ran at Sun May 19 06:29:51 UTC 2013 [06:30:15] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 06:30:12 UTC 2013 [06:31:05] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:31:35] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 06:31:32 UTC 2013 [06:32:05] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:32:55] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 06:32:45 UTC 2013 [06:33:05] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:33:55] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 06:33:51 UTC 2013 [06:34:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:57:15] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [07:27:22] PROBLEM - search indices - check lucene status page on search1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 327 bytes in 0.005 second response time [08:01:22] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.00177025795 secs [08:08:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:08:49 UTC 2013 [08:09:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [08:10:13] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:10:04 UTC 2013 [08:10:33] PROBLEM - Puppet freshness 
on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [08:11:23] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:11:14 UTC 2013 [08:11:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [08:12:23] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:12:16 UTC 2013 [08:12:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [08:13:23] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:13:13 UTC 2013 [08:13:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [08:14:03] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:14:01 UTC 2013 [08:14:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [08:14:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:14:45 UTC 2013 [08:15:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [08:15:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:15:52 UTC 2013 [08:16:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [08:30:03] PROBLEM - Puppet freshness on colby is CRITICAL: No successful Puppet run in the last 10 hours [08:32:33] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.001383662224 secs [08:45:13] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 08:45:03 UTC 2013 [08:45:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [10:03:48] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [10:03:48] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [10:03:48] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [10:19:10] PROBLEM - 
Puppet freshness on db1017 is CRITICAL: No successful Puppet run in the last 10 hours [12:08:00] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 12:07:55 UTC 2013 [12:08:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:00] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 12:08:59 UTC 2013 [12:09:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:10:00] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 12:09:57 UTC 2013 [12:10:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:10:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 12:10:47 UTC 2013 [12:11:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:12:10] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 12:12:06 UTC 2013 [12:12:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:14:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 12:14:44 UTC 2013 [12:15:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:21:10] PROBLEM - SSH on stat1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:22:00] RECOVERY - SSH on stat1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:16:01] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [14:16:01] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [14:35:07] @notify binasher [14:35:07] I'll let you know when I see binasher around here [14:40:55] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [16:06:02] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [16:07:05] 
New patchset: Alex Monk; "Remove wgArticleRobotPolicies" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64526 [16:08:02] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 16:07:55 UTC 2013 [16:08:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:09:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 16:09:05 UTC 2013 [16:09:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:10:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 16:10:09 UTC 2013 [16:11:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:11:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 16:11:07 UTC 2013 [16:12:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:12:52] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 16:12:43 UTC 2013 [16:13:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:13:22] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 16:13:15 UTC 2013 [16:14:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:14:52] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 16:14:43 UTC 2013 [16:15:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:57:20] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [17:03:50] RECOVERY - mysqld processes on labsdb1002 is OK: PROCS OK: 3 processes with command name mysqld [17:44:39] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:19] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [18:13:53] PROBLEM - Disk space on ms-be1006 is CRITICAL: DISK CRITICAL - 
/var/lib/ceph/osd/ceph-67 is not accessible: Input/output error [18:14:23] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [18:14:33] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:43] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:03] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:23] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:34] looking [18:16:03] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:13] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.046 second response time [18:16:13] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.046 second response time [18:16:23] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63542 bytes in 0.184 second response time [18:16:33] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [18:16:53] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [18:16:53] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.044 second response time [18:17:53] RECOVERY - Disk space on ms-be1006 is OK: DISK OK [18:19:12] what's going on? 
[18:19:34] a disk failed [18:19:47] and ceph wasn't too happy about that [18:19:52] urgh [18:20:29] I hope future versions will be a little more flexible about such things [18:21:13] it's a thing i'm debugging with them [18:21:18] but there's also a secondary problem here [18:21:27] the disk failed and its neighbouring disks were also stuck for a while [18:22:34] 67 failed, but 60, 62, 64, 65, 66, 69 and 70 were marked as down [18:23:01] ouch [18:23:10] smells like a controller thing [18:23:20] this is with the H710s? [18:23:22] yes [18:23:29] we've seen this before [18:23:33] kind of [18:23:47] I'm gonna be sad if they are broken in some way we can't work around [18:23:56] a disk had failed and a mere run of megacli made a whole box's i/o stuck for a few seconds [18:24:42] *sigh* [18:25:28] yeah tell me about it [18:26:52] well as it's stable for now I will do something about dinner, like cook some [18:26:58] hehe [18:27:06] yeah, no worries [18:27:23] the good thing about ceph is that when disks fail, it reacts automatically [18:27:48] so now it's reallocating the replicas osd.67 had elsewhere in the cluster [18:28:36] you know swift does that too: it starts shuffling around partitions as long as it's not the entire node that went out, just individual disks [18:28:46] I know [18:29:06] the rings still say bs though :) [18:29:17] yep, they do [18:29:26] ok, off to the kitchen! [18:29:38] bon appetit [18:30:43] PROBLEM - Puppet freshness on colby is CRITICAL: No successful Puppet run in the last 10 hours [19:23:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:29:00] LeslieCarr: You around-ish? 
[20:04:07] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [20:04:07] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [20:04:07] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [20:07:57] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 20:07:56 UTC 2013 [20:08:27] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:09:17] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 20:09:12 UTC 2013 [20:09:27] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:10:07] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 20:10:06 UTC 2013 [20:10:27] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:11:07] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 20:11:02 UTC 2013 [20:11:27] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:11:57] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 20:11:51 UTC 2013 [20:12:27] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:12:37] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 20:12:33 UTC 2013 [20:13:27] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:57] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 20:14:50 UTC 2013 [20:15:27] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:20:07] PROBLEM - Puppet freshness on db1017 is CRITICAL: No successful Puppet run in the last 10 hours [20:45:00] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Sun May 19 20:44:59 UTC 2013 [20:45:20] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful 
Puppet run in the last 10 hours [21:01:11] New patchset: Odder; "(bug 48620) Enable Translate extension on Wikimedia Commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64539 [21:11:49] New review: Nemo bis; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64539 [21:13:08] heh. [23:34:21] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [23:39:42] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server