[00:40:35] PROBLEM - Memory using more than expected on kaulen is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing)
[01:00:50] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[01:15:14] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[01:42:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 287 seconds
[01:45:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[01:51:35] New patchset: Bhartshorne; "correcting path to memory check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8994
[01:51:56] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8994
[01:51:56] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8994
[02:12:08] New patchset: Bhartshorne; "bumping the threshold based on normal usage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8995
[02:12:27] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8995
[02:12:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8995
[03:10:27] PROBLEM - Host mw1091 is DOWN: PING CRITICAL - Packet loss = 100%
[05:05:51] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:05:51] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[05:05:51] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[05:05:51] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[05:47:01] !log Bugzilla on Kaulen being super slow again
[05:47:07] Logged the message, Master
[05:48:54] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[05:50:15] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:51:54] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:53:19] !log kaulen dead :-[
[05:53:22] Logged the message, Master
[05:57:45] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 6.173 seconds
[06:02:36] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:05:27] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.011 seconds
[06:09:57] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:11:18] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 2.435 seconds
[06:14:54] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[06:20:00] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[06:24:39] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:29:18] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[06:33:57] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:44:09] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:44:45] I got a login prompt but not sure it's going to let me onto kaulen
[06:44:52] via mgmt
[06:45:27] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.073 seconds
[06:45:54] that just sounds like wishful thinking there nagios..
[06:46:05] !log powercycling kaulen
[06:46:09] Logged the message, Master
[06:46:20] lots and lots of wait cpu
[06:46:26] when it won't give me a password prompt and then it times out the login after 60 secs, time to hit the big red button
[06:46:37] yep, same as last time
[06:47:06] PROBLEM - Host kaulen is DOWN: CRITICAL - Host Unreachable (208.80.152.149)
[06:47:07] the first time I was lucky enough to get on the box, the only thing I could see of interest was the seg faults in the log
[06:47:33] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[06:47:42] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[06:48:54] There must be some sort of pattern for this, we seem to get it for a few days, then it goes away and comes back a month or 2 later
[06:49:18] what's irritating is I checked a little earlier this morning and everything was fine
[06:49:25] as soon as I went to do something else, it went out to lunch
[06:50:12] and speaking of doing something else, I'm off, have errands to do
[06:50:19] see folks later
[08:01:51] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:03] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:13:42] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.475 seconds
[08:19:33] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:21:41] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 8.925 seconds
[08:22:35] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[08:25:44] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[08:29:11] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:30:14] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:34:26] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[08:35:02] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.745 seconds
[08:40:53] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[08:40:53] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[08:47:56] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[08:50:47] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:50:56] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[08:50:56] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[08:52:44] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:54:05] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 5.576 seconds
[08:56:56] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[08:56:56] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[08:56:56] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[09:00:41] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:08:56] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[09:11:56] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:12:32] !log Bugzilla is down, Kaulen looks to be in swap death again
[09:12:36] Logged the message, Master
[09:32:19] It's not swap threashing it's swap death!
[09:39:21] thrashing
[09:39:31] it's not any less dead
[09:49:23] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:35] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[09:51:02] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:51:20] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[09:52:23] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 9.968 seconds
[09:55:05] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:01:59] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[10:02:26] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:06:29] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:07:59] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 3.511 seconds
[10:09:29] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[10:12:26] Swap death?
[10:13:50] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:13:50] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:18:54] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.756 seconds
[10:29:51] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:31:12] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.698 seconds
[10:31:48] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[10:35:42] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:36:18] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:41:33] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.406 seconds
[10:42:18] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[10:46:12] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:46:48] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:47:33] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.004 seconds
[10:52:03] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:53:24] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 2.158 seconds
[11:05:50] !log stopping and restarting apache on kaulen blah blah blah
[11:05:54] Logged the message, Master
[11:07:06] expect this to take a few minutes
[11:11:33] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[11:13:12] blah blah blah sounds like a very descriptive log record ;-)
[11:15:59] well it's the same problem for the last two days so I'm bored of repeating it
[11:16:22] in that sense it's a very descriptive log record: it describes my state of mind exactly :-P
[11:16:34] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:33:04] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:33:26] come on little server
[11:33:28] you can do it
[11:45:04] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 2.960 seconds
[11:45:54] still waiting...
[11:52:07] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[11:53:37] PROBLEM - Apache HTTP on kaulen is CRITICAL: Connection refused
[12:00:47] figures, I look elsewhere for two minutes and it finishes up
[12:01:08] service should now be back to normal
[12:01:12] for awhile...
[12:02:01] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.005 seconds
[13:06:58] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
[13:21:41] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused
[13:34:26] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 4899 bytes in 0.004 seconds
[15:07:05] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:07:05] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:07:05] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[15:07:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[15:21:40] New patchset: Hashar; "split filebackend conf out of CommonSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8914
[15:21:47] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/8914
[15:22:07] New review: Hashar; "Patchset 2 move the" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8914
[15:49:43] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[16:16:06] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[18:42:10] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[18:49:04] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[18:52:04] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[18:52:04] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[18:58:04] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[18:58:04] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[18:58:04] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[19:10:04] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[19:27:28] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:29:12] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[19:31:15] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8914
[19:31:18] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8914
[19:32:48] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:33:35] apergos needed again to fix bugzilla ^^'
[19:34:39] !log bugzilla down again
[19:34:43] Logged the message, Master
[19:36:03] ah, that's why it doesn't respond...
[19:36:06] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:47] that will be another "kaulen blah blah blah" log msg... ;-)
[19:37:52] * Nemo_bis hopes so but saturday came to Greece too
[19:38:48] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 2.180 seconds
[19:44:39] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:48:51] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.654 seconds
[19:50:39] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[19:52:36] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[19:54:58] !log restarting apache2 on kaulen
[19:55:02] Logged the message, Master
[19:55:15] expect it to take a little. also I am not really here, I just got back inside, it poured buckets on us
[19:55:26] so I am soaked *and* cold now
[19:55:28] also hungry.
[19:58:09] as before, it is liable to take a few minutes for apache to complete its shutdown
[20:00:24] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:04:27] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.536 seconds
[20:06:43] grrr
[20:06:45] it better not be
[20:06:51] I need it to die die die
[20:08:57] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:11:39] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.493 seconds
[20:16:54] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[20:21:24] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:21:51] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:45:14] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 3.512 seconds
[20:49:44] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:00:52] maybe I'll give up and power cycle it.
[21:00:54] *sigh*
[21:03:59] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[21:03:59] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.002 seconds
[21:04:23] this is fun...
[21:04:35] that's it from me for today
[21:04:48] nacht
[21:04:48] if it's out again someone else will have to kick it (it's midnight here)
[22:07:04] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:08:25] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:18:37] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.983 seconds
[22:23:07] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:25:49] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.871 seconds
[22:30:19] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:31:40] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.702 seconds
[22:33:28] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:43:31] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:54:53] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:08:14] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
[23:30:26] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.119 seconds
[23:34:47] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:40:25] Somebody doesn't like kaulen :(
[23:55:02] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 7.305 seconds
[23:59:23] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds