[00:05:16] <gerrit-wm>	 New patchset: Ottomata; "Hey!  gerrit-stats cronjob!  SHHhhhhh!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23384
[00:06:24] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23384
[00:26:15] <gerrit-wm>	 New patchset: Pyoungmeister; "add eqiad mw boxes to site.pp and removed nfs::upload from applicationserver.pp role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23386
[00:27:06] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23386
[00:32:25] <gerrit-wm>	 Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23384
[00:41:19] <gerrit-wm>	 New patchset: Faidon; "Remove ms7/ms8 from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23388
[00:42:12] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23388
[00:42:30] <gerrit-wm>	 Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23388
[00:52:16] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[00:52:16] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[00:52:16] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[00:58:16] <nagios-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[01:01:07] <nagios-wm>	 RECOVERY - Puppet freshness on ms-be1007 is OK: puppet ran at Tue Sep 11 01:00:57 UTC 2012
[01:01:34] <nagios-wm>	 RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Tue Sep 11 01:01:09 UTC 2012
[01:01:34] <nagios-wm>	 RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Tue Sep 11 01:01:16 UTC 2012
[01:01:34] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Tue Sep 11 01:01:20 UTC 2012
[01:02:35] <paravoid>	 !log adjusting spence's firewall to include all eqiad subnets
[01:02:44] <morebots>	 Logged the message, Master
[01:05:10] <nagios-wm>	 RECOVERY - Puppet freshness on es1010 is OK: puppet ran at Tue Sep 11 01:04:37 UTC 2012
[01:05:37] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1022 is OK: puppet ran at Tue Sep 11 01:05:18 UTC 2012
[01:05:37] <nagios-wm>	 RECOVERY - Puppet freshness on es1009 is OK: puppet ran at Tue Sep 11 01:05:29 UTC 2012
[01:05:37] <nagios-wm>	 RECOVERY - Puppet freshness on ms-be1009 is OK: puppet ran at Tue Sep 11 01:05:29 UTC 2012
[01:05:46] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1025 is OK: puppet ran at Tue Sep 11 01:05:34 UTC 2012
[01:07:36] <gerrit-wm>	 New patchset: Aaron Schulz; "Removed code to hide the ETag." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23392
[01:08:28] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23392
[01:22:43] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:23:28] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1015 is OK: puppet ran at Tue Sep 11 01:23:11 UTC 2012
[01:24:04] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1023 is OK: puppet ran at Tue Sep 11 01:23:36 UTC 2012
[01:24:04] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1017 is OK: puppet ran at Tue Sep 11 01:24:02 UTC 2012
[01:24:40] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1013 is OK: puppet ran at Tue Sep 11 01:24:17 UTC 2012
[01:24:58] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1014 is OK: puppet ran at Tue Sep 11 01:24:47 UTC 2012
[01:27:04] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1011 is OK: puppet ran at Tue Sep 11 01:26:44 UTC 2012
[01:27:04] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1021 is OK: puppet ran at Tue Sep 11 01:26:53 UTC 2012
[01:27:40] <nagios-wm>	 RECOVERY - Puppet freshness on ms-be1006 is OK: puppet ran at Tue Sep 11 01:27:28 UTC 2012
[01:27:40] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1012 is OK: puppet ran at Tue Sep 11 01:27:33 UTC 2012
[01:27:58] <nagios-wm>	 RECOVERY - Puppet freshness on ms-be1005 is OK: puppet ran at Tue Sep 11 01:27:43 UTC 2012
[01:29:37] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1020 is OK: puppet ran at Tue Sep 11 01:29:22 UTC 2012
[01:29:46] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1019 is OK: puppet ran at Tue Sep 11 01:29:34 UTC 2012
[01:30:22] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Tue Sep 11 01:30:09 UTC 2012
[01:31:07] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1016 is OK: puppet ran at Tue Sep 11 01:31:01 UTC 2012
[01:31:34] <nagios-wm>	 RECOVERY - Puppet freshness on es1008 is OK: puppet ran at Tue Sep 11 01:31:29 UTC 2012
[01:31:53] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1026 is OK: puppet ran at Tue Sep 11 01:31:37 UTC 2012
[01:31:53] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.681 seconds
[01:32:10] <nagios-wm>	 RECOVERY - Puppet freshness on analytics1018 is OK: puppet ran at Tue Sep 11 01:31:59 UTC 2012
[01:35:10] <nagios-wm>	 RECOVERY - Puppet freshness on es1007 is OK: puppet ran at Tue Sep 11 01:34:52 UTC 2012
[01:42:53] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 332 seconds
[01:43:11] <nagios-wm>	 PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 318 seconds
[01:44:23] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds
[01:46:11] <nagios-wm>	 RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 28 seconds
[02:08:23] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:20:59] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.113 seconds
[02:25:02] <nagios-wm>	 PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[02:25:02] <nagios-wm>	 PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[02:30:26] <gerrit-wm>	 New patchset: Krinkle; "misc deployment scripts: Minor clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22858
[02:31:19] <gerrit-wm>	 New review: Krinkle; "Fix syntax error." [operations/puppet] (production) C: 0;  - https://gerrit.wikimedia.org/r/22858
[02:31:19] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22858
[03:26:57] <nagios-wm>	 PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[03:34:27] <nagios-wm>	 PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log,  have not been written in a critical amount of time.  For most logs, this is 4 hours.  For slow logs, this is 4 days.
[04:01:36] <nagios-wm>	 RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active
[04:44:26] <nagios-wm>	 PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:11] <nagios-wm>	 RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[04:49:41] <nagios-wm>	 PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[04:58:41] <nagios-wm>	 RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time
[06:33:17] <nagios-wm>	 PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log,  have not been written in a critical amount of time.  For most logs, this is 4 hours.  For slow logs, this is 4 days.
[06:45:29] <nagios-wm>	 RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active
[06:54:47] <nagios-wm>	 PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[06:54:47] <nagios-wm>	 PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[06:54:47] <nagios-wm>	 PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[06:54:47] <nagios-wm>	 PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[06:54:47] <nagios-wm>	 PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[06:54:48] <nagios-wm>	 PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:49:37] <nagios-wm>	 PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[09:49:44] <nagios-wm>	 PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log,  have not been written in a critical amount of time.  For most logs, this is 4 hours.  For slow logs, this is 4 days.
[10:31:35] <nagios-wm>	 RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active
[10:44:38] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:46:35] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds
[10:59:29] <nagios-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[11:11:02] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:12:23] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.067 seconds
[11:20:47] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:23:47] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.764 seconds
[11:59:03] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:07:00] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[12:26:12] <nagios-wm>	 PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[12:26:12] <nagios-wm>	 PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[12:40:45] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:53:21] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.102 seconds
[13:27:47] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:14] <nagios-wm>	 PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[13:33:29] <nagios-wm>	 PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[13:33:29] <nagios-wm>	 PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[13:33:29] <nagios-wm>	 PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[13:33:38] <nagios-wm>	 PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[13:33:38] <nagios-wm>	 PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[13:33:38] <nagios-wm>	 PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[13:33:56] <nagios-wm>	 PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[13:34:23] <nagios-wm>	 PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[13:34:32] <nagios-wm>	 PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[13:34:32] <nagios-wm>	 PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[13:34:36] <jeremyb>	 nagios-wm: quiet
[13:34:41] <nagios-wm>	 PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[13:34:50] <nagios-wm>	 PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[13:35:53] <Damianz>	 jeremyb: nagios-wm is like a person, as much as you want them to stfu no matter how many hints you give them they just keep talking :P
[13:37:44] <jeremyb>	 Damianz: i just want it to know better when something's intentionally "CRITICAL"
[13:38:00] <apergos>	 ms-be6 is one of the dead backends,
[13:38:13] <apergos>	 these processes were actually shot yesterday but I had to shoot them again today
[13:38:15] <jeremyb>	 for some reason it didn't alert until now though
[13:38:17] <apergos>	 I bet puppet restarts them
[13:38:20] <jeremyb>	 it's been like 15 hrs
[13:38:28] <apergos>	 well I just now shot them again
[13:38:36] <apergos>	 so that's why it just now complained about them
[13:38:36] <jeremyb>	 oh
[13:39:07] <jeremyb>	 you could always stop puppet ;) (but then disable it too in /etc/default/puppet)
[13:39:19] <apergos>	 yeah but ugh
[13:39:29] <apergos>	 rather not
[13:39:39] <apergos>	 then we forget to enable it later and bad things happen
[13:40:41] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.239 seconds
[13:42:02] <nagios-wm>	 RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[13:42:02] <nagios-wm>	 RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[13:42:11] <nagios-wm>	 RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[13:42:11] <nagios-wm>	 RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[13:42:20] <nagios-wm>	 RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[13:42:38] <nagios-wm>	 RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[13:42:38] <nagios-wm>	 RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[13:42:38] <nagios-wm>	 RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[13:42:38] <nagios-wm>	 RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[13:42:47] <nagios-wm>	 RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[13:42:56] <nagios-wm>	 RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[13:43:05] <nagios-wm>	 RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[13:44:26] <apergos>	 knew it. puppet restarts 'em
[13:44:28] <apergos>	 grrr
[14:16:23] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:18:56] <apergos>	 !log stopped swift processes on ms-be6 and disabled puppet (/etc/default/puppet) so it won't restart them.  still doing testing of hw over there so we need the box up.
[14:19:05] <morebots>	 Logged the message, Master
[14:21:29] <nagios-wm>	 PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:21:38] <nagios-wm>	 PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:21:38] <nagios-wm>	 PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[14:21:47] <nagios-wm>	 PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:21:47] <nagios-wm>	 PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[14:21:56] <nagios-wm>	 PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[14:21:56] <nagios-wm>	 PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[14:22:05] <nagios-wm>	 PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[14:22:05] <nagios-wm>	 PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[14:22:14] <nagios-wm>	 PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:22:23] <nagios-wm>	 PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[14:22:23] <nagios-wm>	 PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[14:29:26] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds
[14:44:31] <apergos>	 !log manually recreated filesystems and mounted missing drives on ms-be6 (without ssds cabled), seems to have worked without errors.  rebooting to see what happens.
[14:44:40] <morebots>	 Logged the message, Master
[14:50:26] <nagios-wm>	 PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused
[14:52:04] <apergos>	 rats. still too sleepy.
[14:52:10] <apergos>	 should have done that from management console
[14:52:19] * apergos  waits for awhile anyways
[14:56:44] <nagios-wm>	 RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[14:56:44] <nagios-wm>	 RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[14:56:54] <nagios-wm>	 RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[14:56:54] <nagios-wm>	 RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[14:56:54] <nagios-wm>	 RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:57:20] <nagios-wm>	 RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[14:57:20] <nagios-wm>	 RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[14:57:20] <nagios-wm>	 RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:57:38] <nagios-wm>	 RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[14:57:38] <nagios-wm>	 RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:57:38] <nagios-wm>	 RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[14:57:56] <nagios-wm>	 RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[14:57:56] <nagios-wm>	 RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[14:58:40] <apergos>	 grrr
[14:58:58] <apergos>	 stopped them again
[15:01:41] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:02:08] <nagios-wm>	 PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[15:02:08] <nagios-wm>	 PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[15:02:08] <nagios-wm>	 PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[15:02:35] <nagios-wm>	 PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[15:02:44] <nagios-wm>	 PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[15:02:44] <nagios-wm>	 PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[15:03:02] <nagios-wm>	 PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[15:03:02] <nagios-wm>	 PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:03:10] <apergos>	 hush!
[15:03:11] <nagios-wm>	 PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[15:03:18] <apergos>	 I'll fix you later
[15:03:20] <nagios-wm>	 PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[15:03:20] <nagios-wm>	 PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[15:03:20] <nagios-wm>	 PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[15:06:21] <gerrit-wm>	 New patchset: Pyoungmeister; "quieting search result mover cron on oxygen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23417
[15:07:13] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23417
[15:09:02] <gerrit-wm>	 Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23417
[15:12:04] <Guest19048>	 apergos: ok, good. I'm glad this is a controlled thing. i was just loking at nagios and got a little nervous...
[15:12:04] <apergos>	 !log redid other ms-be6 filesystems so labels would be correct with disk layout without ssds.  reboot again to see if it comes up properly
[15:12:13] <morebots>	 Logged the message, Master
[15:14:35] <nagios-wm>	 PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:02] <nagios-wm>	 RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[15:15:02] <nagios-wm>	 RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[15:15:11] <nagios-wm>	 RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[15:15:11] <nagios-wm>	 RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:15:11] <nagios-wm>	 RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[15:15:29] <nagios-wm>	 RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[15:15:29] <nagios-wm>	 RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[15:15:29] <nagios-wm>	 RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[15:15:29] <nagios-wm>	 RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[15:15:36] * jeremyb  hands Guest19048 a /nick ;)
[15:15:42] <apergos>	 !log stopping swift processes on ms-be6 til we decide what's next; all disks came up and were mounted (but we still do have degraded raid array, not so awesome)
[15:15:51] <morebots>	 Logged the message, Master
[15:18:56] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds
[15:19:32] <nagios-wm>	 PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[15:19:32] <nagios-wm>	 PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[15:19:41] <nagios-wm>	 PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[15:19:41] <nagios-wm>	 PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:19:59] <nagios-wm>	 PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[15:19:59] <nagios-wm>	 PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[15:19:59] <nagios-wm>	 PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[15:19:59] <nagios-wm>	 PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[15:23:02] <gerrit-wm>	 New patchset: Alex Monk; "(bug 40163) Try to fix ltwiki import source for betawikiversity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23419
[15:30:47] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 190 seconds
[15:31:23] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 209 seconds
[15:32:44] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 202 seconds
[15:32:44] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 202 seconds
[15:35:44] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds
[15:35:44] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds
[15:40:39] <gerrit-wm>	 New review: Jeremyb; "This will work." [operations/mediawiki-config] (master) C: 1;  - https://gerrit.wikimedia.org/r/23419
[15:42:11] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 182 seconds
[15:43:05] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 212 seconds
[15:46:05] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[15:46:41] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds
[15:51:29] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:55:34] <gerrit-wm>	 New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606
[15:56:26] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606
[15:57:52] <apergos>	 !log rebuilding the degraded raid array on ms-be6
[15:58:01] <morebots>	 Logged the message, Master
[16:02:20] <gerrit-wm>	 New review: Jeroen De Dauw; "Does what it says" [operations/puppet] (production) C: 1;  - https://gerrit.wikimedia.org/r/23392
[16:05:34] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[16:24:37] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 201 seconds
[16:25:13] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 213 seconds
[16:27:10] <nagios-wm>	 PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 191 seconds
[16:27:55] <nagios-wm>	 PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 205 seconds
[16:37:58] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:48:37] <nagios-wm>	 PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log,  have not been written in a critical amount of time.  For most logs, this is 4 hours.  For slow logs, this is 4 days.
[16:49:04] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.056 seconds
[16:55:40] <nagios-wm>	 PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[16:55:40] <nagios-wm>	 PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[16:55:40] <nagios-wm>	 PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[16:55:40] <nagios-wm>	 PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[16:55:40] <nagios-wm>	 PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[16:55:41] <nagios-wm>	 PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[16:56:44] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[16:57:28] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[17:04:49] <nagios-wm>	 RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[17:04:58] <nagios-wm>	 RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds
[17:10:43] <apergos>	 !log one more reboot of ms-be6 after raid repaired, see if it and all mounts stay ok
[17:10:52] <morebots>	 Logged the message, Master
[17:11:21] <paravoid>	 apergos: so what happened with ms-be6?
[17:11:29] <paravoid>	 I've kinda lost track, with you doing things to it since 5am :P
[17:11:35] <apergos>	 you weren't watching the log? :-P
[17:11:46] <paravoid>	 I'm lazy, it's easier to ask
[17:11:58] <paravoid>	 the log is intermixed with a bunch of nagios alerts as well :-)
[17:12:00] <apergos>	 I'm lazy, it's easier to paste you the link:-P
[17:12:11] <apergos>	 recreated all filesystems manually
[17:12:24] <apergos>	 then mounted them all manually
[17:12:28] <apergos>	 then rebooted (so far so good)
[17:12:37] <apergos>	 then repaired the raid array of sda and sdb
[17:12:37] <nagios-wm>	 PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[17:12:44] <apergos>	 note that this is *without* the  ssds cabled up
[17:13:00] <apergos>	 now rebooting one last time to see if it keeps everything up
[17:13:13] <nagios-wm>	 RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[17:13:18] <apergos>	 if it does I'll try this same thing on a host with the ssds
[17:13:22] <nagios-wm>	 RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:13:22] <nagios-wm>	 RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[17:13:49] <nagios-wm>	 RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:13:49] <nagios-wm>	 RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[17:13:49] <nagios-wm>	 RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:14:07] <nagios-wm>	 RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[17:14:07] <nagios-wm>	 RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[17:14:07] <nagios-wm>	 RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:14:25] <nagios-wm>	 RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:14:34] <nagios-wm>	 RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:14:34] <nagios-wm>	 RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[17:14:34] <nagios-wm>	 RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:15:55] <nagios-wm>	 RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active
[17:16:35] <gerrit-wm>	 New review: Hashar; "PS7: doc for clear-profile" [operations/puppet] (production) C: 0;  - https://gerrit.wikimedia.org/r/16606
[17:17:52] <nagios-wm>	 PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:18:19] <nagios-wm>	 PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:18:19] <nagios-wm>	 PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[17:18:19] <nagios-wm>	 PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:18:37] <nagios-wm>	 PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[17:18:37] <nagios-wm>	 PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[17:18:37] <nagios-wm>	 PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:18:55] <nagios-wm>	 PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:19:04] <nagios-wm>	 PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:19:04] <nagios-wm>	 PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[17:19:04] <nagios-wm>	 PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:19:13] <nagios-wm>	 PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[17:19:46] <hashar>	 paravoid: more swift issue ^^^^^^
[17:24:01] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:25:40] <paravoid>	 hashar: ms-be6 is being worked on by apergos, there are some log entries
[17:25:45] <paravoid>	 hashar: but thanks :-)
[17:25:49] <hashar>	 :-)
[17:28:31] <nagios-wm>	 RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:29:34] <nagios-wm>	 RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[17:29:34] <nagios-wm>	 RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:29:34] <nagios-wm>	 RECOVERY - swift-container-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:29:43] <nagios-wm>	 RECOVERY - swift-object-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:29:52] <nagios-wm>	 RECOVERY - swift-container-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:29:52] <nagios-wm>	 RECOVERY - swift-account-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[17:30:01] <nagios-wm>	 RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:30:01] <nagios-wm>	 RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:31:32] <apergos>	 !log puppet disabled on ms-be7 via /etc/default/puppet, swift processes shut down while we test
[17:31:41] <morebots>	 Logged the message, Master
[17:34:04] <nagios-wm>	 PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:34:04] <nagios-wm>	 PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[17:34:04] <nagios-wm>	 PROBLEM - swift-container-auditor on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:34:13] <nagios-wm>	 PROBLEM - swift-object-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:34:22] <nagios-wm>	 PROBLEM - swift-account-auditor on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[17:34:22] <nagios-wm>	 PROBLEM - swift-container-updater on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:34:31] <nagios-wm>	 PROBLEM - swift-object-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:34:31] <nagios-wm>	 PROBLEM - swift-container-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:34:40] <nagios-wm>	 PROBLEM - swift-account-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[17:39:37] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds
[17:42:49] <apergos>	 !log rebooting ms-be7 after recreating all filesystems (except those on the ssds) and remounting them, to see if the disks stay visible
[17:42:58] <morebots>	 Logged the message, Master
[17:43:59] <nagios-wm>	 PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:02] <nagios-wm>	 RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:45:02] <nagios-wm>	 RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:45:02] <nagios-wm>	 RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:45:11] <nagios-wm>	 RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[17:45:13] <apergos>	 looky that it came up
[17:45:49] <apergos>	 I've stoppped the swift processes on the box again.  I'll leave it up I guess
[17:45:52] <cmjohnson1>	 apergos: w/out having to skip through mounts?
[17:46:03] <^demon>	 apergos: Are you shocked, shocked that it came up?
[17:46:09] <apergos>	 yes, well I recreated all the filesystems by hand
[17:46:14] <apergos>	 and then mounted manually
[17:46:36] <apergos>	 ^demon: not exactly
[17:46:42] <apergos>	 that would imply that I expected it to come up
[17:46:51] <apergos>	 but really I expect these boxes to be broken (because they are)
[17:47:18] <apergos>	 ok, I'll try rescuign one more box so we can use it to beat on, and then we'll be done for now
[17:47:19] <^demon>	 apergos: Well, then you're pleasantly surprised :)
[17:47:24] <apergos>	 yes
[17:48:14] <apergos>	 I think I'll leave ms-be8 for dell
[17:48:17] <apergos>	 it's powered off anyways
[17:48:41] <apergos>	 ms-be10 can be our playtoy (if I get it to come back up)
[17:49:41] <nagios-wm>	 PROBLEM - swift-object-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:49:41] <nagios-wm>	 PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:49:41] <nagios-wm>	 PROBLEM - swift-container-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:50:35] <apergos>	 maybe I'm leaving ms-be10 for dell instead
[17:51:06] <apergos>	 oh for cripes sake
[17:51:12] <apergos>	 root@ms-be10:~# ls -l /dev/sd*
[17:51:23] <apergos>	 has...
[17:51:31] <apergos>	 brw-rw---- 1 root disk  8, 224 Aug 14 23:50 /dev/sdo
[17:51:31] <apergos>	   
[17:51:32] <apergos>	 etc
[17:51:49] <apergos>	 ok we get to reboot this and see if we can clean these up.
[17:54:27] <apergos>	 !log rebooting ms-bs10 to see if we can get some reasonable device names, in prep for trying the same trick on it as on ms-be6 and 7
[17:54:36] <morebots>	 Logged the message, Master
[17:55:46] <apergos>	 !log new degradedarray event on ms-be6, booooo
[17:55:55] <morebots>	 Logged the message, Master
[17:56:08] <apergos>	 that might end up being the test box then, don't want to use it for swift
[18:00:38] <nagios-wm>	 PROBLEM - swift-account-server on ms-be10 is CRITICAL: Connection refused by host
[18:00:40] <cmjohnson1>	 faidon and/or mark: ms7 and ms8 have been removed from site.pp are these being decommissioned?
[18:00:56] <nagios-wm>	 PROBLEM - swift-object-server on ms-be10 is CRITICAL: Connection refused by host
[18:01:41] <nagios-wm>	 PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: Connection refused by host
[18:01:41] <nagios-wm>	 PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: Connection refused by host
[18:01:41] <nagios-wm>	 PROBLEM - SSH on ms-be10 is CRITICAL: Connection refused
[18:01:41] <nagios-wm>	 PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: Connection refused by host
[18:01:41] <nagios-wm>	 PROBLEM - swift-container-server on ms-be10 is CRITICAL: Connection refused by host
[18:01:50] <nagios-wm>	 PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: Connection refused by host
[18:02:08] <nagios-wm>	 PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: Connection refused by host
[18:02:08] <nagios-wm>	 PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: Connection refused by host
[18:02:08] <nagios-wm>	 PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: Connection refused by host
[18:02:08] <nagios-wm>	 PROBLEM - swift-container-updater on ms-be10 is CRITICAL: Connection refused by host
[18:02:35] <nagios-wm>	 PROBLEM - swift-object-updater on ms-be10 is CRITICAL: Connection refused by host
[18:02:37] <paravoid>	 cmjohnson1: not yet, that's why they're not in decomissioning.pp
[18:02:44] <paravoid>	 cmjohnson1: but they run Solaris and they don't run puppet for quite some time
[18:03:02] <nagios-wm>	 RECOVERY - SSH on ms-be10 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:12:02] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:13:24] <apergos>	 !log rebooting ms-be10 after recreation and remount of swift filesystems (except those on ssds), see if they stick
[18:13:33] <morebots>	 Logged the message, Master
[18:13:56] <gerrit-wm>	 New patchset: Reedy; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425
[18:14:26] <nagios-wm>	 PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100%
[18:14:48] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425
[18:16:50] <nagios-wm>	 RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[18:17:41] <Krinkle>	 Reedy: trailing ws at line 12
[18:18:01] <Reedy>	 Krinkle: I'm working on a seriously laggy internet connection
[18:18:05] <Reedy>	 Working on a remote server via ssh
[18:18:10] <Krinkle>	 k
[18:20:37] <Krinkle>	 np
[18:20:39] <Krinkle>	 fixed
[18:20:43] <gerrit-wm>	 New patchset: Krinkle; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425
[18:20:47] <apergos>	 so here is an interesting tidbit, reboot and we see all the devices and they mount on ms-be10 *and yet*
[18:21:13] <apergos>	 during the boot I saw on the console whines that devices /dev/sdi,k,n were not ready or not present
[18:21:28] <apergos>	 and did I want to skip or do manual recovery during the attempt to mount)
[18:21:31] <apergos>	 very weird
[18:21:34] <gerrit-wm>	 New review: Krinkle; "* Removed trailing space" [operations/puppet] (production) C: 1;  - https://gerrit.wikimedia.org/r/23425
[18:21:34] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425
[18:22:05] <gerrit-wm>	 New patchset: RobH; "adding in eqiad row c, pmtpa rows c and d" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23426
[18:22:56] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23426
[18:23:17] <gerrit-wm>	 New patchset: Reedy; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425
[18:24:09] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425
[18:24:20] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[18:26:05] <apergos>	 !log first reboot of msbe10 showed complaints of not ready on devices sdi, k, n but they did eventually mount when the box came up fully.  rebooting a second time to see what happens
[18:26:15] <morebots>	 Logged the message, Master
[18:30:20] <nagios-wm>	 RECOVERY - swift-account-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[18:30:20] <nagios-wm>	 RECOVERY - swift-container-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:30:20] <nagios-wm>	 RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:30:20] <nagios-wm>	 RECOVERY - swift-object-auditor on ms-be10 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[18:31:20] <apergos>	 !log second reboot had whines of device not present/not ready for /dev/sdn1 and /dev/sdh1 (note the list is not the same as the previous list), waited a little and it booted up with all disks mounted
[18:31:29] <morebots>	 Logged the message, Master
[18:32:33] <apergos>	 I doublechecked and of course ms-be10 is with boot delay 90 seconds
[18:32:44] <apergos>	 we could increase it to 120 but seriously?? it's ridiculous
[18:34:34] <binasher>	 i wonder if the sound of drive heads helplessly banging back and forth just struggled to be heard over the drone of a million fans
[18:34:59] <nagios-wm>	 PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[18:34:59] <nagios-wm>	 PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:35:08] <nagios-wm>	 PROBLEM - swift-container-server on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:35:08] <nagios-wm>	 PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[18:42:11] <apergos>	 heh
[18:42:27] <apergos>	 starting to collect notes here:
[18:42:28] <apergos>	 http://wikitech.wikimedia.org/view/Swift/Server_issues_Aug-Sept_2012
[18:42:46] <apergos>	 since our rt tickets are really for specific cases with vendors, rather than a general plan
[18:42:56] <apergos>	 lemme see what else is in the trouble list
[18:50:44] <nagios-wm>	 PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[18:57:38] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:12:02] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds
[19:19:19] <gerrit-wm>	 New patchset: RobH; "adding in eqiad row c, pmtpa rows c and d" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23426
[19:20:12] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23426
[19:20:47] <preilly>	 notpeter: ping
[19:21:44] <notpeter>	 sup
[19:22:22] <gerrit-wm>	 New review: RobH; "This patchset relates to RT 3402" [operations/puppet] (production); V: 0 C: 0;  - https://gerrit.wikimedia.org/r/23426
[19:34:33] <nagios-wm>	 PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log,  have not been written in a critical amount of time.  For most logs, this is 4 hours.  For slow logs, this is 4 days.
[19:41:27] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[19:43:24] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:58:42] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.890 seconds
[20:01:33] <nagios-wm>	 RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active
[20:31:51] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:45:39] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.111 seconds
[20:51:49] <RobH>	 !log authdns-update per rt 1326
[20:51:58] <morebots>	 Logged the message, RobH
[21:00:05] <nagios-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[21:15:41] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:15:50] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:15:50] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:15:50] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:15:50] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:15:59] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:15:59] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:16:35] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:16:35] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:16:35] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:16:35] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:16:44] <nagios-wm>	 PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[21:20:20] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:32:29] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.082 seconds
[21:33:05] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:33:14] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:33:23] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:33:23] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:33:50] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:33:50] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:33:50] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:33:59] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:34:08] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:34:08] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:34:35] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa
[21:34:44] <nagios-wm>	 RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa
[22:07:17] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:19:17] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.451 seconds
[22:26:41] <nagios-wm>	 PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[22:26:41] <nagios-wm>	 PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[22:29:16] <notpeter>	 !log removing srv195-199 from apaches pool for upgarde to precise
[22:29:26] <morebots>	 Logged the message, notpeter
[22:36:08] <gerrit-wm>	 New patchset: Pyoungmeister; "NO-OP: removing spare memecache boxes from mc.php that no longer have memecache class" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23477
[22:37:37] <gerrit-wm>	 Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23477
[22:39:08] <nagios-wm>	 PROBLEM - Host srv195 is DOWN: PING CRITICAL - Packet loss = 100%
[22:44:23] <nagios-wm>	 PROBLEM - Host srv196 is DOWN: PING CRITICAL - Packet loss = 100%
[22:44:50] <nagios-wm>	 RECOVERY - Host srv195 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[22:45:35] <nagios-wm>	 PROBLEM - Host srv197 is DOWN: PING CRITICAL - Packet loss = 100%
[22:47:54] <gerrit-wm>	 New patchset: Pyoungmeister; "add eqiad mw boxes to site.pp and removed nfs::upload from applicationserver.pp role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23386
[22:48:44] <nagios-wm>	 PROBLEM - Apache HTTP on srv195 is CRITICAL: Connection refused
[22:48:48] <gerrit-wm>	 New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23386
[22:49:02] <nagios-wm>	 PROBLEM - Memcached on srv195 is CRITICAL: Connection refused
[22:49:56] <nagios-wm>	 RECOVERY - Host srv196 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[22:50:42] <gerrit-wm>	 Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23386
[22:51:17] <nagios-wm>	 PROBLEM - SSH on srv198 is CRITICAL: Connection refused
[22:51:17] <nagios-wm>	 RECOVERY - Host srv197 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[22:51:44] <nagios-wm>	 PROBLEM - Apache HTTP on srv198 is CRITICAL: Connection refused
[22:52:11] <nagios-wm>	 PROBLEM - Memcached on srv198 is CRITICAL: Connection refused
[22:53:23] <nagios-wm>	 PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused
[22:53:32] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:53:50] <nagios-wm>	 PROBLEM - Memcached on srv196 is CRITICAL: Connection refused
[22:55:20] <nagios-wm>	 PROBLEM - Apache HTTP on srv197 is CRITICAL: Connection refused
[22:56:23] <nagios-wm>	 PROBLEM - Memcached on srv197 is CRITICAL: Connection refused
[22:57:35] <nagios-wm>	 RECOVERY - SSH on srv198 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[23:07:20] <nagios-wm>	 PROBLEM - NTP on srv195 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:10:11] <nagios-wm>	 PROBLEM - NTP on srv198 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:10:38] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds
[23:11:23] <nagios-wm>	 RECOVERY - Apache HTTP on srv195 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds
[23:11:50] <nagios-wm>	 PROBLEM - NTP on srv196 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:12:53] <nagios-wm>	 PROBLEM - NTP on srv197 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:20:23] <nagios-wm>	 RECOVERY - Apache HTTP on srv196 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds
[23:20:23] <nagios-wm>	 PROBLEM - Apache HTTP on srv195 is CRITICAL: Connection refused
[23:27:35] <nagios-wm>	 RECOVERY - Apache HTTP on srv197 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.006 seconds
[23:28:47] <nagios-wm>	 PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[23:29:23] <nagios-wm>	 PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused
[23:35:50] <nagios-wm>	 RECOVERY - NTP on srv195 is OK: NTP OK: Offset -0.04674077034 secs
[23:36:26] <nagios-wm>	 RECOVERY - Apache HTTP on srv198 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.011 seconds
[23:36:35] <nagios-wm>	 PROBLEM - Apache HTTP on srv197 is CRITICAL: Connection refused
[23:41:50] <nagios-wm>	 PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[23:42:35] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:43:20] <nagios-wm>	 RECOVERY - NTP on srv196 is OK: NTP OK: Offset -0.04592752457 secs
[23:51:35] <nagios-wm>	 RECOVERY - NTP on srv197 is OK: NTP OK: Offset -0.05046725273 secs
[23:54:44] <nagios-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.227 seconds
[23:58:47] <nagios-wm>	 PROBLEM - LDAP on nfs1 is CRITICAL: Connection refused
[23:59:05] <nagios-wm>	 PROBLEM - LDAPS on nfs1 is CRITICAL: Connection refused
[23:59:14] <nagios-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error
[23:59:32] <nagios-wm>	 PROBLEM - LDAP on nfs2 is CRITICAL: Connection refused
[23:59:41] <nagios-wm>	 PROBLEM - LDAPS on nfs2 is CRITICAL: Connection refused