[00:05:16] New patchset: Ottomata; "Hey! gerrit-stats cronjob! SHHhhhhh!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23384 [00:06:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23384 [00:26:15] New patchset: Pyoungmeister; "add eqiad mw boxes to site.pp and removed nfs::upload from applicationserver.pp role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23386 [00:27:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23386 [00:32:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23384 [00:41:19] New patchset: Faidon; "Remove ms7/ms8 from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23388 [00:42:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23388 [00:42:30] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23388 [00:52:16] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [00:52:16] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [00:52:16] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [00:58:16] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [01:01:07] RECOVERY - Puppet freshness on ms-be1007 is OK: puppet ran at Tue Sep 11 01:00:57 UTC 2012 [01:01:34] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Tue Sep 11 01:01:09 UTC 2012 [01:01:34] RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Tue Sep 11 01:01:16 UTC 2012 [01:01:34] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Tue Sep 11 01:01:20 UTC 2012 [01:02:35] !log adjusting spence's firewall to include all eqiad subnets [01:02:44] Logged the message, Master [01:05:10] RECOVERY - Puppet freshness on es1010 is OK: puppet ran at Tue Sep 11 01:04:37 UTC 2012 [01:05:37] RECOVERY - Puppet freshness on analytics1022 is OK: puppet ran at Tue Sep 11 01:05:18 UTC 2012 [01:05:37] RECOVERY - Puppet freshness on es1009 is OK: puppet ran at Tue Sep 11 01:05:29 UTC 2012 [01:05:37] RECOVERY - Puppet freshness on ms-be1009 is OK: puppet ran at Tue Sep 11 01:05:29 UTC 2012 [01:05:46] RECOVERY - Puppet freshness on analytics1025 is OK: puppet ran at Tue Sep 11 01:05:34 UTC 2012 [01:07:36] New patchset: Aaron Schulz; "Removed code to hide the ETag." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23392 [01:08:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23392 [01:22:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:28] RECOVERY - Puppet freshness on analytics1015 is OK: puppet ran at Tue Sep 11 01:23:11 UTC 2012 [01:24:04] RECOVERY - Puppet freshness on analytics1023 is OK: puppet ran at Tue Sep 11 01:23:36 UTC 2012 [01:24:04] RECOVERY - Puppet freshness on analytics1017 is OK: puppet ran at Tue Sep 11 01:24:02 UTC 2012 [01:24:40] RECOVERY - Puppet freshness on analytics1013 is OK: puppet ran at Tue Sep 11 01:24:17 UTC 2012 [01:24:58] RECOVERY - Puppet freshness on analytics1014 is OK: puppet ran at Tue Sep 11 01:24:47 UTC 2012 [01:27:04] RECOVERY - Puppet freshness on analytics1011 is OK: puppet ran at Tue Sep 11 01:26:44 UTC 2012 [01:27:04] RECOVERY - Puppet freshness on analytics1021 is OK: puppet ran at Tue Sep 11 01:26:53 UTC 2012 [01:27:40] RECOVERY - Puppet freshness on ms-be1006 is OK: puppet ran at Tue Sep 11 01:27:28 UTC 2012 [01:27:40] RECOVERY - Puppet freshness on analytics1012 is OK: puppet ran at Tue Sep 11 01:27:33 UTC 2012 [01:27:58] RECOVERY - Puppet freshness on ms-be1005 is OK: puppet ran at Tue Sep 11 01:27:43 UTC 2012 [01:29:37] RECOVERY - Puppet freshness on analytics1020 is OK: puppet ran at Tue Sep 11 01:29:22 UTC 2012 [01:29:46] RECOVERY - Puppet freshness on analytics1019 is OK: puppet ran at Tue Sep 11 01:29:34 UTC 2012 [01:30:22] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Tue Sep 11 01:30:09 UTC 2012 [01:31:07] RECOVERY - Puppet freshness on analytics1016 is OK: puppet ran at Tue Sep 11 01:31:01 UTC 2012 [01:31:34] RECOVERY - Puppet freshness on es1008 is OK: puppet ran at Tue Sep 11 01:31:29 UTC 2012 [01:31:53] RECOVERY - Puppet freshness on analytics1026 is OK: puppet ran at Tue Sep 11 01:31:37 UTC 2012 [01:31:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.681 seconds [01:32:10] RECOVERY - Puppet freshness on analytics1018 is OK: puppet ran at Tue Sep 11 01:31:59 UTC 2012 [01:35:10] RECOVERY - Puppet freshness on es1007 is OK: puppet ran at Tue Sep 11 01:34:52 UTC 2012 [01:42:53] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 332 seconds [01:43:11] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 318 seconds [01:44:23] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [01:46:11] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 28 seconds [02:08:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.113 seconds [02:25:02] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [02:25:02] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:30:26] New patchset: Krinkle; "misc deployment scripts: Minor clean up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22858 [02:31:19] New review: Krinkle; "Fix syntax error." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22858 [02:31:19] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22858 [03:26:57] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours [03:34:27] PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [04:01:36] RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active [04:44:26] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [04:45:11] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [04:49:41] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [04:58:41] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [06:33:17] PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [06:45:29] RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active [06:54:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:54:47] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [06:54:47] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [06:54:47] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [06:54:47] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [06:54:48] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:49:37] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [09:49:44] PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
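The recurring "udp2log log age for locke" alerts above are a log-freshness check: they fire when the fundraising log file has not been written for longer than the threshold quoted in the alert. A minimal shell sketch of that kind of check, using the path and 4-hour limit from the alert text; the actual plugin on locke may be implemented differently:

    # Flag the log as CRITICAL if it has not been modified in the last 4 hours.
    LOGFILE=/a/squid/fundraising/logs/bannerImpressions-sampled100.log
    MAX_AGE_MIN=240   # "For most logs, this is 4 hours"; slow logs get 4 days
    if [ -z "$(find "$LOGFILE" -mmin -"$MAX_AGE_MIN" 2>/dev/null)" ]; then
        echo "CRITICAL: $LOGFILE has not been written in ${MAX_AGE_MIN} minutes"
        exit 2   # Nagios CRITICAL exit code
    fi
    echo "OK: $LOGFILE is active"
    exit 0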
[10:31:35] RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active [10:44:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:46:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [10:59:29] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:11:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:12:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.067 seconds [11:20:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:23:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.764 seconds [11:59:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:07:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [12:26:12] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [12:26:12] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:40:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.102 seconds [13:27:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:14] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours [13:33:29] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [13:33:29] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [13:33:29] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:33:38] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:33:38] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [13:33:38] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [13:33:56] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:34:23] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [13:34:32] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [13:34:32] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [13:34:36] nagios-wm: quiet [13:34:41] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [13:34:50] PROBLEM 
- swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[13:35:53] jeremyb: nagios-wm is like a person, as much as you want them to stfu no matter how many hints you give them they just keep talking :P
[13:37:44] Damianz: i just want it to know better when something's intentionally "CRITICAL"
[13:38:00] ms-be6 is one of the dead backends,
[13:38:13] these processes were actually shot yesterday but I had to shoot them again today
[13:38:15] for some reason it didn't alert until now though
[13:38:17] I bet puppet restarts them
[13:38:20] it's been like 15 hrs
[13:38:28] well I just now shot them again
[13:38:36] so that's why it just now complained about them
[13:38:36] oh
[13:39:07] you could always stop puppet ;) (but then disable it too in /etc/default/puppet)
[13:39:19] yeah but ugh
[13:39:29] rather not
[13:39:39] then we forget to enable it later and bad things happen
[13:40:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.239 seconds
[13:42:02] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[13:42:02] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[13:42:11] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[13:42:11] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[13:42:20] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[13:42:38] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[13:42:38] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[13:42:38] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[13:42:38] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[13:42:47] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[13:42:56] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[13:43:05] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[13:44:26] knew it. puppet restarts 'em
[13:44:28] grrr
[14:16:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:18:56] !log stopped swift processes on ms-be6 and disabled puppet (/etc/default/puppet) so it won't restart them. still doing testing of hw over there so we need the box up.
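The fix agreed on above — shoot the swift daemons and keep puppet from resurrecting them by disabling it in /etc/default/puppet — amounts to roughly the following on an Ubuntu host of that era. This is a sketch only; the exact commands run on ms-be6 are not shown in the log:

    # Stop the puppet agent and keep a reboot from bringing it back; on Ubuntu of this
    # era /etc/default/puppet carries a START flag read by the init script.
    service puppet stop
    sed -i 's/^START=.*/START=no/' /etc/default/puppet
    # Shut down every swift daemon on the storage backend in one go.
    swift-init all stop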
[14:19:05] Logged the message, Master [14:21:29] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:21:38] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:21:38] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:21:47] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:21:47] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:21:56] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:21:56] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [14:22:05] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:22:05] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:22:14] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:22:23] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:22:23] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:29:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [14:44:31] !log manually recreated filesystems and mounted missing drives on ms-be6 (without ssds cabled), seems to have worked without errors. rebooting to see what happens. [14:44:40] Logged the message, Master [14:50:26] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused [14:52:04] rats. still too sleepy. 
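For a single data disk, "recreated filesystems and mounted missing drives" would look roughly like the sketch below, assuming XFS (the usual choice for swift backends). The device name, label and mount point are placeholders; the log does not show the actual layout or mkfs options used on ms-be6:

    # Placeholder device/label/mountpoint; this destroys whatever was on the partition.
    mkfs.xfs -f -L sdc1 /dev/sdc1
    mkdir -p /srv/swift-storage/sdc1
    mount -L sdc1 /srv/swift-storage/sdc1   # mounting by label survives device renumbering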
[14:52:10] should have done that from management console [14:52:19] * apergos waits for awhile anyways [14:56:44] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:56:44] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [14:56:54] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:56:54] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:56:54] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:57:20] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:57:20] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:57:20] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:57:38] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:57:38] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:57:38] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:57:56] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:57:56] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:58:40] grrr [14:58:58] stopped them again [15:01:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:08] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:02:08] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [15:02:08] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:02:35] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:02:44] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:02:44] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:03:02] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:03:02] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:03:10] hush! 
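The swift PROBLEM/RECOVERY lines filling the channel come from Nagios process checks that match each daemon's command line against a regex. The service definitions are not shown here, but an equivalent check_procs call would look roughly like this; the regex is taken from the alert text, the threshold is an assumption:

    # CRITICAL if no matching process is found ("0 processes" in the alerts above).
    /usr/lib/nagios/plugins/check_procs -c 1: \
        --ereg-argument-array='^/usr/bin/python /usr/bin/swift-object-server'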
[15:03:11] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:03:18] I'll fix you later [15:03:20] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:03:20] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:03:20] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:06:21] New patchset: Pyoungmeister; "quieting search result mover cron on oxygen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23417 [15:07:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23417 [15:09:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23417 [15:12:04] apergos: ok, good. I'm glad this is a controlled thing. i was just loking at nagios and got a little nervous... [15:12:04] !log redid other ms-be6 filesystems so labels would be correct with disk layout without ssds. reboot again to see if it comes up properly [15:12:13] Logged the message, Master [15:14:35] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:02] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:15:02] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:15:11] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:15:11] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:15:11] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:15:29] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:15:29] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:15:29] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:15:29] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:15:36] * jeremyb hands Guest19048 a /nick ;) [15:15:42] !log stopping swift processes on ms-be6 til we decide what's next; all disks came up and were mounted (but we still do have degraded raid array, not so awesome) [15:15:51] Logged the message, Master [15:18:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds [15:19:32] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:19:32] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:19:41] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python 
/usr/bin/swift-account-auditor [15:19:41] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:19:59] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:19:59] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:19:59] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:19:59] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:23:02] New patchset: Alex Monk; "(bug 40163) Try to fix ltwiki import source for betawikiversity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23419 [15:30:47] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 190 seconds [15:31:23] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 209 seconds [15:32:44] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 202 seconds [15:32:44] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 202 seconds [15:35:44] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds [15:35:44] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [15:40:39] New review: Jeremyb; "This will work." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/23419 [15:42:11] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 182 seconds [15:43:05] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 212 seconds [15:46:05] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [15:46:41] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [15:51:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:34] New patchset: Hashar; "manpages for our misc scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [15:56:26] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16606
[15:57:52] !log rebuilding the degraded raid array on ms-be6
[15:58:01] Logged the message, Master
[16:02:20] New review: Jeroen De Dauw; "Does what it says" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/23392
[16:05:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[16:24:37] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 201 seconds
[16:25:13] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 213 seconds
[16:27:10] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 191 seconds
[16:27:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 205 seconds
[16:37:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:48:37] PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[16:49:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.056 seconds
[16:55:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[16:55:40] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[16:55:40] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[16:55:40] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[16:55:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[16:55:41] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[16:56:44] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[16:57:28] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[17:04:49] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[17:04:58] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds
[17:10:43] !log one more reboot of ms-be6 after raid repaired, see if it and all mounts stay ok
[17:10:52] Logged the message, Master
[17:11:21] apergos: so what happened with ms-be6?
[17:11:29] I've kinda lost track, with you doing things to it since 5am :P
[17:11:35] you weren't watching the log? :-P
[17:11:46] I'm lazy, it's easier to ask
[17:11:58] the log is intermixed with a bunch of nagios alerts as well :-)
[17:12:00] I'm lazy, it's easier to paste you the link :-P
[17:12:11] recreated all filesystems manually
[17:12:24] then mounted them all manually
[17:12:28] then rebooted (so far so good)
[17:12:37] then repaired the raid array of sda and sdb
[17:12:37] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[17:12:44] note that this is *without* the ssds cabled up
[17:13:00] now rebooting one last time to see if it keeps everything up
[17:13:13] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[17:13:18] if it does I'll try this same thing on a host with the ssds
[17:13:22] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:13:22] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[17:13:49] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:13:49] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[17:13:49] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:14:07] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[17:14:07] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[17:14:07] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:14:25] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:14:34] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:14:34] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[17:14:34] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:15:55] RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active
[17:16:35] New review: Hashar; "PS7: doc for clear-profile" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16606
[17:17:52] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:18:19] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:18:19] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[17:18:19] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:18:37] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[17:18:37] PROBLEM -
swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:18:37] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:18:55] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:19:04] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:19:04] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:19:04] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:19:13] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:19:46] paravoid: more swift issue ^^^^^^ [17:24:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:40] hashar: ms-be6 is being worked on by apergos, there are some log entries [17:25:45] hashar: but thanks :-) [17:25:49] :-) [17:28:31] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:29:34] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:29:34] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:29:34] RECOVERY - swift-container-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:29:43] RECOVERY - swift-object-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:29:52] RECOVERY - swift-container-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:29:52] RECOVERY - swift-account-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:30:01] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:30:01] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:31:32] !log puppet disabled on ms-be7 via /etc/default/puppet, swift processes shut down while we test [17:31:41] Logged the message, Master [17:34:04] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:34:04] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:34:04] PROBLEM - swift-container-auditor on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:34:13] PROBLEM - swift-object-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:34:22] PROBLEM - swift-account-auditor on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes 
with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[17:34:22] PROBLEM - swift-container-updater on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:34:31] PROBLEM - swift-object-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:34:31] PROBLEM - swift-container-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:34:40] PROBLEM - swift-account-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[17:39:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds
[17:42:49] !log rebooting ms-be7 after recreating all filesystems (except those on the ssds) and remounting them, to see if the disks stay visible
[17:42:58] Logged the message, Master
[17:43:59] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:02] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:45:02] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:45:02] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:45:11] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[17:45:13] looky that it came up
[17:45:49] I've stopped the swift processes on the box again. I'll leave it up I guess
[17:45:52] apergos: w/out having to skip through mounts?
[17:46:03] <^demon> apergos: Are you shocked, shocked that it came up?
[17:46:09] yes, well I recreated all the filesystems by hand
[17:46:14] and then mounted manually
[17:46:36] ^demon: not exactly
[17:46:42] that would imply that I expected it to come up
[17:46:51] but really I expect these boxes to be broken (because they are)
[17:47:18] ok, I'll try rescuing one more box so we can use it to beat on, and then we'll be done for now
[17:47:19] <^demon> apergos: Well, then you're pleasantly surprised :)
[17:47:24] yes
[17:48:14] I think I'll leave ms-be8 for dell
[17:48:17] it's powered off anyways
[17:48:41] ms-be10 can be our playtoy (if I get it to come back up)
[17:49:41] PROBLEM - swift-object-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:49:41] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:49:41] PROBLEM - swift-container-server on ms-be7 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:50:35] maybe I'm leaving ms-be10 for dell instead
[17:51:06] oh for cripes sake
[17:51:12] root@ms-be10:~# ls -l /dev/sd*
[17:51:23] has...
[17:51:31] brw-rw---- 1 root disk 8, 224 Aug 14 23:50 /dev/sdo
[17:51:31]
[17:51:32] etc
[17:51:49] ok we get to reboot this and see if we can clean these up.
[17:54:27] !log rebooting ms-be10 to see if we can get some reasonable device names, in prep for trying the same trick on it as on ms-be6 and 7
[17:54:36] Logged the message, Master
[17:55:46] !log new degradedarray event on ms-be6, booooo
[17:55:55] Logged the message, Master
[17:56:08] that might end up being the test box then, don't want to use it for swift
[18:00:38] PROBLEM - swift-account-server on ms-be10 is CRITICAL: Connection refused by host
[18:00:40] faidon and/or mark: ms7 and ms8 have been removed from site.pp are these being decommissioned?
[18:00:56] PROBLEM - swift-object-server on ms-be10 is CRITICAL: Connection refused by host
[18:01:41] PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: Connection refused by host
[18:01:41] PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: Connection refused by host
[18:01:41] PROBLEM - SSH on ms-be10 is CRITICAL: Connection refused
[18:01:41] PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: Connection refused by host
[18:01:41] PROBLEM - swift-container-server on ms-be10 is CRITICAL: Connection refused by host
[18:01:50] PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: Connection refused by host
[18:02:08] PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: Connection refused by host
[18:02:08] PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: Connection refused by host
[18:02:08] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: Connection refused by host
[18:02:08] PROBLEM - swift-container-updater on ms-be10 is CRITICAL: Connection refused by host
[18:02:35] PROBLEM - swift-object-updater on ms-be10 is CRITICAL: Connection refused by host
[18:02:37] cmjohnson1: not yet, that's why they're not in decomissioning.pp
[18:02:44] cmjohnson1: but they run Solaris and they don't run puppet for quite some time
[18:03:02] RECOVERY - SSH on ms-be10 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:12:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:13:24] !log rebooting ms-be10 after recreation and remount of swift filesystems (except those on ssds), see if they stick
[18:13:33] Logged the message, Master
[18:13:56] New patchset: Reedy; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425
[18:14:26] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100%
[18:14:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425
[18:16:50] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[18:17:41] Reedy: trailing ws at line 12
[18:18:01] Krinkle: I'm working on a seriously laggy internet connection
[18:18:05] Working on a remote server via ssh
[18:18:10] k
[18:20:37] np
[18:20:39] fixed
[18:20:43] New patchset: Krinkle; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425
[18:20:47] so here is an interesting tidbit, reboot and we see all the devices and they mount on ms-be10 *and yet*
[18:21:13] during the boot I saw on the console whines that devices /dev/sdi,k,n were not ready or not present
[18:21:28] and did I want to skip or do manual recovery during the attempt to mount)
[18:21:31] very weird
[18:21:34] New patchset: Krinkle; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425
[18:21:34] New review: Krinkle; "* Removed trailing space" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/23425
[18:21:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425
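For the degraded software-RAID array on ms-be6 (repaired earlier, with a new degradedarray event logged just above), the usual md repair sequence is sketched below. The array and partition names are placeholders; the log does not identify them:

    cat /proc/mdstat                    # see which array is degraded and its sync state
    mdadm --detail /dev/md0             # confirm the failed or missing member (md0 is a placeholder)
    mdadm /dev/md0 --add /dev/sdb1      # re-add the member; the array resyncs in the background
    watch -n 60 cat /proc/mdstat        # follow the rebuild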
[18:22:05] New patchset: RobH; "adding in eqiad row c, pmtpa rows c and d" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23426
[18:22:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23426
[18:23:17] New patchset: Reedy; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425
[18:24:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425
[18:24:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[18:26:05] !log first reboot of ms-be10 showed complaints of not ready on devices sdi, k, n but they did eventually mount when the box came up fully. rebooting a second time to see what happens
[18:26:15] Logged the message, Master
[18:30:20] RECOVERY - swift-account-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[18:30:20] RECOVERY - swift-container-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:30:20] RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:30:20] RECOVERY - swift-object-auditor on ms-be10 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[18:31:20] !log second reboot had whines of device not present/not ready for /dev/sdn1 and /dev/sdh1 (note the list is not the same as the previous list), waited a little and it booted up with all disks mounted
[18:31:29] Logged the message, Master
[18:32:33] I doublechecked and of course ms-be10 is with boot delay 90 seconds
[18:32:44] we could increase it to 120 but seriously?? it's ridiculous
[18:34:34] i wonder if the sound of drive heads helplessly banging back and forth just struggled to be heard over the drone of a million fans
[18:34:59] PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[18:34:59] PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:35:08] PROBLEM - swift-container-server on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:35:08] PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[18:42:11] heh
[18:42:27] starting to collect notes here:
[18:42:28] http://wikitech.wikimedia.org/view/Swift/Server_issues_Aug-Sept_2012
[18:42:46] since our rt tickets are really for specific cases with vendors, rather than a general plan
[18:42:56] lemme see what else is in the trouble list
[18:50:44] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[18:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:12:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds
[19:19:19] New patchset: RobH; "adding in eqiad row c, pmtpa rows c and d" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23426
[19:20:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23426
[19:20:47] notpeter: ping
[19:21:44] sup
[19:22:22] New review: RobH; "This patchset relates to RT 3402" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/23426
[19:34:33] PROBLEM - udp2log log age for locke on locke is CRITICAL: CRITICAL: log files /a/squid/fundraising/logs/bannerImpressions-sampled100.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
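One common mitigation for the boot-time "not ready or not present ... skip or manual recovery" prompts described above, other than raising the 90-second boot delay further, is to mark the data mounts as non-blocking in /etc/fstab so a slow disk only delays its own mount. A sketch with placeholder label and mount point; this is not necessarily what was done here:

    # /etc/fstab entry for one swift data disk; "nobootwait" was the Ubuntu/mountall
    # option of this era, "nofail" is the generic equivalent.
    LABEL=sdc1  /srv/swift-storage/sdc1  xfs  noatime,nodiratime,nobootwait  0  0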
[19:41:27] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [19:43:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.890 seconds [20:01:33] RECOVERY - udp2log log age for locke on locke is OK: OK: all log files active [20:31:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:45:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.111 seconds [20:51:49] !log authdns-update per rt 1326 [20:51:58] Logged the message, RobH [21:00:05] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [21:15:41] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:15:50] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:15:50] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:15:50] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:15:50] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:15:59] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:15:59] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:16:35] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:16:35] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:16:35] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:16:35] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:16:44] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:20:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.082 seconds [21:33:05] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:14] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:23] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:23] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:50] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:50] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:50] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:59] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [21:34:08] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes 
with command name varnishncsa [21:34:08] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [21:34:35] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [21:34:44] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [22:07:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:19:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.451 seconds [22:26:41] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [22:26:41] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [22:29:16] !log removing srv195-199 from apaches pool for upgarde to precise [22:29:26] Logged the message, notpeter [22:36:08] New patchset: Pyoungmeister; "NO-OP: removing spare memecache boxes from mc.php that no longer have memecache class" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23477 [22:37:37] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23477 [22:39:08] PROBLEM - Host srv195 is DOWN: PING CRITICAL - Packet loss = 100% [22:44:23] PROBLEM - Host srv196 is DOWN: PING CRITICAL - Packet loss = 100% [22:44:50] RECOVERY - Host srv195 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [22:45:35] PROBLEM - Host srv197 is DOWN: PING CRITICAL - Packet loss = 100% [22:47:54] New patchset: Pyoungmeister; "add eqiad mw boxes to site.pp and removed nfs::upload from applicationserver.pp role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23386 [22:48:44] PROBLEM - Apache HTTP on srv195 is CRITICAL: Connection refused [22:48:48] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23386 [22:49:02] PROBLEM - Memcached on srv195 is CRITICAL: Connection refused [22:49:56] RECOVERY - Host srv196 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [22:50:42] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23386 [22:51:17] PROBLEM - SSH on srv198 is CRITICAL: Connection refused [22:51:17] RECOVERY - Host srv197 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [22:51:44] PROBLEM - Apache HTTP on srv198 is CRITICAL: Connection refused [22:52:11] PROBLEM - Memcached on srv198 is CRITICAL: Connection refused [22:53:23] PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused [22:53:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:50] PROBLEM - Memcached on srv196 is CRITICAL: Connection refused [22:55:20] PROBLEM - Apache HTTP on srv197 is CRITICAL: Connection refused [22:56:23] PROBLEM - Memcached on srv197 is CRITICAL: Connection refused [22:57:35] RECOVERY - SSH on srv198 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:07:20] PROBLEM - NTP on srv195 is CRITICAL: NTP CRITICAL: No response from NTP server [23:10:11] PROBLEM - NTP on srv198 is CRITICAL: NTP CRITICAL: No response from NTP server [23:10:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [23:11:23] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [23:11:50] PROBLEM - NTP on srv196 is CRITICAL: NTP CRITICAL: No response from NTP server [23:12:53] PROBLEM - NTP on srv197 is CRITICAL: NTP CRITICAL: No response from NTP server [23:20:23] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [23:20:23] PROBLEM - Apache HTTP on srv195 is CRITICAL: Connection refused [23:27:35] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.006 seconds [23:28:47] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours [23:29:23] PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused [23:35:50] RECOVERY - NTP on srv195 is OK: NTP OK: Offset -0.04674077034 secs [23:36:26] RECOVERY - Apache HTTP on srv198 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.011 seconds [23:36:35] PROBLEM - Apache HTTP on srv197 is CRITICAL: Connection refused [23:41:50] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [23:42:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:43:20] RECOVERY - NTP on srv196 is OK: NTP OK: Offset -0.04592752457 secs [23:51:35] RECOVERY - NTP on srv197 is OK: NTP OK: Offset -0.05046725273 secs [23:54:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.227 seconds [23:58:47] PROBLEM - LDAP on nfs1 is CRITICAL: Connection refused [23:59:05] PROBLEM - LDAPS on nfs1 is CRITICAL: Connection refused [23:59:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [23:59:32] PROBLEM - LDAP on nfs2 is CRITICAL: Connection refused [23:59:41] PROBLEM - LDAPS on nfs2 is CRITICAL: Connection refused