[00:51:10] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[00:51:55] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[01:27:10] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 195 seconds
[01:27:55] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 221 seconds
[01:34:58] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 203 seconds
[01:44:07] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 269 seconds
[01:48:43] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 4 seconds
[02:10:46] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 3 seconds
[02:10:55] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[02:25:10] PROBLEM - Host ssl3 is DOWN: PING CRITICAL - Packet loss = 100%
[02:30:09] * jeremyb waits a couple mins to see if that recovers...
[02:34:36] hrmm, binasher is the least idle of the people I just checked, but even he is nearly 2 hrs idle
[02:38:35] oh?
[02:40:11] paravoid: ssl3
[02:40:20] on it
[02:47:10] !log powercycled ssl3
[02:47:14] Logged the message, Master
[02:49:28] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[02:53:27] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=descending&c=SSL+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=medium&hc=2 is happier ;-)
[02:53:39] thanks paravoid ;)
[02:56:35] !log rebooting ssl2 (has 214 days uptime)
[02:56:38] Logged the message, Master
[02:58:28] PROBLEM - Host ssl2 is DOWN: PING CRITICAL - Packet loss = 100%
[02:59:51] RECOVERY - Host ssl2 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:04:04] that's enough for a Saturday I guess :-)
[03:05:23] paravoid: I just hope it's an SF Saturday ;)
[03:05:37] when do you go back?
[03:08:38] May 13th
[03:55:22] PROBLEM - Host mw10 is DOWN: PING CRITICAL - Packet loss = 100%
[04:54:28] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:01:22] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours
[12:07:55] !log powercycling mw30
[12:07:58] Logged the message, Master
[12:11:31] RECOVERY - Host mw30 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[12:13:46] RECOVERY - Host mw32 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[12:13:55] RECOVERY - Host mw33 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[12:14:31] PROBLEM - Apache HTTP on mw30 is CRITICAL: Connection refused
[12:15:35] !log powercycling mw32, mw33, mw44, mw46 one by one; they were all frozen and went down roughly 17 to 24 hours ago
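
The db1007 and storage3 alerts above come from a replication-lag check that compares a replica's delay against warning and critical thresholds and reports it in the "CRIT replication delay N seconds" format seen here. The actual plugin, credentials and thresholds are not part of this log, so the following is only a rough sketch under assumed values:

#!/usr/bin/env python
"""Minimal sketch of a Nagios-style MySQL replication lag probe.

Hypothetical: the real check used on these hosts is not shown in the log;
credentials are assumed to come from a defaults file (e.g. ~/.my.cnf) and
the 30s/180s thresholds below are assumptions.
"""
import subprocess
import sys

WARN_SECONDS = 30    # assumed warning threshold
CRIT_SECONDS = 180   # assumed critical threshold

def slave_lag():
    """Return Seconds_Behind_Master as an int, or None if replication is broken."""
    out = subprocess.check_output(
        ["mysql", "--batch", "-e", "SHOW SLAVE STATUS\\G"]
    ).decode()
    for line in out.splitlines():
        if "Seconds_Behind_Master" in line:
            value = line.split(":", 1)[1].strip()
            return None if value == "NULL" else int(value)
    return None

def main():
    lag = slave_lag()
    if lag is None:
        print("CRIT replication broken or host is not a replica")
        sys.exit(2)
    if lag >= CRIT_SECONDS:
        print("CRIT replication delay %d seconds" % lag)
        sys.exit(2)
    if lag >= WARN_SECONDS:
        print("WARN replication delay %d seconds" % lag)
        sys.exit(1)
    print("OK replication delay %d seconds" % lag)
    sys.exit(0)

if __name__ == "__main__":
    main()

Run on a replica, the exit codes 0/1/2 map onto the OK/WARNING/CRITICAL states Nagios reports in the lines above.
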
[12:16:46] PROBLEM - Apache HTTP on mw32 is CRITICAL: Connection refused
[12:17:31] PROBLEM - Apache HTTP on mw33 is CRITICAL: Connection refused
[12:17:58] RECOVERY - Host mw44 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[12:19:38] RECOVERY - Host mw46 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[12:20:25] PROBLEM - Apache HTTP on mw46 is CRITICAL: Connection refused
[12:20:25] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[12:21:10] PROBLEM - NTP on mw46 is CRITICAL: NTP CRITICAL: Offset unknown
[12:21:19] PROBLEM - Apache HTTP on mw44 is CRITICAL: Connection refused
[12:21:19] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time
[12:21:59] !log check_all_memcacheds recovered, but mw10 and mw11 (down 8 and 15 hours ago) still need the same treatment
[12:22:02] Logged the message, Master
[12:22:49] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time
[12:24:10] RECOVERY - NTP on mw46 is OK: NTP OK: Offset -0.02202630043 secs
[12:24:19] RECOVERY - Host mw10 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[12:26:52] RECOVERY - Host mw11 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[12:28:13] PROBLEM - Apache HTTP on mw10 is CRITICAL: Connection refused
[12:29:43] PROBLEM - Apache HTTP on mw11 is CRITICAL: Connection refused
[12:31:31] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[12:34:22] !log and finally mw1; leaving just mw1102 and mw60, which have had other issues for a while (-> Nagios)
[12:34:25] Logged the message, Master
[12:37:13] RECOVERY - Host mw1 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[12:39:46] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time
[12:40:04] PROBLEM - Apache HTTP on mw1 is CRITICAL: Connection refused
[12:42:46] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time
[12:44:07] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time
[12:55:05] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time
[13:08:35] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time
[14:21:56] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 214 seconds
[14:22:14] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 222 seconds
[14:37:59] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[14:39:11] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[14:55:50] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:35] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:42:53] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.45:11000 (timeout)
[15:44:14] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[16:24:25] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.48:11000 (timeout)
[16:25:55] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[16:32:31] RECOVERY - Host ssl3001 is UP: PING OK - Packet loss = 0%, RTA = 109.08 ms
[16:34:02] !log powercycling the ssl300x.esams hosts. 212 days of uptime... (and 3001 had gone out to lunch)
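
check_all_memcacheds, which flaps repeatedly through the afternoon, reports exactly which memcached instance it could not reach, in "ip:port (reason)" form. The real plugin and its instance list are not shown here; as an illustration only, with placeholder instances taken from the alerts above, a check of that shape could look like:

#!/usr/bin/env python
"""Sketch of a check_all_memcacheds-style probe.

Hypothetical example: the real plugin and its host list are not part of
this log; the instances and timeout below are placeholders/assumptions.
"""
import socket
import sys

INSTANCES = [("10.0.11.45", 11000), ("10.0.11.48", 11000)]  # placeholders
TIMEOUT = 2.0  # seconds; assumed

def probe(host, port):
    """Return None on success, or a short error string on failure."""
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT) as sock:
            sock.sendall(b"version\r\n")
            if not sock.recv(128).startswith(b"VERSION"):
                return "bad response"
        return None
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return str(exc)

failures = []
for host, port in INSTANCES:
    err = probe(host, port)
    if err is not None:
        failures.append("%s:%d (%s)" % (host, port, err))

if failures:
    print("MEMCACHED CRITICAL - Could not connect: " + " ".join(failures))
    sys.exit(2)
print("MEMCACHED OK - All memcacheds are online")
sys.exit(0)
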
[16:34:05] Logged the message, Master
[16:35:26] 3001 done, waiting on 3002 now
[16:35:49] PROBLEM - Host ssl3002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:35:59] very informative, thanks
[16:36:34] PROBLEM - HTTPS on ssl3001 is CRITICAL: Connection refused
[16:37:37] RECOVERY - Host ssl3002 is UP: PING OK - Packet loss = 0%, RTA = 108.59 ms
[16:39:19] guess I ought to wait for HTTPS to show as up on both of those before moving on
[16:58:32] huh, it's fine on ssl3002 but 3001 is whinging
[17:00:14] well, I'll do 3003, it needs it anyway
[17:04:08] doing 3004 now
[17:06:25] PROBLEM - Host ssl3004 is DOWN: CRITICAL - Host Unreachable (91.198.174.105)
[17:20:36] updated the kernel on 3001 and rebooting it again, let's see if it's happier
[17:25:48] nope
[17:26:18] a bunch of ipv6-related complaints
[18:02:10] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours
[18:02:23] !log ok, the ssl300x situation: ssl3001 is now disabled in the pybal conf file on fenari; it is picking up the ipv6and4labs template and I don't know if that's right, and anyway nginx doesn't want to bind to one of those addresses. ssl3004 isn't reachable or pingable even via mgmt, but at least lvs sees it's gone
[18:02:26] Logged the message, Master
[18:05:49] !log ssl3002 and 3003 were rebooted and are the entire ssl esams pool right now
[18:05:52] Logged the message, Master
[18:17:25] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.39:11000 (timeout)
[18:18:46] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[18:22:58] PROBLEM - Host mw48 is DOWN: PING CRITICAL - Packet loss = 100%
[18:24:37] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.48:11000 (Connection timed out) 10.0.11.35:11000 (timeout)
[18:55:58] PROBLEM - Host mw52 is DOWN: PING CRITICAL - Packet loss = 100%
[19:04:49] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[19:05:52] RECOVERY - Host mw48 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[19:09:55] PROBLEM - Apache HTTP on mw48 is CRITICAL: Connection refused
[19:12:01] RECOVERY - Host mw52 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[19:12:10] PROBLEM - Apache HTTP on mw52 is CRITICAL: Connection refused
[19:12:58] !log power cycled mw48 and mw52 (hung just like the others)
[19:13:01] Logged the message, Master
[19:13:40] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time
[19:19:40] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.234:11000 (Connection refused)
[19:21:10] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[19:21:55] PROBLEM - Apache HTTP on srv228 is CRITICAL: Connection refused
[19:22:13] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time
[19:23:16] PROBLEM - Apache HTTP on srv234 is CRITICAL: Connection refused
[19:26:52] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.12:11000 (Connection refused)
[19:28:13] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[19:28:58] PROBLEM - Apache HTTP on srv262 is CRITICAL: Connection refused
[19:33:19] RECOVERY - Apache HTTP on srv262 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time
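
The recurring "Apache HTTP ... Connection refused" / "HTTP OK - HTTP/1.1 301 Moved Permanently" pairs are a liveness probe: any HTTP response, including the 301 these apaches return for /, counts as OK, while a refused connection is CRITICAL. The exact Nagios check and its arguments are not in the log, so this is only a sketch with an assumed target host and timeout:

#!/usr/bin/env python
"""Sketch of an "is Apache answering HTTP" probe.

Hypothetical: the real check and its arguments are not in the log; the
target host and timeout are assumptions. http.client does not follow
redirects, so a 301 is reported (and treated) as a healthy response.
"""
import http.client
import sys
import time

HOST = "mw30"   # placeholder target
TIMEOUT = 10    # seconds; assumed

start = time.time()
conn = http.client.HTTPConnection(HOST, 80, timeout=TIMEOUT)
try:
    conn.request("GET", "/")
    resp = conn.getresponse()
except ConnectionRefusedError:
    print("Apache HTTP on %s is CRITICAL: Connection refused" % HOST)
    sys.exit(2)
except OSError as exc:
    print("Apache HTTP on %s is CRITICAL: %s" % (HOST, exc))
    sys.exit(2)
elapsed = time.time() - start
print("HTTP OK - HTTP/1.1 %d %s - %.3f second response time"
      % (resp.status, resp.reason, elapsed))
sys.exit(0)
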
[19:33:55] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.19:11000 (timeout) 10.0.2.239:11000 (Connection refused)
[19:35:25] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[19:37:31] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time
[19:37:40] PROBLEM - Apache HTTP on srv239 is CRITICAL: Connection refused
[19:39:01] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time
[19:39:01] RECOVERY - Host mw60 is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms
[19:39:27] apergos: the script hung
[19:39:33] apergos: so, I made it continue
[19:39:37] you'll see some reboots
[19:39:39] great
[19:39:44] I thought there was something weird
[19:39:54] the hosts I'm checking aren't rebooting, they were just dead
[19:39:59] ahm, since you're here
[19:40:12] ssl3001 through 4 were all at 212
[19:40:23] 212?
[19:40:26] load?
[19:40:40] days. I can't get ssl3004 to come back: not reachable by mgmt even, I can't get to the mgmt interface
[19:40:43] 212 days up
[19:40:45] ah
[19:41:12] ssl3001 has something very bizarre: it has an ipv6 interface that appears to come right out of the ipv4and6labs template in puppet
[19:41:19] nginx doesn't want to bind to it
[19:41:37] I finally gave up and pulled it from the pybal conf at /home/w/conf/pyba/blah on fenari
[19:41:39] this is all logged
[19:41:56] but that last one is so weird I figure you might know what's up with it
[19:42:01] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused
[19:42:02] hm. nope
[19:42:06] I can't get to it either
[19:42:11] ssl3001?
[19:42:15] 3004
[19:42:28] 3001 is the one I thought you might be able to decipher
[19:42:42] the other one, no clue whatsoever; it was reachable for me to reboot it and after that, *poof*
[19:42:57] anyways 2 and 3 are done and in service but they are all we got
[19:43:10] weird
[19:43:31] mw60 is complaining that neon isn't allowed to talk to it
[19:43:40] dunno what that's about (I power cycled that one too, it was dead)
[19:44:54] oh. it's way out of date on puppet, looks like
[19:45:17] some other host has the IP bound
[19:46:01] maerlant
[19:46:02] weird
[19:46:13] PROBLEM - Apache HTTP on srv238 is CRITICAL: Connection refused
[19:46:30] !log powercycled mw60, same reason as the rest
[19:46:33] Logged the message, Master
[19:46:46] what is maerlant?
[19:47:16] one of the old ssl boxes
[19:47:28] huh
[19:47:52] PROBLEM - RAID on mw60 is CRITICAL: Connection refused by host
[19:48:10] PROBLEM - DPKG on mw60 is CRITICAL: Connection refused by host
[19:48:37] PROBLEM - Disk space on mw60 is CRITICAL: Connection refused by host
[19:49:13] PROBLEM - Host srv233 is DOWN: PING CRITICAL - Packet loss = 100%
[19:50:16] !log repooling ssl3001
[19:50:18] Logged the message, Master
[19:50:24] thanks
[19:50:34] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[19:51:01] RECOVERY - HTTPS on ssl3001 is OK: OK - Certificate will expire on 07/19/2016 16:14.
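
The HTTPS recovery for ssl3001 is a certificate check: it completes a TLS handshake, pulls the server certificate and reports its expiry date. The real plugin is not shown here; a minimal sketch, with the host name and warning window assumed, might be:

#!/usr/bin/env python
"""Sketch of an HTTPS certificate-expiry probe.

Hypothetical: the actual plugin behind the ssl3001 check is not in the
log; the host name and 30-day warning window are assumptions.
"""
import datetime
import socket
import ssl
import sys

HOST = "ssl3001.esams.wikimedia.org"  # placeholder host name
WARN_DAYS = 30                        # assumed warning window

ctx = ssl.create_default_context()
try:
    with socket.create_connection((HOST, 443), timeout=10) as raw:
        with ctx.wrap_socket(raw, server_hostname=HOST) as tls:
            cert = tls.getpeercert()
except OSError as exc:
    print("HTTPS on %s is CRITICAL: %s" % (HOST, exc))
    sys.exit(2)

# 'notAfter' looks like 'Jul 19 16:14:00 2016 GMT'
expires = datetime.datetime.utcfromtimestamp(
    ssl.cert_time_to_seconds(cert["notAfter"]))
days_left = (expires - datetime.datetime.utcnow()).days
if days_left < WARN_DAYS:
    print("WARNING - Certificate expires in %d days (%s)" % (days_left, expires))
    sys.exit(1)
print("OK - Certificate will expire on %s" % expires.strftime("%m/%d/%Y %H:%M"))
sys.exit(0)

The Nagios line at 19:51:01 is essentially the OK branch of a check of this shape.
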
[19:51:02] RECOVERY - Host srv233 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[19:51:02] !log removed the ipv6 addresses from maerlant and added them to ssl3001, then restarted nginx
[19:51:05] Logged the message, Master
[19:51:21] I figured there was some sort of mixup
[19:51:32] ssl3004 is down?
[19:51:37] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time
[19:51:45] 3004 is the one that's not reachable
[19:51:51] it's still pooled
[19:51:56] !log depooling ssl3004
[19:51:56] it is?
[19:51:58] Logged the message, Master
[19:52:08] it was showing as not pooled last I looked at lvs/pybal
[19:52:17] well, it's pooled in the file
[19:52:23] that's what I meant :)
[19:52:25] that's ok, it will check and fail the check
[19:52:32] yeah
[19:52:42] better to have it specifically depooled though
[19:53:32] I thought we want to leave them in generally
[19:53:34] RECOVERY - RAID on mw60 is OK: OK: no RAID installed
[19:53:43] so we don't have to remember later when we figure out what's wrong with the host
[19:54:01] RECOVERY - DPKG on mw60 is OK: All packages OK
[19:54:11] hm. maybe so
[19:54:19] RECOVERY - Disk space on mw60 is OK: DISK OK
[19:54:28] and there's mw60, a bit happier
[19:54:37] PROBLEM - Apache HTTP on srv233 is CRITICAL: Connection refused
[19:55:52] !power cycling srv206 (blah blah)
[19:59:43] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout)
[19:59:51] I wonder if the script hung on a host that died in the meantime
[20:00:19] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.256 second response time
[20:00:25] could be
[20:00:28] well
[20:00:29] no
[20:00:35] it was hung on a host requesting a password
[20:00:43] oh
[20:00:46] heh
[20:00:50] I just needed to hit enter
[20:00:58] I rebooted that one manually
[20:01:02] err: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve information from source(s) puppet://puppet/plugins
[20:01:04] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[20:01:13] PROBLEM - Apache HTTP on srv227 is CRITICAL: Connection refused
[20:01:17] what is the fix for this again? because puppet won't run on this host till it's cleared up
[20:02:09] hm
[20:02:16] which host is this?
[20:02:20] err: Could not run Puppet configuration client: Invalid parameter system at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:50 (that's the last message before it gives up)
[20:02:26] srv206
[20:02:47] after that there are only 6 hosts left that are down
[20:02:52] I might leave them for tomorrow
[20:03:18] I've never seen this before
[20:03:58] ugh
[20:04:41] I figure it could be a bad yaml file or whatever, but I don't remember how to fix that
[20:05:06] oh
[20:05:07] likely
[20:05:09] on the server
[20:05:11] gimme a sec
[20:05:16] if you do it
[20:05:20] I won't know how to do it :-P
[20:05:25] PROBLEM - Host srv194 is DOWN: PING CRITICAL - Packet loss = 100%
[20:05:42] ah, it's on the puppet wikitech page
[20:05:46] yay for up-to-date docs
[20:06:31] let me know when you are done looking because I'm going to try this
[20:06:31] ok. I'm off for a bit
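
All of the powercycles in this log go through each box's out-of-band management interface, since the hosts themselves were frozen solid. The log does not say which tool was used, so the helper below is purely illustrative: it assumes an IPMI-capable controller, a <host>.mgmt naming scheme, and a password supplied via the IPMI_PASSWORD environment variable.

#!/usr/bin/env python
"""Sketch of powercycling a hung host via its management interface.

Hypothetical: tool choice, mgmt naming scheme and credential handling
are all assumptions; the log only records that hosts were powercycled.
"""
import subprocess
import sys

def powercycle(host):
    mgmt = "%s.mgmt.pmtpa.wmnet" % host  # assumed mgmt naming scheme
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", mgmt,
        "-U", "root",
        "-E",                      # read password from IPMI_PASSWORD
        "chassis", "power", "cycle",
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        status = powercycle(host)
        print("%s: %s" % (host, "power cycle sent" if status == 0 else "FAILED"))
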
[20:06:36] will be back later
[20:06:39] go for it
[20:06:40] ok
[20:06:42] happy trails
[20:06:49] I'm going to afk as soon as I get this box happy
[20:06:55] (11 pm and I am hungry)
[20:07:04] RECOVERY - Host srv194 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[20:07:40] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time
[20:07:46] well that didn't help. rats
[20:11:07] PROBLEM - Apache HTTP on srv194 is CRITICAL: Connection refused
[20:13:06] !log srv206 won't run puppet, see syslog, clearing out the yaml file didn't help, since it's not urgent I'm leaving it for tomorrow
[20:13:09] Logged the message, Master
[20:19:09] RECOVERY - Apache HTTP on srv227 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[20:35:21] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time
[21:25:05] PROBLEM - Apache HTTP on srv206 is CRITICAL: Connection refused
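
"Clearing out the yaml file" refers to removing the puppetmaster's cached facts/node YAML for the client so the next run starts from a clean slate (the procedure is on the wikitech puppet page mentioned above). As a rough sketch only, assuming the stock cache location, which may not match this installation:

#!/usr/bin/env python
"""Sketch of clearing a client's cached YAML on the puppetmaster.

Hypothetical: the cache path is the stock puppetmaster default and may
differ here; run this on the puppetmaster, not on the client.
"""
import glob
import os
import sys

YAML_DIR = "/var/lib/puppet/yaml"  # assumed default cache location

def clear_cached_yaml(fqdn):
    removed = []
    for path in glob.glob(os.path.join(YAML_DIR, "*", fqdn + ".yaml")):
        os.remove(path)
        removed.append(path)
    return removed

if __name__ == "__main__":
    for path in clear_cached_yaml(sys.argv[1]):
        print("removed", path)

As the 20:13 !log entry notes, this did not fix srv206; the "Invalid parameter system" error points at the manifests themselves rather than stale cache.
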