[00:05:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.763 seconds [00:16:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.137 seconds [00:28:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.742 seconds [00:36:20] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: Connection timed out [00:37:47] notpeter: are you here? [00:38:35] looking at it [00:38:43] probably just going to restart search on 1012 [00:38:53] going to look at log first [00:39:13] maplebed: ^ [00:39:24] cool. [00:39:34] I was just reading http://wikitech.wikimedia.org/view/Search#What_to_do_if_you_get_a_page_about_a_search_pool [00:40:35] ganglia (network graph) shows network dropoff on search1002,3,5 - what pointed you to 1012? [00:40:57] nagios [00:41:05] and the page was re pool3 [00:41:06] but yeah [00:41:12] pool1 is in trouble as well [00:42:58] this might actually be leap-second related... [00:43:24] oh? [00:43:38] was I too quick to make a ruling? :) [00:43:40] paravoid: things started hanging right around 00:00 [00:43:51] hah [00:43:57] which machine is that? [00:44:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:44:17] so, the way they fixed the previous leap second bug is by making the clock go backwards this time [00:44:18] I just added links to the search pool configs (eg http://noc.wikimedia.org/pybal/eqiad/search_pool3) to the wikitech page. [00:44:26] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Search+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [00:44:27] not every service would be happy about that [00:44:47] at midnight, number of procs rocketed [00:44:51] that is funny. [00:45:04] … [00:45:24] I'm going to restart lucene on all nodes on one minute intervals [00:46:21] I'm going to go back to trying to figure out camera tethering software. [00:46:23] :) [00:46:27] enjoy! [00:46:30] thanks for being around, notpeter! [00:46:32] :) [00:46:32] PROBLEM - Lucene on search1011 is CRITICAL: Connection timed out [00:48:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.761 seconds [00:50:52] !log based on ganglia evidence, lucene seems to have been affected by leap second bug. restarting each instance, one minute wait in between [00:50:53] PROBLEM - Lucene on search1012 is CRITICAL: Connection timed out [00:51:02] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [00:51:03] Logged the message, notpeter [00:51:06] !log search1004 dead. powercycling. [00:51:17] Logged the message, notpeter [00:51:23] nice [00:51:38] PROBLEM - Host msfe1001 is DOWN: PING CRITICAL - Packet loss = 100%
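The plan stated at [00:45:24] and logged at [00:50:52] is a rolling restart of the Lucene backends with a one-minute pause between nodes. A minimal sketch of that kind of loop follows; the host list and the remote restart command are illustrative placeholders, not the exact commands used here.

```python
#!/usr/bin/env python3
"""Rolling-restart sketch: one search node at a time, with a pause between
nodes so the LVS pool never loses more than one backend at once.
Hostnames and the remote restart command are hypothetical placeholders."""
import subprocess
import time

SEARCH_NODES = ["search1001.eqiad.wmnet", "search1002.eqiad.wmnet"]  # etc.
PAUSE_SECONDS = 60

for host in SEARCH_NODES:
    print("restarting lucene on %s" % host)
    # Restart the search daemon on one node over SSH (command is illustrative).
    subprocess.call(["ssh", host, "sudo service lucene-search restart"])
    # One-minute wait before moving on, as in the !log entry above.
    time.sleep(PAUSE_SECONDS)
```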
[00:52:02] btw, what are the indications of an incident in search? [00:52:14] the LVS Lucene on search-pool alert? [00:52:18] I didn't get paged fwiw [00:52:20] yeah [00:52:23] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [00:56:35] PROBLEM - Lucene on search1002 is CRITICAL: Connection timed out [00:56:53] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.029 second response time on port 8123 [00:57:03] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection refused [00:57:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:59:08] RECOVERY - Host search1004 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [01:00:29] PROBLEM - Lucene on search1007 is CRITICAL: Connection refused [01:01:30] fyi, I think that we did, in fact, have some issues related to leap second: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&m=load_one&r=2hr&s=by%20name&hc=4&mc=2 [01:02:17] PROBLEM - Lucene on search1009 is CRITICAL: Connection refused [01:02:35] PROBLEM - Lucene on search1008 is CRITICAL: Connection refused [01:02:53] PROBLEM - Lucene on search1003 is CRITICAL: Connection timed out [01:03:02] RECOVERY - LVS Lucene on search-pool3.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:03:29] PROBLEM - Lucene on search1010 is CRITICAL: Connection refused [01:04:50] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection refused [01:04:50] RECOVERY - Lucene on search1010 is OK: TCP OK - 0.027 second response time on port 8123 [01:05:17] RECOVERY - Lucene on search1009 is OK: TCP OK - 0.026 second response time on port 8123 [01:07:41] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:07:50] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection refused [01:09:38] notpeter: holy crap [01:10:05] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused [01:10:05] paravoid: yeah, some of our boxes are in trouble [01:10:06] nagios didn't notice anything [01:10:15] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 9.027 second response time on port 8123 [01:10:15] in addition to search [01:10:16] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused [01:10:23] we don't often monitor individual hosts... [01:10:27] really? [01:10:54] some. varies [01:11:15] damn it [01:11:59] I have to leave for the airport in 40' or so [01:12:02] PROBLEM - Lucene on search1013 is CRITICAL: Connection refused [01:12:11] PROBLEM - Lucene on search1017 is CRITICAL: Connection refused [01:12:42] until then, I can help [01:12:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:12:47] PROBLEM - Host mw1089 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:50] but I think it'll be better if you kept track of what's going on [01:12:57] since I'll be between planes for about 24h :) [01:13:03] yes [01:13:05] iow, task me [01:13:26] I'm going to focus on search for the moment. once that's squared away, I'll start looking at other boxes [01:13:29] okay [01:13:30] as I want the page to stfu [01:13:41] fucking leap second
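The alert that pages here ("LVS Lucene on search-pool3.svc.eqiad.wmnet", and later the other pools) is a pool-level TCP probe against the LVS service address on port 8123, which fits the remark above that individual hosts are not always monitored: one wedged backend does not necessarily trip it. A rough sketch of such a probe, using the host, port and timeout that appear in the alerts; this is an illustration, not the actual Nagios TCP check.

```python
#!/usr/bin/env python3
"""Pool-level TCP probe of the kind paging above: connect to the LVS service
address on the Lucene port and report how long it took. Illustrative sketch
only; the production check is a standard Nagios TCP check."""
import socket
import time

HOST = "search-pool3.svc.eqiad.wmnet"   # service address from the alerts above
PORT = 8123
TIMEOUT = 10                            # seconds

start = time.time()
try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
        print("TCP OK - %.3f second response time on port %d"
              % (time.time() - start, PORT))
except OSError as exc:
    # Covers connection refused, timeouts and unreachable hosts alike.
    print("CRITICAL - %s" % exc)
```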
[01:13:44] want to take a look at the nfs boxes [01:13:45] ? [01:13:51] it looks like they are unhappy [01:14:02] probably why fenari is unhappy as well [01:14:45] PROBLEM - Lucene on search1019 is CRITICAL: Connection refused [01:14:45] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:14:47] looking [01:14:55] opendj [01:14:59] java too [01:15:07] (opendj is java that is) [01:15:11] PROBLEM - Lucene on search1018 is CRITICAL: Connection refused [01:15:11] PROBLEM - Lucene on search1021 is CRITICAL: Connection refused [01:15:56] ah, hrmph [01:16:26] the good news is, despite all this noise, search seems to be functioning relatively well on the site [01:16:50] RECOVERY - Lucene on search1007 is OK: TCP OK - 3.024 second response time on port 8123 [01:17:26] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 3.019 second response time on port 8123 [01:18:53] ugh, java crap [01:20:32] !log restarting opendj (nfs1/nfs2), load spike, possibly related to leap second [01:20:44] Logged the message, Master [01:21:21] [01/Jul/2012:01:21:08 +0000] category=SYNC severity=SEVERE_ERROR msgID=14942259 msg=The hostname virt1.wikimedia.org could not be resolved as an IP address [01:21:41] any idea what's up with that? virt1 doesn't resolve but it shouldn't [01:22:07] er? [01:22:09] no, not sure [01:22:15] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:23:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:26] PROBLEM - LDAP on nfs2 is CRITICAL: Connection refused [01:24:05] fucking java crap [01:24:20] PROBLEM - LDAPS on nfs2 is CRITICAL: Connection refused [01:24:49] when I tried to restart lucene, it wouldn't kill the proc. had to more aggressively kill [01:26:17] PROBLEM - Lucene on search1007 is CRITICAL: Connection timed out [01:27:38] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:28:23] PROBLEM - Lucene on search1006 is CRITICAL: Connection refused [01:28:23] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:30:29] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.009 second response time [01:30:38] PROBLEM - Lucene on search1009 is CRITICAL: Connection refused [01:30:45] Hey guys, searching seems to be broken on all the sites [01:31:04] woosters: ^ [01:31:17] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.011 second response time [01:31:18] kaldari: notpeter is looking at it [01:31:27] ah cool, thanks [01:31:43] kaldari: what wiki are you seeing that on? [01:31:51] commons and en.wiki [01:31:59] but those are the only 2 I tested [01:32:11] hhhmmm, I'm getting results from en [01:32:19] are you getting misses right now? [01:32:22] from Special:Search?
[01:32:35] from the box [01:32:47] yeah, still broken for me [01:32:49] oh, no, nvm [01:32:59] fuu [01:32:59] the article redirecting from the box is working, but not lucene searching [01:33:03] yep [01:33:18] I was only searching for real words :) [01:33:31] i never search for real words :) [01:33:34] lulz [01:34:26] this is really interfering with my ability to locate a map of both the United States and Canada :( [01:34:32] PROBLEM - Lucene on search1020 is CRITICAL: Connection refused [01:34:47] it's also interfering with my ability to continue drinking :/ [01:34:49] if anyone has such a map, please scan it, and email it to me [01:37:23] PROBLEM - Lucene on search1022 is CRITICAL: Connection refused [01:37:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [01:39:02] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:11] PROBLEM - Host search1006 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:11] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:20] PROBLEM - Host search1005 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:20] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:38] RECOVERY - Lucene on search1008 is OK: TCP OK - 0.027 second response time on port 8123 [01:39:38] PROBLEM - Host searchidx1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:49] !log problem with lucene persisting through service restart, but not node restart. restarting en pool nodes. [01:39:59] Logged the message, notpeter [01:40:14] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 207 seconds [01:40:41] RECOVERY - Host search1005 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [01:40:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 241 seconds [01:40:50] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [01:40:50] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 26.40 ms [01:41:08] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.026 second response time on port 8123 [01:41:26] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:41:35] RECOVERY - Lucene on search1003 is OK: TCP OK - 0.026 second response time on port 8123 [01:41:35] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 26.38 ms [01:41:44] RECOVERY - Host searchidx1001 is UP: PING OK - Packet loss = 0%, RTA = 26.91 ms [01:41:53] RECOVERY - Lucene on search1006 is OK: TCP OK - 9.023 second response time on port 8123 [01:42:02] RECOVERY - Host search1006 is UP: PING OK - Packet loss = 0%, RTA = 26.40 ms [01:42:38] RECOVERY - Lucene on search1007 is OK: TCP OK - 9.027 second response time on port 8123 [01:43:15] !log that worked. restarting all remaining search nodes. 
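Two details above explain why the nodes end up being rebooted: at [01:24:49] a normal restart would not kill the JVM, and at [01:39:49] the hang survives a service restart but not a node restart. A sketch of the usual escalation when a service refuses to stop cleanly; the process pattern and grace period are illustrative assumptions, not the exact commands used here.

```python
#!/usr/bin/env python3
"""Escalating stop for a stuck JVM: SIGTERM first, SIGKILL if it survives.
The pgrep/pkill pattern and the grace period are hypothetical placeholders."""
import subprocess
import time

PATTERN = "lucene-search"   # assumed match for the search JVM's command line
GRACE_SECONDS = 30

def still_running(pattern):
    # pgrep exits 0 when at least one process matches the pattern.
    return subprocess.call(["pgrep", "-f", pattern],
                           stdout=subprocess.DEVNULL) == 0

subprocess.call(["pkill", "-TERM", "-f", PATTERN])   # polite stop
time.sleep(GRACE_SECONDS)
if still_running(PATTERN):
    # A leap-second-wedged JVM may ignore SIGTERM entirely; force it.
    subprocess.call(["pkill", "-KILL", "-f", PATTERN])
```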
[01:43:25] Logged the message, notpeter [01:43:51] I suspect that's the source of the ProofreadPage errors on [01:44:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:23] grabbing search results [01:45:02] PROBLEM - Host search1022 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:02] PROBLEM - Host search1018 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:02] PROBLEM - Host search1007 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:02] PROBLEM - Host search1008 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:02] PROBLEM - Host search1012 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:03] PROBLEM - Host search1020 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:03] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:04] PROBLEM - Host search1021 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:04] PROBLEM - Host search1019 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:04] PROBLEM - Host search1017 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:05] PROBLEM - Host search1023 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:06] PROBLEM - Host search1009 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:06] PROBLEM - Host search1010 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:06] PROBLEM - Host search1014 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:07] PROBLEM - Host search1024 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:08] PROBLEM - Host search1011 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:08] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:17] fun [01:45:44] paravoid: well, something in the jvm wasn't giving up the ghost... [01:45:54] and it's not like this is a problem that I'm worried about happening again [01:46:24] although I am taking special care of the indexers [01:46:50] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 602s [01:46:50] PROBLEM - Host search1013 is DOWN: PING CRITICAL - Packet loss = 100% [01:46:53] opendj didn't get fixed with restarting it either [01:47:04] hurray for the jvm. 
[01:47:05] what magic [01:47:17] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:26] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:47:27] RECOVERY - Lucene on search1018 is OK: TCP OK - 0.026 second response time on port 8123 [01:47:27] RECOVERY - Lucene on search1013 is OK: TCP OK - 0.027 second response time on port 8123 [01:47:27] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:27] RECOVERY - Host search1009 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [01:47:27] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [01:47:35] RECOVERY - LVS Lucene on search-pool3.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:47:35] PROBLEM - Apache HTTP on mw64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:35] RECOVERY - LVS Lucene on search-pool2.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:47:36] RECOVERY - Host search1017 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [01:47:36] RECOVERY - Host search1008 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [01:47:36] RECOVERY - Host search1018 is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [01:47:36] RECOVERY - Host search1013 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [01:47:37] RECOVERY - Host search1019 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [01:47:44] RECOVERY - Lucene on search1011 is OK: TCP OK - 0.027 second response time on port 8123 [01:47:44] RECOVERY - Lucene on search1019 is OK: TCP OK - 0.029 second response time on port 8123 [01:47:44] RECOVERY - Host search1007 is UP: PING OK - Packet loss = 0%, RTA = 26.37 ms [01:47:44] RECOVERY - Host search1023 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:47:44] RECOVERY - Host search1020 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms [01:47:53] RECOVERY - Lucene on search1020 is OK: TCP OK - 0.027 second response time on port 8123 [01:47:53] RECOVERY - Host search1011 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [01:48:02] RECOVERY - Host search1010 is UP: PING OK - Packet loss = 0%, RTA = 26.94 ms [01:48:02] RECOVERY - Host search1014 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [01:48:11] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.028 second response time on port 8123 [01:48:11] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:11] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:11] PROBLEM - Apache HTTP on srv295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:11] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:12] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:20] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:20] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:20] RECOVERY - Host search1024 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [01:48:29] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:38] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:38] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:38] RECOVERY - Lucene on search1009 is OK: TCP OK - 0.026 
second response time on port 8123 [01:48:38] PROBLEM - Apache HTTP on mw62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:38] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:39] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.39 ms [01:48:39] RECOVERY - Host search1012 is UP: PING OK - Packet loss = 0%, RTA = 26.73 ms [01:48:47] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.026 second response time on port 8123 [01:48:47] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:47] PROBLEM - Apache HTTP on srv291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:47] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:56] RECOVERY - Lucene on search1012 is OK: TCP OK - 0.026 second response time on port 8123 [01:49:05] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:15] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:49:15] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:23] RECOVERY - Lucene on search1021 is OK: TCP OK - 0.027 second response time on port 8123 [01:49:23] RECOVERY - Lucene on search1022 is OK: TCP OK - 0.027 second response time on port 8123 [01:49:32] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:32] RECOVERY - Host search1021 is UP: PING OK - Packet loss = 0%, RTA = 26.82 ms [01:49:32] RECOVERY - Host search1022 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [01:49:50] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.026 second response time on port 8123 [01:49:50] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:50] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:38] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:51:56] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:56] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:32] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Sun Jul 1 01:52:21 UTC 2012 [01:52:50] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 43s [01:53:35] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 13 seconds [01:54:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:56:53] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection timed out [02:00:47] PROBLEM - Lucene on search1017 is CRITICAL: Connection timed out [02:01:23] PROBLEM - Lucene on search1018 is CRITICAL: Connection timed out [02:02:26] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:03:11] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.277 second response time [02:07:41] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:11] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.624 second response time [02:09:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:47] RECOVERY - LVS HTTP IPv4 on 
api.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2241 bytes in 8.139 seconds [02:10:32] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [02:10:32] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.784 second response time [02:10:41] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [02:10:41] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.553 second response time [02:10:41] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.938 second response time [02:10:41] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.372 second response time [02:10:50] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [02:10:59] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [02:10:59] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [02:10:59] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [02:10:59] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [02:10:59] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.026 second response time on port 8123 [02:11:08] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [02:11:08] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [02:11:08] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [02:11:08] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [02:11:17] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [02:11:17] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [02:11:17] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [02:11:26] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [02:11:26] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [02:11:35] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [02:11:35] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [02:11:36] RECOVERY - Lucene on search1018 is OK: TCP OK - 0.027 second response time on port 8123 [02:11:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [02:12:02] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [02:12:02] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [02:22:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [02:23:44] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [02:23:44] PROBLEM - Puppet freshness on mw1102 is 
CRITICAL: Puppet has not run in the last 10 hours [02:27:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:47] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [02:37:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [02:42:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.398 seconds [02:57:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [02:58:50] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [03:00:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.767 seconds [03:26:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:37:41] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [03:53:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds [04:01:55] !log rebooting virt1000 [04:02:05] Logged the message, Master [04:04:23] PROBLEM - Host virt1000 is DOWN: PING CRITICAL - Packet loss = 100% [04:05:44] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [04:06:08] !log virt1000 is back up, rebooting virt0 [04:06:18] Logged the message, Master [04:10:05] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [04:10:05] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [04:10:41] PROBLEM - SSH on virt0 is CRITICAL: Connection refused [04:10:41] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [04:14:26] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.003 second response time on port 636 [04:14:44] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000 [04:15:11] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:15:11] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.013 second response time on port 389 [04:20:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:47] PROBLEM - Host mw1011 is DOWN: PING CRITICAL - Packet loss = 100% [04:46:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.936 seconds [04:49:41] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [05:00:47] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [05:10:50] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:16:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:47] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [05:41:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.183 seconds [05:43:05] PROBLEM - Host gilman is DOWN: CRITICAL - Host Unreachable (208.80.152.176) [06:03:29] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp 
version 2 [06:04:50] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [06:10:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:26:44] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [06:26:44] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [06:28:50] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [06:29:44] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [06:30:47] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [06:30:47] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [06:33:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:33:47] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [06:33:47] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [06:33:47] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [06:34:41] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [06:34:41] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [06:35:44] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [06:36:38] PROBLEM - Host db1009 is DOWN: PING CRITICAL - Packet loss = 100% [06:36:47] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [06:36:47] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [06:36:47] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [06:37:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [06:37:50] PROBLEM - Puppet freshness on search35 is CRITICAL: Puppet has not run in the last 10 hours [06:38:44] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [06:40:50] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [06:41:44] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [06:42:47] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [06:42:47] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [06:44:06] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [06:44:44] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [06:45:47] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [06:45:47] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [06:47:44] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [06:49:32] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [06:49:32] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [06:49:32] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [06:50:35] PROBLEM - 
Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [06:51:38] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [06:55:41] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [07:06:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:15:38] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [07:30:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.168 seconds [07:41:08] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:42:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:53:00] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [07:55:50] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:58:41] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [08:00:03] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [08:01:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:12:11] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [08:13:32] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [08:22:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.045 seconds [08:26:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:29:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [08:37:05] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:40:32] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [08:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:07:32] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [09:11:35] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [09:18:47] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:21:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.446 seconds [09:30:11] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:36:20] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:36:56] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:41] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.006 second response time [09:38:17] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.005 second 
response time [09:40:41] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:52:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:17:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [10:21:02] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:45:12] PROBLEM - Host mw1050 is DOWN: PING CRITICAL - Packet loss = 100% [10:45:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:55:05] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: Device does not support ifTable - try without -I option [10:56:26] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [11:12:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds [11:12:29] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:14:35] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused [11:33:47] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [11:41:08] PROBLEM - Host mw1043 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:49:32] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [11:53:44] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [11:57:02] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:58:23] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [12:08:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.636 seconds [12:24:38] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [12:24:38] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [12:34:32] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [12:37:32] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [12:37:41] PROBLEM - Host mw1128 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [12:49:59] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [12:54:28] !log also going to reboot all pmtpa search nodes. not in prod, but are still freaking out from leap second bug. 
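Twelve hours in, the pmtpa search nodes are still affected and get rebooted as well. For reference, the workaround that circulated widely for the 2012 leap-second hang (the log above shows restarts and reboots instead, so this is background, not what was done here) was simply to set the clock to its current value, which resets the kernel timekeeping state that left Java and MySQL threads busy-looping:

```python
#!/usr/bin/env python3
"""Widely circulated 2012 leap-second workaround, shown for reference only:
re-setting the wall clock to its current value clears the kernel state that
left processes spinning after the leap second. Must be run as root."""
import subprocess

# Read the current time, then write it straight back with date -s.
now = subprocess.check_output(["date"]).decode().strip()
subprocess.check_call(["date", "-s", now])
print("clock re-set to:", now)
```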
[12:54:39] Logged the message, notpeter [12:57:20] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [12:58:14] PROBLEM - Host search20 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:23] PROBLEM - Host search14 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:23] PROBLEM - Host search13 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:32] PROBLEM - Host search30 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:32] PROBLEM - Host search25 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:32] PROBLEM - Host search18 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:33] PROBLEM - Host search28 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:33] PROBLEM - Host search27 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:33] PROBLEM - Host search16 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:41] PROBLEM - Host search22 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:50] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:50] PROBLEM - Host search19 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:50] PROBLEM - Host search24 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:59] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [12:58:59] RECOVERY - Host search14 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [12:58:59] RECOVERY - Host search16 is UP: PING OK - Packet loss = 0%, RTA = 2.01 ms [12:58:59] PROBLEM - Host search34 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:59] PROBLEM - Host search26 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:00] PROBLEM - Host search36 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:00] PROBLEM - Host search31 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:08] RECOVERY - Host search20 is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [12:59:08] RECOVERY - Host search13 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [12:59:08] PROBLEM - Host search29 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:17] RECOVERY - Host search18 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [12:59:26] PROBLEM - Host search35 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:26] PROBLEM - Host search21 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:02] PROBLEM - Host search33 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:02] RECOVERY - Host search19 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [13:00:11] RECOVERY - Host search24 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:00:20] RECOVERY - Host search21 is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [13:00:29] RECOVERY - Host search28 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [13:00:29] RECOVERY - Host search29 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [13:00:29] RECOVERY - Host search31 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [13:00:29] RECOVERY - Host search27 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:00:38] RECOVERY - Host search26 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [13:00:38] RECOVERY - Host search25 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [13:00:38] RECOVERY - Host search22 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:00:47] RECOVERY - Host search30 is UP: PING OK - Packet loss = 0%, RTA = 2.29 ms [13:00:47] RECOVERY - Host search34 is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [13:00:56] RECOVERY - Host search33 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [13:01:23] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:01:32] 
RECOVERY - Host search36 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [13:03:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [13:34:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:35] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [13:45:56] PROBLEM - Host mw1020 is DOWN: PING CRITICAL - Packet loss = 100% [13:52:32] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [13:55:23] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [14:00:11] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [14:01:32] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [14:02:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [14:26:23] r0csteady: Hi! I'm Asheesh again. [14:30:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:05] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [14:32:17] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [14:39:29] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:46:36] http://scribunto.wmflabs.org/index.php/Special:RecentChanges [14:46:52] we got quite a bit of spam there ^ [14:48:48] lol [14:48:54] there's no admins or crats on the wiki [14:49:35] It's beyond overrun [14:50:35] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [14:51:08] Need Tim or Patrick I guess [14:51:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.525 seconds [14:56:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds [14:58:05] PROBLEM - Host db1048 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:32] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [15:09:11] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:11:35] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:22:32] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [15:23:53] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:25:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:29] PROBLEM - Host mw1115 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [16:12:38] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:20:26] PROBLEM - 
Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:59] PROBLEM - Host mw1047 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:25:51] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [16:27:38] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [16:27:38] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [16:29:35] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [16:30:38] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [16:31:32] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [16:31:32] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [16:34:20] Tim-away: when you get up tomorrow morning, can you please look at the scribunto spam ? [16:34:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:34:32] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [16:34:32] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [16:34:32] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [16:34:37] And i have no backups, so please don't delete all the content :D [16:35:35] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [16:35:35] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [16:36:38] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [16:37:32] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [16:37:32] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [16:37:33] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [16:39:38] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [16:41:35] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [16:42:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.550 seconds [16:42:38] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [16:43:32] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [16:43:32] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [16:44:35] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [16:45:38] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [16:46:41] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [16:46:41] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [16:47:08] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:48:38] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [16:49:50] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, 
down: 0, shutdown: 0 [16:50:35] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [16:50:35] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [16:50:35] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [16:51:38] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [16:52:32] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [16:56:35] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [16:57:50] hey, i am having trouble reaching office.wikimedia.org, is this already a known issue? [17:02:56] it loads for me [17:03:42] platonides: and also all images / jss / css files? [17:05:40] I don't have a login there, but apparently looking at the from page yes [17:06:27] okay, thanks, my other browsing is normal but office.wikimedia.org is doing weird [17:07:42] ohai paulprot1us! [17:08:07] Asheesh! [17:16:33] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [17:17:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:58] office works fine for me [17:36:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.810 seconds [18:07:23] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:08:44] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [18:12:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:04] Asheesh: I am working on the missions on openhatch.org! 
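A recurring pattern in this log is the "Puppetmaster HTTPS on stafford" check flapping between CRITICAL (socket timeout after 10 seconds) and OK, where the OK lines carry an HTTP 400: as configured here, any response inside the deadline counts as OK, so the check is effectively measuring whether the overloaded puppetmaster answers at all. A rough sketch of a probe with those semantics, assuming a placeholder URL; the real check is the standard Nagios HTTP plugin.

```python
#!/usr/bin/env python3
"""Timeout-style HTTP probe: any HTTP response within the deadline is OK, so a
400 from the puppetmaster still 'recovers' the check. Illustrative only; the
URL is a placeholder and this is not the production Nagios plugin."""
import time
import urllib.error
import urllib.request

URL = "https://puppetmaster.example:8140/"   # hypothetical puppetmaster URL
TIMEOUT = 10                                 # seconds, matching the log

start = time.time()
try:
    urllib.request.urlopen(URL, timeout=TIMEOUT)
    status = "OK"
except urllib.error.HTTPError:
    status = "OK"          # got an HTTP response (even a 400) before the deadline
except Exception:
    status = "CRITICAL"    # socket timeout, refused connection, TLS failure, ...
print("%s in %.3f seconds" % (status, time.time() - start))
```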
[18:34:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.125 seconds [18:38:44] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:38:44] PROBLEM - Host db1027 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:35] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [18:43:05] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [19:05:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:35] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [19:10:14] PROBLEM - Host mw1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:38] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [19:26:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.136 seconds [20:00:56] PROBLEM - Host mw1048 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:47] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:05] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:08] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.009 second response time [20:20:26] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.017 second response time [20:26:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [20:51:35] New patchset: Liangent; "Add a symbolic link to CREDITS for Change Ia02c3bcf." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13847 [20:54:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:45] @replag [20:58:47] Krinkle: [s1] db36: 2s, db32: 29s, db59: 2s, db60: 1s, db12: 2s [21:02:08] PROBLEM - Host mw1082 is DOWN: PING CRITICAL - Packet loss = 100% [21:02:17] PROBLEM - Host mw1087 is DOWN: PING CRITICAL - Packet loss = 100% [21:19:05] @replag [21:19:07] Krinkle: [s1] db36: 3s, db32: 31s, db59: 3s, db60: 3s, db12: 3s [21:20:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.726 seconds [21:22:14] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:31:23] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:40:59] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:53:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:05:17] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 7, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [22:06:47] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [22:14:17] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:14:44] PROBLEM - Host mw1123 is DOWN: PING CRITICAL - Packet loss = 100% [22:17:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [22:17:17] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:24:11] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:25:32] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [22:25:32] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [22:35:35] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [22:38:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [22:46:14] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:46:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:08] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [22:48:29] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [22:51:56] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [22:52:41] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:17] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [22:53:26] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [22:54:02] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:54:02] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.008 second response time [22:54:47] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.007 second response time [23:11:53] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:12:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [23:27:47] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:39:38] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [23:40:41] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:36] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:59] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 213 seconds [23:55:59] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 213 seconds [23:55:59] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 214 seconds [23:56:08] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 224 seconds [23:56:26] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 239 seconds [23:56:44] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 260 seconds [23:56:53] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 267 seconds [23:56:53] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 269 seconds [23:57:02] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 275 seconds [23:57:11] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 283 seconds [23:57:11] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 285 seconds [23:57:11] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 285 seconds [23:57:11] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 288 seconds [23:57:20] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 294 seconds [23:58:42] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 27 seconds [23:58:50] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 11 seconds [23:59:08] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 194 seconds [23:59:26] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 222 seconds [23:59:53] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 4 seconds [23:59:53] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 24 seconds [23:59:53] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 246 seconds
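The log closes with a burst of MySQL replication-delay alerts; these checks, like the @replag bot earlier in the log, report the replica's Seconds_Behind_Master. A minimal sketch of reading that value, assuming the PyMySQL driver and placeholder connection details:

```python
#!/usr/bin/env python3
"""Read MySQL replication lag (Seconds_Behind_Master), the figure behind the
'MySQL Slave Delay' checks and the @replag output above.
Host and credentials are placeholders."""
import pymysql

conn = pymysql.connect(host="db36.example", user="monitor", password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

# None means replication is not configured or not running on this host.
lag = status["Seconds_Behind_Master"] if status else None
print("replication delay: %s seconds" % lag)
```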