[00:05:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.763 seconds [00:16:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.137 seconds [00:28:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.742 seconds [00:36:20] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: Connection timed out [00:37:47] notpeter: are you here? [00:38:35] looking at it [00:38:43] probably just going to restart search on 1012 [00:38:53] going to look at log first [00:39:13] maplebed: ^ [00:39:24] cool. [00:39:34] I was just reading http://wikitech.wikimedia.org/view/Search#What_to_do_if_you_get_a_page_about_a_search_pool [00:40:35] ganglia (network graph) shows network dropoff on search1002,3,5 - what pointed you to 1012? [00:40:57] nagios [00:41:05] and the page was re pool3 [00:41:06] but yeah [00:41:12] pool1 is in trouble as well [00:42:58] this might actually be leap-second related... [00:43:24] oh? [00:43:38] was I too quick to make a ruling? :) [00:43:40] paravoid: things started hanging right around 00:00 [00:43:51] hah [00:43:57] which machine is that? [00:44:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:44:17] so, the way they fixed the previous leap second bug is by making the clock go backwards this time [00:44:18] I just added links to the search pool configs (eg http://noc.wikimedia.org/pybal/eqiad/search_pool3) to the wikitech page. [00:44:26] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Search+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [00:44:27] not every service would be happy about that [00:44:47] at midnight, number of procs rocketed [00:44:51] that is funny. [00:45:04] … [00:45:24] I'm going to restart lucene on all nodes on one minute intervals [00:46:21] I'm going to go back to trying to figure out camera tethering software. [00:46:23] :) [00:46:27] enjoy! [00:46:30] thanks for being around, notpeter! [00:46:32] :) [00:46:32] PROBLEM - Lucene on search1011 is CRITICAL: Connection timed out [00:48:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.761 seconds [00:50:52] !log based on ganglia evidence, lucene seems to have been affected by leap second bug. restarting each instance, one minute wait in between [00:50:53] PROBLEM - Lucene on search1012 is CRITICAL: Connection timed out [00:51:02] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [00:51:03] Logged the message, notpeter [00:51:06] !log search1004 dead. powercycling. [00:51:17] Logged the message, notpeter [00:51:23] nice [00:51:38] PROBLEM - Host msfe1001 is DOWN: PING CRITICAL - Packet loss = 100%
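The plan stated at [00:45:24] and logged at [00:50:52] is a rolling restart of the Lucene backends with a one-minute pause between nodes. A minimal sketch of that kind of loop follows; the host list and the remote restart command are illustrative placeholders, not the exact commands used here.

```python
#!/usr/bin/env python3
"""Rolling-restart sketch: one search node at a time, with a pause between
nodes so the LVS pool never loses more than one backend at once.
Hostnames and the remote restart command are hypothetical placeholders."""
import subprocess
import time

SEARCH_NODES = ["search1001.eqiad.wmnet", "search1002.eqiad.wmnet"]  # etc.
PAUSE_SECONDS = 60

for host in SEARCH_NODES:
    print("restarting lucene on %s" % host)
    # Restart the search daemon on one node over SSH (command is illustrative).
    subprocess.call(["ssh", host, "sudo service lucene-search restart"])
    # One-minute wait before moving on, as in the !log entry above.
    time.sleep(PAUSE_SECONDS)
```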
[00:52:02] btw, what are the indications of an incident in search? [00:52:14] the LVS Lucene on search-pool alert? [00:52:18] I didn't get paged fwiw [00:52:20] yeah [00:52:23] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [00:56:35] PROBLEM - Lucene on search1002 is CRITICAL: Connection timed out [00:56:53] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.029 second response time on port 8123 [00:57:03] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection refused [00:57:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:59:08] RECOVERY - Host search1004 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [01:00:29] PROBLEM - Lucene on search1007 is CRITICAL: Connection refused [01:01:30] fyi, I think that we did, in fact, have some issues related to leap second: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&m=load_one&r=2hr&s=by%20name&hc=4&mc=2 [01:02:17] PROBLEM - Lucene on search1009 is CRITICAL: Connection refused [01:02:35] PROBLEM - Lucene on search1008 is CRITICAL: Connection refused [01:02:53] PROBLEM - Lucene on search1003 is CRITICAL: Connection timed out [01:03:02] RECOVERY - LVS Lucene on search-pool3.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:03:29] PROBLEM - Lucene on search1010 is CRITICAL: Connection refused [01:04:50] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection refused [01:04:50] RECOVERY - Lucene on search1010 is OK: TCP OK - 0.027 second response time on port 8123 [01:05:17] RECOVERY - Lucene on search1009 is OK: TCP OK - 0.026 second response time on port 8123 [01:07:41] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:07:50] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection refused [01:09:38] notpeter: holy crap [01:10:05] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused [01:10:05] paravoid: yeah, some of our boxes are in trouble [01:10:06] nagios didn't notice anything [01:10:15] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 9.027 second response time on port 8123 [01:10:15] in addition to search [01:10:16] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused [01:10:23] we don't often monitor individual hosts... [01:10:27] really? [01:10:54] some. varies [01:11:15] damn it [01:11:59] I have to leave for the airport in 40' or so [01:12:02] PROBLEM - Lucene on search1013 is CRITICAL: Connection refused [01:12:11] PROBLEM - Lucene on search1017 is CRITICAL: Connection refused [01:12:42] until then, I can help [01:12:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:12:47] PROBLEM - Host mw1089 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:50] but I think it'll be better if you kept track of what's going on [01:12:57] since I'll be between planes for about 24h :) [01:13:03] yes [01:13:05] iow, task me [01:13:26] I'm going to focus on search for the moment. once that's squared away, I'll start looking at other boxes [01:13:29] okay [01:13:30] as I want the page to stfu [01:13:41] fucking leap second
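The alert that pages here ("LVS Lucene on search-pool3.svc.eqiad.wmnet", and later the other pools) is a pool-level TCP probe against the LVS service address on port 8123, which fits the remark above that individual hosts are not always monitored: one wedged backend does not necessarily trip it. A rough sketch of such a probe, using the host, port and timeout that appear in the alerts; this is an illustration, not the actual Nagios TCP check.

```python
#!/usr/bin/env python3
"""Pool-level TCP probe of the kind paging above: connect to the LVS service
address on the Lucene port and report how long it took. Illustrative sketch
only; the production check is a standard Nagios TCP check."""
import socket
import time

HOST = "search-pool3.svc.eqiad.wmnet"   # service address from the alerts above
PORT = 8123
TIMEOUT = 10                            # seconds

start = time.time()
try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
        print("TCP OK - %.3f second response time on port %d"
              % (time.time() - start, PORT))
except OSError as exc:
    # Covers connection refused, timeouts and unreachable hosts alike.
    print("CRITICAL - %s" % exc)
```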
[01:13:44] want to take a look at the nfs boxes [01:13:45] ? [01:13:51] it looks like they are unhappy [01:14:02] probably why fenari is unhappy as well [01:14:45] PROBLEM - Lucene on search1019 is CRITICAL: Connection refused [01:14:45] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:14:47] looking [01:14:55] opendj [01:14:59] java too [01:15:07] (opendj is java that is) [01:15:11] PROBLEM - Lucene on search1018 is CRITICAL: Connection refused [01:15:11] PROBLEM - Lucene on search1021 is CRITICAL: Connection refused [01:15:56] ah, hrmph [01:16:26] the good news is, despite all this noise, search seems to be functioning relatively well on the site [01:16:50] RECOVERY - Lucene on search1007 is OK: TCP OK - 3.024 second response time on port 8123 [01:17:26] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 3.019 second response time on port 8123 [01:18:53] ugh, java crap [01:20:32] !log restarting opendj (nfs1/nfs2), load spike, possibly related to leap second [01:20:44] Logged the message, Master [01:21:21] [01/Jul/2012:01:21:08 +0000] category=SYNC severity=SEVERE_ERROR msgID=14942259 msg=The hostname virt1.wikimedia.org could not be resolved as an IP address [01:21:41] any idea what's up with that? virt1 doesn't resolve but it shouldn't [01:22:07] er? [01:22:09] no, not sure [01:22:15] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:23:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:26] PROBLEM - LDAP on nfs2 is CRITICAL: Connection refused [01:24:05] fucking java crap [01:24:20] PROBLEM - LDAPS on nfs2 is CRITICAL: Connection refused [01:24:49] when I tried to restart lucene, it wouldn't kill the proc. had to more aggressively kill [01:26:17] PROBLEM - Lucene on search1007 is CRITICAL: Connection timed out [01:27:38] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:28:23] PROBLEM - Lucene on search1006 is CRITICAL: Connection refused [01:28:23] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:30:29] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.009 second response time [01:30:38] PROBLEM - Lucene on search1009 is CRITICAL: Connection refused [01:30:45] Hey guys, searching seems to be broken on all the sites [01:31:04] woosters: ^ [01:31:17] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.011 second response time [01:31:18] kaldari: notpeter is looking at it [01:31:27] ah cool, thanks [01:31:43] kaldari: what wiki are you seeing that on? [01:31:51] commons and en.wiki [01:31:59] but those are the only 2 I tested [01:32:11] hhhmmm, I'm getting results from en [01:32:19] are you getting misses right now? [01:32:22] from Special:Search?
[01:32:35] from the box [01:32:47] yeah, still broken for me [01:32:49] oh, no, nvm [01:32:59] fuu [01:32:59] the article redirecting from the box is working, but not lucene searching [01:33:03] yep [01:33:18] I was only searching for real words :) [01:33:31] i never search for real words :) [01:33:34] lulz [01:34:26] this is really interfering with my ability to locate a map of both the United States and Canada :( [01:34:32] PROBLEM - Lucene on search1020 is CRITICAL: Connection refused [01:34:47] it's also interfering with my ability to continue drinking :/ [01:34:49] if anyone has such a map, please scan it, and email it to me [01:37:23] PROBLEM - Lucene on search1022 is CRITICAL: Connection refused [01:37:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [01:39:02] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:11] PROBLEM - Host search1006 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:11] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:20] PROBLEM - Host search1005 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:20] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:38] RECOVERY - Lucene on search1008 is OK: TCP OK - 0.027 second response time on port 8123 [01:39:38] PROBLEM - Host searchidx1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:39:49] !log problem with lucene persisting through service restart, but not node restart. restarting en pool nodes. [01:39:59] Logged the message, notpeter [01:40:14] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 207 seconds [01:40:41] RECOVERY - Host search1005 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [01:40:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 241 seconds [01:40:50] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [01:40:50] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 26.40 ms [01:41:08] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.026 second response time on port 8123 [01:41:26] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:41:35] RECOVERY - Lucene on search1003 is OK: TCP OK - 0.026 second response time on port 8123 [01:41:35] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 26.38 ms [01:41:44] RECOVERY - Host searchidx1001 is UP: PING OK - Packet loss = 0%, RTA = 26.91 ms [01:41:53] RECOVERY - Lucene on search1006 is OK: TCP OK - 9.023 second response time on port 8123 [01:42:02] RECOVERY - Host search1006 is UP: PING OK - Packet loss = 0%, RTA = 26.40 ms [01:42:38] RECOVERY - Lucene on search1007 is OK: TCP OK - 9.027 second response time on port 8123 [01:43:15] !log that worked. restarting all remaining search nodes. 
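Two details above explain why the nodes end up being rebooted: at [01:24:49] a normal restart would not kill the JVM, and at [01:39:49] the hang survives a service restart but not a node restart. A sketch of the usual escalation when a service refuses to stop cleanly; the process pattern and grace period are illustrative assumptions, not the exact commands used here.

```python
#!/usr/bin/env python3
"""Escalating stop for a stuck JVM: SIGTERM first, SIGKILL if it survives.
The pgrep/pkill pattern and the grace period are hypothetical placeholders."""
import subprocess
import time

PATTERN = "lucene-search"   # assumed match for the search JVM's command line
GRACE_SECONDS = 30

def still_running(pattern):
    # pgrep exits 0 when at least one process matches the pattern.
    return subprocess.call(["pgrep", "-f", pattern],
                           stdout=subprocess.DEVNULL) == 0

subprocess.call(["pkill", "-TERM", "-f", PATTERN])   # polite stop
time.sleep(GRACE_SECONDS)
if still_running(PATTERN):
    # A leap-second-wedged JVM may ignore SIGTERM entirely; force it.
    subprocess.call(["pkill", "-KILL", "-f", PATTERN])
```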
[01:43:25] Logged the message, notpeter [01:43:51] I suspect that's the source of the ProofreadPage errors on [01:44:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:23] grabbing search results [01:45:02] PROBLEM - Host search1022 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:02] PROBLEM - Host search1018 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:02] PROBLEM - Host search1007 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:02] PROBLEM - Host search1008 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:02] PROBLEM - Host search1012 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:03] PROBLEM - Host search1020 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:03] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:04] PROBLEM - Host search1021 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:04] PROBLEM - Host search1019 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:04] PROBLEM - Host search1017 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:05] PROBLEM - Host search1023 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:06] PROBLEM - Host search1009 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:06] PROBLEM - Host search1010 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:06] PROBLEM - Host search1014 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:07] PROBLEM - Host search1024 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:08] PROBLEM - Host search1011 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:08] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:17] fun [01:45:44] paravoid: well, something in the jvm wasn't giving up the ghost... [01:45:54] and it's not like this is a problem that I'm worried about happening again [01:46:24] although I am taking special care of the indexers [01:46:50] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 602s [01:46:50] PROBLEM - Host search1013 is DOWN: PING CRITICAL - Packet loss = 100% [01:46:53] opendj didn't get fixed with restarting it either [01:47:04] hurray for the jvm. 
[01:47:05] what magic [01:47:17] PROBLEM - Apache HTTP on mw72 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:26] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:47:27] RECOVERY - Lucene on search1018 is OK: TCP OK - 0.026 second response time on port 8123 [01:47:27] RECOVERY - Lucene on search1013 is OK: TCP OK - 0.027 second response time on port 8123 [01:47:27] PROBLEM - Apache HTTP on mw71 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:27] RECOVERY - Host search1009 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [01:47:27] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [01:47:35] RECOVERY - LVS Lucene on search-pool3.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:47:35] PROBLEM - Apache HTTP on mw64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:35] RECOVERY - LVS Lucene on search-pool2.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:47:36] RECOVERY - Host search1017 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [01:47:36] RECOVERY - Host search1008 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [01:47:36] RECOVERY - Host search1018 is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [01:47:36] RECOVERY - Host search1013 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [01:47:37] RECOVERY - Host search1019 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [01:47:44] RECOVERY - Lucene on search1011 is OK: TCP OK - 0.027 second response time on port 8123 [01:47:44] RECOVERY - Lucene on search1019 is OK: TCP OK - 0.029 second response time on port 8123 [01:47:44] RECOVERY - Host search1007 is UP: PING OK - Packet loss = 0%, RTA = 26.37 ms [01:47:44] RECOVERY - Host search1023 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [01:47:44] RECOVERY - Host search1020 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms [01:47:53] RECOVERY - Lucene on search1020 is OK: TCP OK - 0.027 second response time on port 8123 [01:47:53] RECOVERY - Host search1011 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [01:48:02] RECOVERY - Host search1010 is UP: PING OK - Packet loss = 0%, RTA = 26.94 ms [01:48:02] RECOVERY - Host search1014 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [01:48:11] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.028 second response time on port 8123 [01:48:11] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:11] PROBLEM - Apache HTTP on mw68 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:11] PROBLEM - Apache HTTP on srv295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:11] PROBLEM - Apache HTTP on mw73 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:12] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:20] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:20] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:20] RECOVERY - Host search1024 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [01:48:29] PROBLEM - Apache HTTP on mw67 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:38] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:38] PROBLEM - Apache HTTP on mw70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:38] RECOVERY - Lucene on search1009 is OK: TCP OK - 0.026 
second response time on port 8123 [01:48:38] PROBLEM - Apache HTTP on mw62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:38] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:39] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.39 ms [01:48:39] RECOVERY - Host search1012 is UP: PING OK - Packet loss = 0%, RTA = 26.73 ms [01:48:47] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.026 second response time on port 8123 [01:48:47] PROBLEM - Apache HTTP on srv299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:47] PROBLEM - Apache HTTP on srv291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:47] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:56] RECOVERY - Lucene on search1012 is OK: TCP OK - 0.026 second response time on port 8123 [01:49:05] PROBLEM - Apache HTTP on mw63 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:15] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [01:49:15] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:23] RECOVERY - Lucene on search1021 is OK: TCP OK - 0.027 second response time on port 8123 [01:49:23] RECOVERY - Lucene on search1022 is OK: TCP OK - 0.027 second response time on port 8123 [01:49:32] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:32] RECOVERY - Host search1021 is UP: PING OK - Packet loss = 0%, RTA = 26.82 ms [01:49:32] RECOVERY - Host search1022 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [01:49:50] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.026 second response time on port 8123 [01:49:50] PROBLEM - Apache HTTP on mw69 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:50] PROBLEM - Apache HTTP on mw66 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:38] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:51:56] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:56] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:32] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Sun Jul 1 01:52:21 UTC 2012 [01:52:50] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 43s [01:53:35] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 13 seconds [01:54:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:56:53] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection timed out [02:00:47] PROBLEM - Lucene on search1017 is CRITICAL: Connection timed out [02:01:23] PROBLEM - Lucene on search1018 is CRITICAL: Connection timed out [02:02:26] PROBLEM - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:03:11] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.277 second response time [02:07:41] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:11] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.624 second response time [02:09:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:47] RECOVERY - LVS HTTP IPv4 on 
api.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2241 bytes in 8.139 seconds [02:10:32] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [02:10:32] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.784 second response time [02:10:41] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [02:10:41] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.553 second response time [02:10:41] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.938 second response time [02:10:41] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.372 second response time [02:10:50] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [02:10:59] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [02:10:59] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [02:10:59] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [02:10:59] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [02:10:59] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.026 second response time on port 8123 [02:11:08] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [02:11:08] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [02:11:08] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [02:11:08] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [02:11:17] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [02:11:17] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [02:11:17] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [02:11:26] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [02:11:26] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [02:11:35] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [02:11:35] RECOVERY - LVS Lucene on search-prefix.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [02:11:36] RECOVERY - Lucene on search1018 is OK: TCP OK - 0.027 second response time on port 8123 [02:11:53] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [02:12:02] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [02:12:02] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [02:22:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [02:23:44] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [02:23:44] PROBLEM - Puppet freshness on mw1102 is 
CRITICAL: Puppet has not run in the last 10 hours [02:27:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:47] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [02:37:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [02:42:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.398 seconds [02:57:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [02:58:50] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [03:00:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.767 seconds [03:26:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:37:41] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [03:53:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds [04:01:55] !log rebooting virt1000 [04:02:05] Logged the message, Master [04:04:23] PROBLEM - Host virt1000 is DOWN: PING CRITICAL - Packet loss = 100% [04:05:44] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [04:06:08] !log virt1000 is back up, rebooting virt0 [04:06:18] Logged the message, Master [04:10:05] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [04:10:05] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [04:10:41] PROBLEM - SSH on virt0 is CRITICAL: Connection refused [04:10:41] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [04:14:26] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.003 second response time on port 636 [04:14:44] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000 [04:15:11] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:15:11] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.013 second response time on port 389 [04:20:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:47] PROBLEM - Host mw1011 is DOWN: PING CRITICAL - Packet loss = 100% [04:46:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.936 seconds [04:49:41] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [05:00:47] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [05:10:50] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:16:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:47] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [05:41:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.183 seconds [05:43:05] PROBLEM - Host gilman is DOWN: CRITICAL - Host Unreachable (208.80.152.176) [06:03:29] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp 
version 2 [06:04:50] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [06:10:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:26:44] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [06:26:44] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [06:28:50] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [06:29:44] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [06:30:47] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [06:30:47] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [06:33:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:33:47] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [06:33:47] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [06:33:47] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [06:34:41] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [06:34:41] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [06:35:44] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [06:36:38] PROBLEM - Host db1009 is DOWN: PING CRITICAL - Packet loss = 100% [06:36:47] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [06:36:47] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [06:36:47] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [06:37:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [06:37:50] PROBLEM - Puppet freshness on search35 is CRITICAL: Puppet has not run in the last 10 hours [06:38:44] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [06:40:50] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [06:41:44] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [06:42:47] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [06:42:47] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [06:44:06] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [06:44:44] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [06:45:47] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [06:45:47] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [06:47:44] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [06:49:32] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [06:49:32] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [06:49:32] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [06:50:35] PROBLEM - 
Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [06:51:38] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [06:55:41] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [07:06:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:15:38] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [07:30:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.168 seconds [07:41:08] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:42:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:53:00] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [07:55:50] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:58:41] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [08:00:03] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [08:01:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:12:11] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [08:13:32] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [08:22:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.045 seconds [08:26:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:29:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [08:37:05] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:40:32] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [08:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:07:32] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [09:11:35] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [09:18:47] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:21:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.446 seconds [09:30:11] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:36:20] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:36:56] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:41] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.006 second response time [09:38:17] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.005 second 
response time [09:40:41] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:52:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:17:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [10:21:02] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:45:12] PROBLEM - Host mw1050 is DOWN: PING CRITICAL - Packet loss = 100% [10:45:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:55:05] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: Device does not support ifTable - try without -I option [10:56:26] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [11:12:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds [11:12:29] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:14:35] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused [11:33:47] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [11:41:08] PROBLEM - Host mw1043 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:49:32] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [11:53:44] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [11:57:02] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:58:23] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [12:08:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.636 seconds [12:24:38] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [12:24:38] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [12:34:32] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [12:37:32] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [12:37:41] PROBLEM - Host mw1128 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [12:49:59] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [12:54:28] !log also going to reboot all pmtpa search nodes. not in prod, but are still freaking out from leap second bug. 
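Twelve hours in, the pmtpa search nodes are still affected and get rebooted as well. For reference, the workaround that circulated widely for the 2012 leap-second hang (the log above shows restarts and reboots instead, so this is background, not what was done here) was simply to set the clock to its current value, which resets the kernel timekeeping state that left Java and MySQL threads busy-looping:

```python
#!/usr/bin/env python3
"""Widely circulated 2012 leap-second workaround, shown for reference only:
re-setting the wall clock to its current value clears the kernel state that
left processes spinning after the leap second. Must be run as root."""
import subprocess

# Read the current time, then write it straight back with date -s.
now = subprocess.check_output(["date"]).decode().strip()
subprocess.check_call(["date", "-s", now])
print("clock re-set to:", now)
```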
[12:54:39] Logged the message, notpeter [12:57:20] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [12:58:14] PROBLEM - Host search20 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:23] PROBLEM - Host search14 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:23] PROBLEM - Host search13 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:32] PROBLEM - Host search30 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:32] PROBLEM - Host search25 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:32] PROBLEM - Host search18 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:33] PROBLEM - Host search28 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:33] PROBLEM - Host search27 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:33] PROBLEM - Host search16 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:41] PROBLEM - Host search22 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:50] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:50] PROBLEM - Host search19 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:50] PROBLEM - Host search24 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:59] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [12:58:59] RECOVERY - Host search14 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [12:58:59] RECOVERY - Host search16 is UP: PING OK - Packet loss = 0%, RTA = 2.01 ms [12:58:59] PROBLEM - Host search34 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:59] PROBLEM - Host search26 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:00] PROBLEM - Host search36 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:00] PROBLEM - Host search31 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:08] RECOVERY - Host search20 is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [12:59:08] RECOVERY - Host search13 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [12:59:08] PROBLEM - Host search29 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:17] RECOVERY - Host search18 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [12:59:26] PROBLEM - Host search35 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:26] PROBLEM - Host search21 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:02] PROBLEM - Host search33 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:02] RECOVERY - Host search19 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [13:00:11] RECOVERY - Host search24 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:00:20] RECOVERY - Host search21 is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [13:00:29] RECOVERY - Host search28 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [13:00:29] RECOVERY - Host search29 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [13:00:29] RECOVERY - Host search31 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [13:00:29] RECOVERY - Host search27 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:00:38] RECOVERY - Host search26 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [13:00:38] RECOVERY - Host search25 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [13:00:38] RECOVERY - Host search22 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:00:47] RECOVERY - Host search30 is UP: PING OK - Packet loss = 0%, RTA = 2.29 ms [13:00:47] RECOVERY - Host search34 is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [13:00:56] RECOVERY - Host search33 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [13:01:23] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:01:32] 
RECOVERY - Host search36 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [13:03:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [13:34:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:35] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [13:45:56] PROBLEM - Host mw1020 is DOWN: PING CRITICAL - Packet loss = 100% [13:52:32] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [13:55:23] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [14:00:11] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [14:01:32] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [14:02:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [14:26:23] r0csteady: Hi! I'm Asheesh again. [14:30:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:05] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [14:32:17] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [14:39:29] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:46:36] http://scribunto.wmflabs.org/index.php/Special:RecentChanges [14:46:52] we got quite a bit of spam there ^ [14:48:48] lol [14:48:54] there's no admins or crats on the wiki [14:49:35] It's beyond overrun [14:50:35] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [14:51:08] Need Tim or Patrick I guess [14:51:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.525 seconds [14:56:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds [14:58:05] PROBLEM - Host db1048 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:32] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [15:09:11] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:11:35] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:22:32] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [15:23:53] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:25:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:29] PROBLEM - Host mw1115 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [16:12:38] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:20:26] PROBLEM - 
Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:59] PROBLEM - Host mw1047 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:25:51] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [16:27:38] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [16:27:38] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [16:29:35] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [16:30:38] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [16:31:32] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [16:31:32] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [16:34:20] Tim-away: when you get up tomorrow morning, can you please look at the scribunto spam ? [16:34:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:34:32] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [16:34:32] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours [16:34:32] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [16:34:37] And i have no backups, so please don't delete all the content :D [16:35:35] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours [16:35:35] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours [16:36:38] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours [16:37:32] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [16:37:32] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours [16:37:33] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [16:39:38] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [16:41:35] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [16:42:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.550 seconds [16:42:38] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [16:43:32] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours [16:43:32] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [16:44:35] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [16:45:38] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [16:46:41] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [16:46:41] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours [16:47:08] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [16:48:38] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [16:49:50] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, 
down: 0, shutdown: 0 [16:50:35] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [16:50:35] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [16:50:35] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours [16:51:38] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours [16:52:32] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [16:56:35] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [16:57:50] hey, i am having trouble reaching office.wikimedia.org, is this already a known issue? [17:02:56] it loads for me [17:03:42] platonides: and also all images / jss / css files? [17:05:40] I don't have a login there, but apparently looking at the from page yes [17:06:27] okay, thanks, my other browsing is normal but office.wikimedia.org is doing weird [17:07:42] ohai paulprot1us! [17:08:07] Asheesh! [17:16:33] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [17:17:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:58] office works fine for me [17:36:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.810 seconds [18:07:23] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:08:44] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [18:12:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:04] Asheesh: I am working on the missions on openhatch.org! 
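A recurring pattern in this log is the "Puppetmaster HTTPS on stafford" check flapping between CRITICAL (socket timeout after 10 seconds) and OK, where the OK lines carry an HTTP 400: as configured here, any response inside the deadline counts as OK, so the check is effectively measuring whether the overloaded puppetmaster answers at all. A rough sketch of a probe with those semantics, assuming a placeholder URL; the real check is the standard Nagios HTTP plugin.

```python
#!/usr/bin/env python3
"""Timeout-style HTTP probe: any HTTP response within the deadline is OK, so a
400 from the puppetmaster still 'recovers' the check. Illustrative only; the
URL is a placeholder and this is not the production Nagios plugin."""
import time
import urllib.error
import urllib.request

URL = "https://puppetmaster.example:8140/"   # hypothetical puppetmaster URL
TIMEOUT = 10                                 # seconds, matching the log

start = time.time()
try:
    urllib.request.urlopen(URL, timeout=TIMEOUT)
    status = "OK"
except urllib.error.HTTPError:
    status = "OK"          # got an HTTP response (even a 400) before the deadline
except Exception:
    status = "CRITICAL"    # socket timeout, refused connection, TLS failure, ...
print("%s in %.3f seconds" % (status, time.time() - start))
```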
[18:34:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.125 seconds [18:38:44] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [18:38:44] PROBLEM - Host db1027 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:35] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [18:43:05] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [19:05:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:35] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [19:10:14] PROBLEM - Host mw1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:38] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [19:26:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.136 seconds [20:00:56] PROBLEM - Host mw1048 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:47] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:05] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:08] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.009 second response time [20:20:26] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.017 second response time [20:26:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [20:51:35] New patchset: Liangent; "Add a symbolic link to CREDITS for Change Ia02c3bcf." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13847 [20:54:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:45] @replag [20:58:47] Krinkle: [s1] db36: 2s, db32: 29s, db59: 2s, db60: 1s, db12: 2s [21:02:08] PROBLEM - Host mw1082 is DOWN: PING CRITICAL - Packet loss = 100% [21:02:17] PROBLEM - Host mw1087 is DOWN: PING CRITICAL - Packet loss = 100% [21:19:05] @replag [21:19:07] Krinkle: [s1] db36: 3s, db32: 31s, db59: 3s, db60: 3s, db12: 3s [21:20:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.726 seconds [21:22:14] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:31:23] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:40:59] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:53:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:05:17] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 7, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [22:06:47] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [22:14:17] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:14:44] PROBLEM - Host mw1123 is DOWN: PING CRITICAL - Packet loss = 100% [22:17:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [22:17:17] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:24:11] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:25:32] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours [22:25:32] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [22:35:35] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [22:38:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [22:46:14] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:46:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:08] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [22:48:29] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [22:51:56] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [22:52:41] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:17] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [22:53:26] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [22:54:02] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:54:02] RECOVERY - HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.008 second response time [22:54:47] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK - HTTP/1.1 302 Found - 0.007 second response time [23:11:53] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:12:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [23:27:47] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:39:38] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [23:40:41] PROBLEM - Etherpad HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:36] PROBLEM - HTTP on hooper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:59] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 213 seconds [23:55:59] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 213 seconds [23:55:59] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 214 seconds [23:56:08] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 224 seconds [23:56:26] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 239 seconds [23:56:44] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 260 seconds [23:56:53] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 267 seconds [23:56:53] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 269 seconds [23:57:02] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 275 seconds [23:57:11] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 283 seconds [23:57:11] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 285 seconds [23:57:11] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 285 seconds [23:57:11] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 288 seconds [23:57:20] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 294 seconds [23:58:42] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 27 seconds [23:58:50] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 11 seconds [23:59:08] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 194 seconds [23:59:26] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 222 seconds [23:59:53] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 4 seconds [23:59:53] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 24 seconds [23:59:53] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 246 seconds
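The log closes with a burst of MySQL replication-delay alerts; these checks, like the @replag bot earlier in the log, report the replica's Seconds_Behind_Master. A minimal sketch of reading that value, assuming the PyMySQL driver and placeholder connection details:

```python
#!/usr/bin/env python3
"""Read MySQL replication lag (Seconds_Behind_Master), the figure behind the
'MySQL Slave Delay' checks and the @replag output above.
Host and credentials are placeholders."""
import pymysql

conn = pymysql.connect(host="db36.example", user="monitor", password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

# None means replication is not configured or not running on this host.
lag = status["Seconds_Behind_Master"] if status else None
print("replication delay: %s seconds" % lag)
```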