[00:33:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:37:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.885 seconds
[01:10:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:16:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.475 seconds
[01:49:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.700 seconds
[01:56:36] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 640s
[01:56:45] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 647s
[01:57:48] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[02:00:48] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
[02:03:48] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[02:03:48] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[02:27:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:30:00] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[02:30:45] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[02:31:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.115 seconds
[02:34:03] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[02:52:30] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[03:08:34] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s
[03:09:18] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[04:11:49] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours
[04:45:05] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:22:07] RECOVERY - Disk space on db30 is OK: DISK OK
[05:22:34] RECOVERY - MySQL disk space on db30 is OK: DISK OK
[05:36:04] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours
[06:31:48] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[08:10:29] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out
[08:11:59] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[08:14:14] RECOVERY - Lucene on search9 is OK: TCP OK - 0.002 second response time on port 8123
[08:14:23] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[08:32:59] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[08:58:11] PROBLEM - Lucene on search4 is CRITICAL: Connection refused
[09:53:45] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[09:54:48] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.006 second response time on port 8123
[09:56:09] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:58:42] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[10:00:57] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[10:37:49] RECOVERY - Lucene on search15 is OK: TCP OK - 8.994 second response time on port 8123
[10:46:04] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[11:04:22] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[11:05:25] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.005 second response time on port 8123
[11:13:09] RECOVERY - Lucene on search15 is OK: TCP OK - 9.008 second response time on port 8123
[11:20:39] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[11:23:39] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[11:26:03] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.005 second response time on port 8123
[11:29:41] New patchset: Pyoungmeister; "making lucene check even less sensitive" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2671
[11:31:08] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2671
[11:31:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2671
[11:58:17] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[12:01:17] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
[12:04:17] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[12:04:17] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[12:05:47] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[12:06:50] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[12:37:35] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[12:42:50] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.006 second response time on port 8123
[12:58:45] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[12:59:48] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123
[13:34:18] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[13:35:21] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.004 second response time on port 8123
[13:39:42] PROBLEM - LVS Lucene on search-pool3.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[13:39:42] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[13:40:45] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123
[13:43:27] RECOVERY - LVS Lucene on search-pool3.svc.pmtpa.wmnet is OK: TCP OK - 0.006 second response time on port 8123
[13:47:02] !log disabling notifications for search lvs... if anyone still has their phone on
[13:47:05] Logged the message, and now dispatching a T1000 to your position to terminate you.
[13:47:24] I do but I have been wondering what to do about it, since you were working on it
[13:47:55] apergos: the answer is spin up more search hosts a year or two ago... :/
[13:48:06] * apergos works on a time machine
[13:48:14] I worked on the eqiad infrastructure yesterday
[13:48:17] and will again today
[13:48:19] ahh
[13:48:23] fabulous
[13:48:28] once that is up, hopefully we can just point all traffic at that
[13:48:46] any chance... we can split up the traffic?
[13:48:58] eventually, yes
[13:49:02] iunno
[13:49:03] cool
[13:49:15] but, all of the search boxes in pmtpa need to be reimaged
[13:49:18] they're running hardy
[13:49:21] and... karmic
[13:49:25] ok
[13:49:39] so a fullscale switchover would be a good thing
[13:49:58] I mean, we could kill the index rebuild proc on searchidx2
[13:50:13] so that would stop the insane rsyncing that is thrashing the search hosts
[13:50:43] but, I'm also guessing that killing that halfway through would be bad
[13:50:43] gotcha
[13:50:55] and probably yield a corrupted index
[13:51:25] could well be
[13:51:27] note: guessing. I really wish that there was better documentation on the code itself
[13:51:36] you do eh?
[13:51:40] fancy that...
[13:52:01] yes. trying to deal with a magical java mystery box isn't my idea of fun
[13:52:09] me neither
[13:52:18] RECOVERY - Lucene on search15 is OK: TCP OK - 8.998 second response time on port 8123
[13:52:36] sounds like you should be sleeping now
[13:53:04] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=&c=Search&h=search6.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
[13:53:09] let's play name the problem...
[13:53:57] nah, east coast time. been up since 630. also, need to unclog kitchen sink. my partner is getting a snake from the hardware store right now for that
[13:56:40] yeah, if you look at the graphs for the various search nodes, you can see when they're furiously rsyncing over a new index
[13:56:54] and that nicely lines up with when that lvs group goes down....
[13:58:14] yeah, I remember we talked about this a few days ago
[13:58:30] actually... last Sunday when we had this issue :-D
[13:58:31] and sadly, it's still fucked :(
[13:58:35] yep!
[13:59:53] it's really frustrating to be working on the oldest, cruftiest, most neglected, and least well documented part of our infrastructure
[14:00:00] and have it become a crisis *right now*
[14:00:17] :-(
[14:00:18] after being largely neglected for 3 years...
[14:00:32] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[14:00:45] it was pretty fun when it was a crisis earlier and there was nothing to do but try to remove random items before it ran out of space
[14:00:54] that pretty much put it on my permanent blacklist
[14:01:16] yeah
[14:01:24] I mean, it wouldn't be so bad if it wasn't so neglected
[14:01:29] and it was better documented
[14:01:40] so now you own this monstrosity
[14:01:51] about all I can give you is moral support I'm afraid
[14:02:34] meh. it'll be ok eventually
[14:02:43] search nodes in eqiad are good to go
[14:02:47] once you beat it into shape!
[14:02:48] just need to finish indexer
[14:03:07] but I've done that before!
[14:03:19] but the removal of nfs is making it slower going
[14:03:38] I do, however, need to figure out wtf packages are needed on the indexer
[14:03:45] php? the appserver pkg?
[14:04:06] php, yes, there's some dump-like things it does
[14:04:12] which I really wish we could axe from there
[14:04:52] gotcha. I was just going to grab everything in the appserver package list minus apache
[14:05:03] axe away!
[14:05:24] would have to figure out how the xml dumps are used and how we could replace that piece
[14:06:55] yeah, after this is no longer a crisis and eqiad is up, I'm going to spend some time working with oren to make sure he has what he needs in labs so that there can actually be code changes...
[14:07:10] is he actually taking it over then?
[14:08:03] in that he is the only person working on it, yes
[14:08:40] ah
[14:08:43] :-D
[14:09:13] ah, and disks are overflowing with debug info from gmond
[14:09:24] joy
[14:12:12] the DoS was coming from *inside the ops team*
[14:12:13] ;)
[14:12:50] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours
[14:14:30] heh
[14:14:37] bad sysadmins! bad!
[14:15:37] I mean, come on, a 2.8 gig log file?
[14:16:30] peanuts!
[14:17:23] yes. and you don't eat the same peanut over and over. you switch the peanuts that you eat. rotate them even, perhaps
[14:17:40] :-D
[14:45:50] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:10:53] PROBLEM - SSH on lvs6 is CRITICAL: Server answer:
[15:14:00] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[15:37:06] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours
[16:19:06] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[16:46:24] RECOVERY - Lucene on search15 is OK: TCP OK - 0.010 second response time on port 8123
[16:54:39] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[17:11:15] RECOVERY - Lucene on search15 is OK: TCP OK - 0.019 second response time on port 8123
[17:20:51] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[17:46:12] RECOVERY - Lucene on search15 is OK: TCP OK - 0.002 second response time on port 8123
[17:54:45] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[18:03:45] RECOVERY - Lucene on search15 is OK: TCP OK - 0.000 second response time on port 8123
[18:12:00] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[18:30:12] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:31:24] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[18:34:06] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[19:42:52] RECOVERY - Lucene on search15 is OK: TCP OK - 8.994 second response time on port 8123
[19:51:07] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[19:52:56] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2578
[19:54:05] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2611
[20:15:07] RECOVERY - Lucene on search15 is OK: TCP OK - 0.003 second response time on port 8123
[20:23:22] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[21:10:47] RECOVERY - Lucene on search15 is OK: TCP OK - 2.995 second response time on port 8123
[21:17:05] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours
[21:19:11] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[21:59:05] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[22:02:05] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
[22:05:05] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[22:05:06] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[22:16:11] RECOVERY - Lucene on search15 is OK: TCP OK - 0.004 second response time on port 8123
[22:29:08] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
[23:13:41] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[23:14:26] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[23:17:26] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[23:25:23] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time
[23:31:59] RECOVERY - Lucene on search15 is OK: TCP OK - 2.997 second response time on port 8123
[23:40:23] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out
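Editor's note: the "making lucene check even less sensitive" change merged at 11:31 (https://gerrit.wikimedia.org/r/2671) loosens the monitoring check that kept flapping while searchidx2 rsynced new indexes to the search hosts. The actual patch is in operations/puppet and is not reproduced here; the following is only a minimal standalone sketch of the idea, assuming "less sensitive" means a longer connect timeout plus a few retries before raising CRITICAL. Port 8123 matches the log lines above, but the timeout, retry, and wait values are illustrative assumptions.

```python
#!/usr/bin/env python
"""Nagios-style TCP check for a Lucene search daemon.

Illustrative sketch only: the real change lives in operations/puppet (r2671).
Port 8123 comes from the log above; timeout/retry values are assumptions.
"""
import socket
import sys
import time

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_port(host, port, timeout=30.0, retries=3, wait=5.0):
    """Return (exit_code, message); go CRITICAL only after all retries fail."""
    for attempt in range(1, retries + 1):
        start = time.time()
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            return OK, "TCP OK - %.3f second response time on port %d" % (
                time.time() - start, port)
        except (socket.error, socket.timeout) as exc:
            if attempt < retries:
                time.sleep(wait)  # back off and retry instead of alerting right away
                continue
            return CRITICAL, "TCP CRITICAL - %s on port %d" % (exc, port)


if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    code, message = check_port(host, 8123)
    print(message)
    sys.exit(code)
```

In a Nagios/Icinga setup the same effect would more likely come from raising max_check_attempts and the retry interval on the existing service definition rather than shipping a new plugin; the sketch just makes the "alert only on sustained timeouts" behaviour concrete.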