[03:11:54] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:11:55] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:41:24] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:25] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:56] PROBLEM - RAID on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:56] PROBLEM - RAID on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:26] PROBLEM - Disk space on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:26] PROBLEM - Disk space on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:27] ugh, the nagioses are mating again [03:50:56] PROBLEM - DPKG on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:50:56] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:50:57] PROBLEM - DPKG on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:50:57] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:16] RECOVERY - RAID on srv273 is OK: OK: no RAID installed [03:58:17] RECOVERY - RAID on srv273 is OK: OK: no RAID installed [04:00:06] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [04:00:06] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [04:00:36] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:00:37] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:00:56] RECOVERY - DPKG on srv273 is OK: All packages OK [04:00:56] RECOVERY - DPKG on srv273 is OK: All packages OK [04:02:56] RECOVERY - Disk space on srv273 is OK: DISK OK [04:02:56] RECOVERY - Disk space on srv273 is OK: DISK OK [04:43:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [04:43:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [07:01:40] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [07:01:41] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [07:11:20] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:11:20] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:47:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [07:47:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [08:09:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:09:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:19:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [08:19:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [09:07:55] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=60%): /var/lib/ureadahead/debugfs 99 MB (1% inode=60%): [09:07:56] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=60%): /var/lib/ureadahead/debugfs 99 MB (1% inode=60%): [09:25:35] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=60%): /var/lib/ureadahead/debugfs 1 MB (0% inode=60%): [09:25:36] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=60%): /var/lib/ureadahead/debugfs 1 MB (0% inode=60%): [09:27:15] RECOVERY - Disk space on srv222 is OK: DISK OK [09:27:16] RECOVERY - Disk space on srv222 is OK: DISK OK [09:49:30] RECOVERY - Disk space on srv221 is OK: DISK OK [09:49:31] RECOVERY - Disk space on srv221 is OK: DISK OK [09:53:50] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443510 MB (3% inode=99%): [09:53:51] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443510 MB (3% inode=99%): [09:55:40] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 434584 MB (3% inode=99%): [09:55:41] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 434584 MB (3% inode=99%): [10:02:30] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:02:31] RECOVERY - MySQL slave status on es1004 is OK: OK: [13:25:40] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:25:41] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [14:25:53] New patchset: Catrope; "Fix Nagios job queue check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1766 [14:33:42] New review: Dzahn; "works on spence." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1766 [14:33:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1766 [14:45:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [14:45:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [15:24:56] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [15:24:57] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [15:34:37] New patchset: Dzahn; "job_queue: tweak retry check interval" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1767 [15:35:37] New review: Dzahn; "keep the regular interval at 15 minutes, but if it fails once (SOFT), keep re-checking every 5 minut..." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1767 [15:35:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1767 [15:36:06] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (77620) [15:36:07] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (77620) [15:40:29] robh: good morning [15:41:19] morning =] [15:41:53] i have 8 shiny new servers here w/ no ticket [15:44:02] can you read me the rt# off the packing slip? [15:45:15] 2125 [15:45:50] ahhhhh, ok [15:46:52] cmjohnson1: ok, assigned that to you to receive it in [15:46:55] and making a racking ticket now [15:48:38] cmjohnson1: so B4, sdtpa looks like the place to put these. I am not physically there, but do you see any issues? (I checked power, but please check it on the unit as well so we are both doing it) [15:49:35] !log torrus deadlocked, kicking [15:49:36] Logged the message, RobH [15:49:47] robh: i shouldn't have any issues with power on B4 [15:50:05] cool, i think thats where we need to put those, not a lot of other options =] [15:50:22] i will make a ticket for it now, two of them have higher priority since they are allocated for ben to use for swift [15:50:24] any plans for c3? [15:50:42] New patchset: Hashar; "class to install Apache Maven" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [15:50:43] c3 is mostly empty, i think i wanna keep it open for potential larger machines [15:50:46] if we need more dbs, etc... [15:50:55] b4 is already a lot of misc servers [15:50:56] New patchset: Hashar; "Add Apache Maven to gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [15:50:59] k..snds good! [15:52:28] argh [15:52:30] i hate naming servers. [15:52:34] what to call them.... [15:52:39] two are swift front ends [15:52:55] but they are media store front ends... msfe... [15:52:58] swiftfe... [15:53:03] i dislike using software names in the server names. [15:53:45] ms-fe1, ms-fe2..... [15:53:52] hrmm... [15:53:57] ms501 ? [15:54:06] these are front ends [15:54:07] ms1-ms3 , ms1001-ms1003 are backends [15:54:09] so they dont actually store. [15:54:18] Why not make the frontends ms501-503 and 1501-1503 [15:54:38] hmm, then 'ms' is a misnomer, right [15:54:42] yep [15:54:47] ms-fe media store front end... [15:54:52] that seems to make sense to me [15:54:52] New review: Dzahn; "typo: ensure => lastest; != latest" [operations/puppet] (production); V: -1 C: -1; - https://gerrit.wikimedia.org/r/1768 [15:55:21] well, the truth is no matter what, some folks will hate the name =P [15:56:30] heh, bens gonna hate the name [15:56:32] http://rt.wikimedia.org/Ticket/Display.html?id=2200 [15:56:36] cmjohnson1: all yers ^ [15:57:06] thx....only one issue that is not really an issue...no ability for redundant power [15:57:27] yea, we knew that could happen ordering them [15:57:35] so the secondary power supply you should just slightly unseat [15:57:43] so it still blocks airflow, but doesnt detect in system [15:57:47] so it wont error for it not having power [15:57:56] RobH: I don't blame you, there's no good name for these thinsg [15:58:18] !log torrus back, took forever to recompile [15:58:19] Logged the message, RobH [15:58:40] Why not name them trucker handles? 
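
The retry tweak merged above as change 1767 is only described in Dzahn's truncated review comment. Purely for context, here is a minimal sketch of the Nagios service directives that express that behaviour: check every 15 minutes while OK, but once a failure puts the service in a SOFT state, re-check every 5 minutes until it recovers or goes HARD. This is not the actual patchset; the template, host and command names are assumptions, and the intervals are in Nagios time units (60 seconds by default).

    define service {
        use                     generic-service   ; assumed template name
        host_name               spence            ; the check alerts for spence in the log
        service_description     check_job_queue
        check_command           check_job_queue   ; assumed command name
        normal_check_interval   15   ; poll every 15 minutes while the service is OK
        retry_check_interval    5    ; after a SOFT failure, re-check every 5 minutes
        max_check_attempts      3    ; SOFT failures allowed before the state goes HARD
    }

With values like these, a single transient job-queue spike stays a SOFT problem, while a backlog that survives two more retries goes HARD and notifies.
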
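Similarly, change 1768 ("class to install Apache Maven") gets a -1 above for the typo "ensure => lastest". A rough sketch of what such a class presumably boils down to, with the typo fixed; the package and node names are guesses, not the real manifest.

    # Not the actual contents of https://gerrit.wikimedia.org/r/1768.
    # "lastest" is not a value Puppet's package type accepts; it has to be
    # "latest" (or "installed"/"present", or an explicit version string).
    class maven {
        package { 'maven2':        # package name is an assumption (Maven 2 era)
            ensure => latest,
        }
    }

    # Change 1769 ("Add Apache Maven to gallium") would then presumably be a
    # one-liner in the node definition:
    node 'gallium.wikimedia.org' {  # node name is an assumption
        include maven
    }
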
hooch, snowman(smokey and the bandit), etc [15:59:10] lol [15:59:25] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [15:59:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [15:59:30] cmjohnson1: making me lugh is mean, i have a cold and i just nearly passed out coughing [15:59:31] ;p [16:00:22] sorry bout that! [16:01:01] you could name em after country n western singers [16:01:08] then I could kill you in your sleep for doing that [16:01:21] http://home.earthlink.net/~roygbvgw/namsfd.html#funny%20hillbilly%20names [16:04:07] * apergos is installing macos on their macbook and cursing every minute of it [16:04:30] awesome [16:04:56] yeah real awesome [16:05:18] * apergos curses alternatively: juniper, apple, matshita, j schiller and fedora [16:05:19] if you're going to use a macbook at least run windows7 on it! [16:05:21] oh and lawyers [16:07:52] Why are you installing MacOS? [16:12:41] cause the stupid juniper courses have a platform that won't work with linux [16:12:45] nor with wine [16:14:51] robh: can you look at this when you get a chance and get back to me http://rt.wikimedia.org/Ticket/Display.html?id=2193 [16:15:21] checkin [16:16:32] cmjohnson1: updated, the output looks like the hdd died, and that server is under warranty until next month =] [16:16:43] so good timing for it to die now, rahter than in 60 days =] [16:17:01] right! okay, i will call on that this week and get a new HDD [16:17:03] chekcing the ilom too [16:17:25] bah, no lom errors, only the OS error for bad sectors [16:17:34] so the drive isnt fully dead (the motor works) but its dying [16:17:41] dell will prolly ask about that [16:17:50] (i went into the drac and did racadm getsel) [16:18:03] if its a full hdd death, it shows there, bad sectors really only show in os logs not drac logs [16:18:06] (just fyi) [16:52:37] * jeremyb wonders if someone could kill the extra nagios-wm [16:53:05] is it puppetized or otherwise in version control? [16:53:18] * jeremyb would put in a lock file ;-) [16:53:20] I believe it is [16:55:41] oooh, looks like c series is in (above) [16:58:50] apergos: who is matshita? [17:00:28] dvd drive manufacturer [17:00:34] RobH: ya'll know that swift has some stuff built in to detect dying spindles? [17:01:01] i was told by ben yea, it directly handles all disks [17:01:29] jeremyb: the new c series sint ordered, there is a bit of confusion [17:01:32] so its not in ;] [17:01:45] RobH: i thought that was the shinyness! [17:01:59] or maybe it was just that it's really good to watch the swift logs... i can't remember how it presents [17:02:05] we got in two swift frontends, and some other high performance misc servers for future use [17:02:15] but the c series is a storage brick, so its not in yet [17:02:25] a real ms not an msfe [17:05:10] yep [17:10:17] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:10:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:17:53] * jeremyb starts a petition to shoot nagios-wm in the head [17:18:16] * apergos signs that petition [17:18:21] right after the onefor shooting solaris [17:18:30] for defenestration of dataset1 [17:18:36] and a few others... 
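
The dying-disk triage RobH walks cmjohnson1 through above (RT 2193) is worth condensing: a totally dead drive shows up in the DRAC's system event log, while bad sectors on a drive whose motor still spins usually only appear in the OS logs. A sketch of that check, not a verbatim session; the SMART step is an addition not mentioned in the log, and /dev/sda is a placeholder.

    # On the DRAC (racadm getsel is the command RobH mentions): a fully dead
    # drive or a controller fault lands in the system event log.
    racadm getsel

    # On the host itself: bad sectors normally only surface in the kernel/OS
    # logs, not in the SEL.
    dmesg | grep -i 'i/o error'
    grep -i 'bad sector\|medium error' /var/log/syslog

    # Extra step (not in the log): ask the drive directly via SMART.
    smartctl -H -A /dev/sda
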
[17:18:54] hah, had to relookup defenestration [17:19:02] ;-) [17:19:31] i remember reading about a human defenestration on enwp in the eastern bloc [17:20:41] Jeff_Green: storage3 has been repaired http://rt.wikimedia.org/Ticket/Display.html?id=2161 [17:20:47] i'm thinking [[Jan Masaryk]] [17:21:18] cmjohnson1 - thks! [17:21:19] cmjohnson1: great, thank you! [17:21:32] apergos: you know i meant the dupe bot, right? [17:21:41] placing bets as to the # of days before it drops yet another drive [17:21:42] yes I do [17:21:57] but puppet and spence could both go into the queue of shoot now ask questions later [17:22:05] heh [17:22:07] and so could nagios the way it's set up now [17:24:13] New patchset: Jgreen; "puppetizing fundraising jenkins maintenance cron (oh the irony) scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:32:34] New patchset: Jgreen; "puppetizing fundraising jenkins maintenance cron (oh the irony) scripts typofix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:33:25] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1770 [17:33:26] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:36:35] notpeter: you around today/yet? [17:37:09] notpeter: i.e. want to talk search in #-labs? [17:39:40] sure [17:40:59] del [17:41:01] ergh [17:49:17] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [17:49:18] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [17:57:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [17:57:09] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [18:03:36] huh, i am remembering to eat before 3pm... [18:17:48] New patchset: Jgreen; "fundraising mail config for aluminium/grosley" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:25:54] New patchset: Dzahn; "give sudo access to khorn on grosley/aluminium per RT 2196" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1772 [18:26:27] New patchset: Jgreen; "fundraising mail config for aluminium/grosley (typofix)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:28:11] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1771 [18:28:12] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:28:27] today is day of amended commits . . . :-( [18:30:57] "no comment" is fun [18:31:34] unfortunately you also see that when in fact there have been comments (if they are just inline comments) [18:32:07] New review: Dzahn; "approved by woosters" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1772 [18:32:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1772 [18:40:16] argh [18:40:21] why wont juniper lemme register for a class [18:40:34] what's breaking rob ? [18:40:41] didn't we go over this? because you're on linux [18:40:54] i click register on the class page, nothing happens [18:40:56] i run os x [18:44:30] you can call up their support - 1-888-314-5822 (choose the customer care option) [18:44:48] register from linux :-P [18:44:53] I was able to make that work at least... 
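
The fundraising Jenkins maintenance crons puppetized above (change 1770, amended twice) are not shown in the log. Purely as illustration, "puppetizing a maintenance cron" usually comes down to a cron resource along these lines; every name, path and time here is made up.

    # Generic illustration only, not the contents of
    # https://gerrit.wikimedia.org/r/1770.
    cron { 'fundraising_jenkins_maintenance':
        ensure  => present,
        user    => 'jenkins',                                # assumed user
        command => '/usr/local/bin/jenkins-maintenance.sh',  # assumed script path
        hour    => 3,
        minute  => 15,
    }
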
[18:45:09] * apergos now installing 9 more mac os updates... [18:45:12] someday it will be done [18:45:19] it may be who i chose, when i click register in chrome it pulls up the training center [18:45:24] rather than stay in juniper pages. [18:45:26] sigh [18:46:58] oh in chrome [18:47:01] try ff [18:47:12] yea, its owrking now [18:47:14] Fx* :P [18:47:23] but pulls up the vendor, not juniper when i pull up class from link to register [18:47:39] hrm, i wonder if that's new with all their third party trainers ? [18:48:06] yea, seems to be, i just tried [18:48:07] another one [18:48:28] https://en.wikipedia.org/wiki/Firefox#cite_ref-25 [18:54:56] yay, got it [18:55:11] so junipers page on their schedule was wrong, but i signed up for the entire certification course in a two day [18:55:14] feb 20,21 [18:55:16] cool [18:55:17] :) [18:55:27] thanks for helping me pick courses earlier =] [18:55:42] no prob [18:56:53] so the credits have to be booked before feb [18:57:08] so i can possibly register for a more advanced course when i finish with this [18:57:13] yep, but can be used whenever [18:57:18] LeslieCarr: is that right, its just book by feb, book for whenever [18:57:19] cool [18:58:06] not sure if anything more advanced would be helpful yet, but will know when finished with the basic courses [20:08:28] New patchset: Hashar; "Add Apache Maven to gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [20:08:43] New patchset: Hashar; "class to install Apache Maven" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [20:19:37] robh: power disto on ps1-b4 is pretty balanced...i cannot touch anything on AC phase so this looks to be as good as it gets...take a look and lemme know [20:20:49] looks like z is under the others, but otherwise close =] [20:21:13] lookin at torrus, my proxy connection to internal vlan isnt working for some reason [20:22:00] hacking at it now to fix it, but i think yer all set [20:22:59] I can move one or two to z but there is not much room left for more growth on that rack [20:24:03] cmjohnson1: ok, i am on strip now [20:24:09] yea if ya look its 11, 10, 8 [20:24:15] Z has to come up if possible [20:24:17] right [20:24:41] but if it cannot, its not the end of the world, its technically close enough to not be in an alarm state [20:24:45] just not as nice [20:25:25] right....z is pretty full and i wanted to keep the cables close but I can move it around. 
[20:25:35] please do a bit [20:25:37] New patchset: Pyoungmeister; "adding searchidx.cfg to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:27:34] New patchset: Pyoungmeister; "adding searchidx.cfg to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:28:00] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1773 [20:28:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:37:30] New patchset: Hashar; "integration: make homepage URLs relative" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1774 [20:43:06] RECOVERY - ps1-d3-pmtpa-infeed-load-tower-A-phase-Z on ps1-d3-pmtpa is OK: ps1-d3-pmtpa-infeed-load-tower-A-phase-Z OK - 1188 [20:43:07] RECOVERY - ps1-d3-pmtpa-infeed-load-tower-A-phase-Z on ps1-d3-pmtpa is OK: ps1-d3-pmtpa-infeed-load-tower-A-phase-Z OK - 1188 [20:49:36] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:49:37] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:04] notpeter: ping? [20:53:16] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:17] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:30] that looks bad... [20:53:42] yeah [20:53:58] notpeter: when you're done firefighting pls stop by #-labs again :) [20:54:39] kk [20:55:12] jeremyb: it wouldn't look that bad if there was corrent amount of nagios-wm :P [20:55:28] http://ganglia.wikimedia.org/ [20:55:30] correct* [20:55:34] yes [20:55:35] :) [20:55:45] search, scalers, pdf, api out [20:56:23] hi, who can help me setup a simple filter to track NARA related images on the image squids for a couple of days? [20:56:32] mid outage at the moment [20:56:39] so noone at the moment ;] [20:56:49] oh, that's all [20:57:01] but nagios isn't showing anything broken... [20:57:08] hrm [20:57:10] I trust ganglia more than nag [20:57:11] drdee: in the mean time you want to debug ssh in #-labs? [20:57:25] oh, yeah [20:57:28] but that's a data point [20:57:41] another datapoint is RobH's upload failed [20:57:43] that's a weird collection of things [20:57:51] is an lvs box down? [20:57:57] commonist uses api i would think [20:58:01] I expect so [20:58:18] I guess [20:58:19] grrr [20:58:24] jeremyb: sure, how can I help? [20:58:32] ok, I am not exactly sure how to fix the api issue [20:58:33] which lvs are thos on? [20:58:33] no response now from ganglia. [20:58:51] drdee: i just poked you in #wikimedia-labs [20:59:07] sorry didn't see it [20:59:14] np :) [20:59:27] apergos: lvs3, on it now [20:59:31] ok [21:00:09] well, that has the api pool, maybe wrong... [21:00:32] lvs4 is the active one [21:01:33] now even ganglia's broken [21:01:41] cuz everone went on it. [21:01:48] so fragile [21:01:55] =[ [21:02:27] so the lvs isnt even pushing connections to the api cluster [21:02:30] they are all at 0 [21:02:53] Max concurrent service checks (64) has been reached. (nagios) [21:02:56] so that's probably that [21:03:03] RobH: I can try checking the lvs ? [21:03:11] the lvs shows the api servers up and pooled [21:03:24] I dunno why its not actually sending them traffic, cuz pybal thinks they are up [21:03:43] restart it [21:03:49] lvs3 has the loopback for api.svc.pmtpa.wmnet, but lvs4 doesn't [21:04:08] binasher: it needs the loopback to function? 
[21:04:15] yes [21:04:25] did puppet just run and strip it out or something? =/ [21:05:17] loopback address are present on lvs3 [21:05:17] i see the same loopback interfaces on both, i must not be chekcing right place [21:05:22] hmm pybal got restarted [21:05:39] where do you list the loopbacks for it? (notpeter?) [21:05:47] ip addr [21:06:06] ahhh [21:06:15] i see what you guys are talking about, wtf made it go away =P [21:06:29] i have not run anything, but would puppet add it back if run? [21:07:32] (I got the timeout in commonist and figured it was the app, not the cluster, heh) [21:08:16] !log ran ip addr add 10.2.1.22/32 label "lo:LVS" dev lo on lvs4 [21:08:17] Logged the message, Master [21:08:50] !log that fixed it. but how did that happen? [21:08:51] Logged the message, Master [21:09:04] API is reported back up [21:09:18] that.. was crazy. [21:09:27] that was muy crazy [21:09:32] wtf removed it? [21:10:51] puppet running also prolly would have fixed it, cuz its included in its puppet config [21:11:02] if i am reading it right that is. [21:11:19] stuff's been out for more than 25 mins acccordin to ganglia though [21:11:49] tougher outage to catch right away [21:11:52] need to go meet a friend for lunch, will help hunt down what happened if its still a mystery when i get back [21:11:59] less folks bitch about api downtime [21:12:02] yeah just wondering if puppet really would have got it [21:12:15] me too, i didnt wanna run it while asher was working on it though [21:12:26] binasher: pretty sure its still gonna be a mystery, have a good lunch ;] [21:12:48] apergos: what was exact time of it? [21:12:58] I can't load ganglia atm... [21:13:09] around 20:40 utc [21:13:40] i think spence is overloaded right now [21:14:03] that wil be anotehr reason nagios didn't have much to say [21:14:15] gotta split that off onto its own host [21:14:36] Jan 3 19:48:04 lvs4 puppet-agent[20518]: (/Stage[main]/Lvs::Balancer/File[/etc/pybal/pybal.conf]/content) content changed [21:14:40] that's odd [21:14:49] ganglia tended to break sometimes when tons of folks hit in the past, let alone now with spence being far too burdened [21:16:03] those ips arent in the pybal.conf through [21:16:24] they are listed in the main site.pp, but they may be assigned via pybal.conf..... still diggin [21:16:52] who's on spence mgmt right now ? [21:16:56] hrmm, nope, site.pp also has the info to tag the interface with the info [21:17:03] LeslieCarr: not I, you getting com2 error? [21:17:12] yep, saying already in use [21:17:19] ιτ λιεσ [21:17:21] er [21:17:22] it lies [21:17:23] if someone stayed on it till timeout, its just an error [21:17:34] they hit timeout, are forced out, but the port doesnt get freed up [21:17:39] known issue on drac/5 [21:17:41] lame [21:17:44] only fix is racadm racreset [21:17:46] any good way to free the port ? [21:17:48] and wait for drac to come back [21:17:52] okay [21:17:58] (it wont affect actual server, just the lights out manager) [21:18:11] !log resetting DRAC 5 on spence for management connectivity [21:18:12] Logged the message, Mistress of the network gear. [21:18:45] hrmm, they havent confirmed my training dates, they said it takes up to 24 hours... lame. 
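
The fix binasher logs above comes down to re-binding the service address on the loopback of the active LVS host. Condensed into a check-and-repair pair of commands, using the address from the log (10.2.1.22 is api.svc.pmtpa.wmnet in this incident); whether a Puppet run would have restored it by itself is exactly the open question in the log.

    # Is the service IP actually bound to lo on the active balancer (lvs4)?
    ip addr show dev lo | grep 10.2.1.22

    # If not, re-add it, as was done at 21:08:
    ip addr add 10.2.1.22/32 label "lo:LVS" dev lo
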
[21:19:01] oh, nvm, new pybal.conf is pushed out each run [21:19:05] yeah you'll get an email pretty soon though [21:19:17] i guess the training folks manually review them [21:19:23] and ensure its legit and they are giving the class [21:19:53] so basically i signed up on the training vendor site, and in payment (linked to that site when hitting register on juniper site) they ask for the juniper credit code like normal [21:19:56] and… spence is giving me nada [21:20:01] so seems legit, just has human interaction [21:20:05] LeslieCarr: on the serial console? [21:20:05] cmjohnson1: are you onsite ? [21:20:09] yep RobH [21:20:14] we can reboot it remotely [21:20:15] lesliecarr yes [21:20:18] if needed, lemme look [21:20:37] LeslieCarr: can you kick off port for a moment? [21:20:59] for the record, spence is an r300 [21:21:02] which sucks. [21:21:19] luckily, we just purchased 6 more shiny high performence servers for sdtpa, chris is working on racking them today =] [21:21:24] yay [21:21:39] RobH what's the disconnect for this one again ? [21:21:47] since it's diff than the other ones [21:21:51] ctrol + \ [21:21:55] nm [21:21:56] got it [21:21:57] hehe [21:22:01] off [21:22:13] connect: com2 port is currently in use [21:22:16] heh, have to reset it again [21:22:23] resetting it now [21:22:36] dumb drac5 [21:23:20] LeslieCarr: you have done power cycles on dells right? [21:23:34] if not, you can do this instead of me, i just wanted to check it out first [21:23:42] yeah, racadm servaction hardreset ? [21:23:57] racadm serveraction powercycle, i confirm its just dead [21:24:11] so i am going to powercycle it now (since you have done it) [21:24:18] cool :) [21:24:20] !log spence is unresponsive to ssh and serial console, rebooting [21:24:21] Logged the message, RobH [21:25:02] lets see if it comes back =] [21:25:18] fingers crossed… [21:27:44] hrmm, booting [21:27:49] in os load now [21:28:13] !log nagios and ganglia down due to spence reboot, system still coming back online [21:28:14] Logged the message, RobH [21:28:23] 'restoring RRDs' [21:28:29] this always takes awhile =[ [21:30:35] poor spence. [21:30:41] -rw-r--r-- 1 root root 2499 2012-01-03 20:33 boot.log [21:30:44] its still struggling online [21:30:47] lvs4 just fell over [21:30:54] and the lo didn't back up properly... [21:31:01] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s) [21:31:02] ? [21:31:05] *come back up [21:31:14] there is a bootlog from today [21:31:24] from about an hour ago [21:31:30] hrmm [21:31:44] that would explain it, but why didnt it keep the lo info [21:31:45] beds were shat [21:31:56] oh, yeah, I mean, that's lame too [21:32:28] but that's how it lost it in the first place [21:33:17] i don't see that loopback in /etc/network/interfaces in either lvs [21:33:21] which would explain the loss ? [21:33:35] oh well it's added to lvs4 (is that new?) [21:33:37] but not lvs3 [21:34:00] oh no it's not, nevermind me, not on either [21:35:58] LeslieCarr: yea it confused the shit out of me too ;] [21:36:27] spence is stuck on Starting Ganglia Monitor Meta-Daemon: gmetad. [21:36:42] i think it went right back to being overloaded already [21:37:00] nagios is back up, as apache is working, but ganglia is borked [21:37:11] server is also still insanely slow [21:37:21] still doesnt ssh for me. [21:37:42] kill ganglia for now ? not as important as nagios ? 
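
The DRAC 5 recovery dance spread across the conversation above, collected in one place. The commands are the ones given in the log; the management hostname is a placeholder.

    # SSH to the box's DRAC; HOST.mgmt is a placeholder, not a real address.
    ssh root@HOST.mgmt

    # "connect com2" refusing with "port is currently in use" after a stale
    # session timed out is the known DRAC 5 bug; per the log the only fix is
    # resetting the controller itself (this does not touch the running server):
    racadm racreset
    # ...wait for the DRAC to come back, then reconnect...

    # If the server itself is wedged (no ssh, dead serial console), hard
    # power-cycle it:
    racadm serveraction powercycle
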
[21:38:05] cannot boot it to a command prompt in normal mode so far [21:38:07] and i guess that "get ganglia puppetized and running on another server" ticket is suddenly a do now [21:38:36] my serial console went unresponsive as well [21:40:00] .... i am not sure rebooting it again is going to fix anything. [21:40:09] it gets overloaded right away [21:40:11] single user mode? [21:40:41] racresetting again, com2 in use =P [21:41:05] its kind of amusing that our monitoring server was able to send out the outage page, then it died [21:41:12] atleast it felt obligated to warn us ;] [21:41:57] !log resetting spence and dropping to serial to try to fix it [21:41:58] Logged the message, RobH [21:42:42] heh, 4gb memory. [21:42:49] this server is slow. [21:43:02] single cpu dual core 3ghz, 4gb ram [21:43:13] robh: anything I can do from here? [21:43:25] cmjohnson1: not yet [21:43:46] as long as the serial console works, we are good, its booting again, damn it it didnt take my grub interrupt =P [21:44:34] ok, goign to let it try to normal boot once more, if it fails again I try to do single user again... [21:44:44] if my serial commands dont work, may need you cmjohnson1 to hook up console [21:47:55] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s) [21:48:31] !log spence back online, ganglia and nagios confirmed operational [21:48:32] Logged the message, RobH [21:48:38] it made it that time, odddddd [21:49:09] !log ganglia graphs will have missing data for past 30 to 40 minutes [21:49:10] Logged the message, RobH [21:49:31] cmjohnson1: we wont need the crash cart for spence after all, for now atleast ;] [21:49:40] Reedy: have a time? [21:49:51] It's 21:49 [21:49:52] alright...i will wheel it back to it's corner [21:49:54] :) [21:49:59] I meant, if you aren't busy [21:50:10] just on a phone call [21:50:40] ok, folks from -ops wanted to change the channel in downtime notice on wmf sites from #wikipedia to #wikimedia-downtime [21:50:56] can you do that or who can? [21:51:05] where? [21:51:44] when there is error in web server there is message that we are having some problems etc... visit #wikipedia on freenode for more... [21:51:54] something like that [21:51:59] they wanted to change the channel [21:52:18] because #wikipedia is getting spammed everytime when something is down [21:52:44] you know what I talk about? [21:53:26] the channel people should drop in on is #wikimedia-tech [21:53:30] i thought it read #wikimedia-tech, not wikipedia? [21:53:36] petan: ^ [21:53:39] they say it doesn't [21:53:49] in a 500 [21:53:59] i'm trying to find a 500... [21:53:59] another problem was that -tech is not controlled by any irc operators (only few people with +o there) [21:54:27] you don't have to be op to set the topic [21:54:28] (shhhh) [21:54:37] + they like that channel to be rather silent so that they can see what's going on, if 300 people come there I don't know if anyone would know what is going on [21:54:53] that's another problem [21:55:05] we expect people to show up in -tech, to ask qs and to get updated [21:55:07] when there is downtime we need to set +t [21:55:08] we have locked the topic in the past, it only results in folks never updating it. 
[21:55:09] we don't do our work in there [21:55:16] it's for updates [21:55:19] because there are too many trolling people messing with topic in that time [21:55:27] we really haven't had that happen [21:55:36] because there is #wikipedia in that message [21:55:46] if there was tech it probably would happen [21:55:49] no, I mean when the channel is full of folks coming in and asking [21:55:56] any ideas how to get a 500? [21:56:00] on cluster [21:56:00] we don't have people randomly resetting the topic [21:56:19] believe me, if that channel was in that message you would have many people doing that there [21:56:28] https://bugzilla.wikimedia.org/29599 don't work anymore :-P [21:56:36] we'll deal with it then [21:56:46] bah, closed wrong window [21:56:53] right, so you definitely want to have -tech there? [21:56:54] welcome [21:56:57] jeremyb: nothing comes to mind but my minds says it's almost midnight too and I forgot to eat dinner >_< [21:57:10] yes, wikimedia-tech [21:57:13] we dont want folks coming in operations when there is an outage ;] [21:57:24] ok, so can you change it from current name? [21:57:25] apergos: i just had lunch aka breakfast at 3:50pm! [21:57:28] nice job [22:02:44] i'm going to work on the ganglia server again and try to get the new one working... [22:02:53] first starting in labs [22:02:56] good luck [22:34:11] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [22:38:25] LeslieCarr: from way earlier, the vip loopbacks are created on lvs servers via the /etc/network/if-up.d/wikimedia-lvs-realserver script. everything under if-up.d/ is executed every time an interface is brought up, though that one is a no-op unless $IFACE = lo [22:42:47] RobH: [22:42:51] spence is flipped out again [22:46:05] i cannot even login to spence =P [22:50:28] !log stopping and then starting apache2 on spence to try and lower load [22:50:29] Logged the message, Mistress of the network gear. [23:34:23] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:34:24] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:42:13] * jeremyb stabs spence. you were multiply rebooted. why do you have 2 nagios-wm still?! [23:46:53] does anyone have an idea why dhcp-server on brewster won't actually hand out a lease to the new nickel, even though it is receiving the request and is running ? [23:50:20] * jeremyb pokes notpeter [23:50:35] LeslieCarr: what's the log say? [23:50:50] jeremyb: nothing - it's like it didn't see the request [23:50:57] althoguh the tcpdump says otherwise [23:51:10] tcpdump both sides? [23:51:14] yeah [23:51:29] the server side dumped from the router and brewster dumped on itself [23:51:47] why not just dump right on nickel? [23:52:27] because it needs an os to do that [23:52:38] and i'm trying to pxe boot it [23:52:46] which log are you looking at? [23:52:47] oh [23:52:50] brand new machine [23:53:03] well, try from a livecd and see what happens and then try the PXE again [23:53:05] /var/log/messages on brewster [23:53:11] or you're remote... [23:53:19] sadly can't livecd - it's in the dc, i'm in the office in sf [23:53:27] for remote connection you need a working network :P [23:53:43] ah [23:54:03] LeslieCarr: shut/no shut? [23:54:15] the port ? 
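
binasher's explanation above is also the key to how lvs4 lost the VIP without /etc/network/interfaces ever mentioning it: the address only comes back when loopback is brought up through ifupdown. A hypothetical sketch of a hook in the spirit of wikimedia-lvs-realserver, not the real script; the single VIP listed is just the address from today's outage.

    #!/bin/sh
    # Hypothetical stand-in for /etc/network/if-up.d/wikimedia-lvs-realserver.
    # ifupdown runs every executable in if-up.d/ whenever any interface comes
    # up, passing the interface name in $IFACE, so bail out unless it is lo.
    [ "$IFACE" = "lo" ] || exit 0

    # Bind the LVS service address(es) to lo so the box accepts traffic for
    # the VIP; a /32 on loopback is never announced or ARPed on the wire.
    # 10.2.1.22 is api.svc.pmtpa.wmnet in today's incident; a real script
    # would carry the full list of service IPs.
    for VIP in 10.2.1.22; do
        ip addr add "${VIP}/32" label "lo:LVS" dev lo 2>/dev/null || true
    done

That is consistent with what was pieced together earlier: lvs4 had rebooted (the fresh boot.log) and lo "didn't come back up properly", so the hook never re-added the address until it was done by hand.
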
[23:54:23] yeah on juniper [23:54:29] it's up and in the proper vlan [23:54:34] idk the juniper syntax [23:54:59] since the dhcp server is actually seeing the request on its port [23:55:02] yeah but just do `shutdown` wait 5 secs and then `no shutdown` [23:55:37] (but make sure you have it right for juniper... don't want to take the whole switch down!) [23:55:48] haha yeah, if there's one thing i know, it's juniper syntax [23:55:53] * jeremyb reads http://debianclusters.org/index.php/Troubleshooting_DHCP [23:56:25] can you take some other machine down and try to PXE it? [23:56:46] ahha [23:56:53] var/log/syslog had the info (thanks for the link) [23:57:00] it thinks it has no free leases [23:57:08] which is strange [23:57:19] yeah, i was wondering if it was the wrong log [23:59:01] * jeremyb pokes notpeter [23:59:35] ahha interesting, figured it out, for some reason it's not getting the dns entry