[03:11:54] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:11:55] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:41:24] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:25] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:56] PROBLEM - RAID on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:56] PROBLEM - RAID on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:26] PROBLEM - Disk space on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:26] PROBLEM - Disk space on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:43:27] ugh, the nagioses are mating again [03:50:56] PROBLEM - DPKG on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:50:56] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:50:57] PROBLEM - DPKG on srv273 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:50:57] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:16] RECOVERY - RAID on srv273 is OK: OK: no RAID installed [03:58:17] RECOVERY - RAID on srv273 is OK: OK: no RAID installed [04:00:06] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [04:00:06] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.019 second response time [04:00:36] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:00:37] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [04:00:56] RECOVERY - DPKG on srv273 is OK: All packages OK [04:00:56] RECOVERY - DPKG on srv273 is OK: All packages OK [04:02:56] RECOVERY - Disk space on srv273 is OK: DISK OK [04:02:56] RECOVERY - Disk space on srv273 is OK: DISK OK [04:43:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [04:43:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [07:01:40] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [07:01:41] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [07:11:20] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:11:20] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:47:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [07:47:30] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [08:09:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:09:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:19:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [08:19:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [09:07:55] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=60%): /var/lib/ureadahead/debugfs 99 MB (1% inode=60%): [09:07:56] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=60%): /var/lib/ureadahead/debugfs 99 MB (1% inode=60%): [09:25:35] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=60%): /var/lib/ureadahead/debugfs 1 MB (0% inode=60%): [09:25:36] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=60%): /var/lib/ureadahead/debugfs 1 MB (0% inode=60%): [09:27:15] RECOVERY - Disk space on srv222 is OK: DISK OK [09:27:16] RECOVERY - Disk space on srv222 is OK: DISK OK [09:49:30] RECOVERY - Disk space on srv221 is OK: DISK OK [09:49:31] RECOVERY - Disk space on srv221 is OK: DISK OK [09:53:50] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443510 MB (3% inode=99%): [09:53:51] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 443510 MB (3% inode=99%): [09:55:40] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 434584 MB (3% inode=99%): [09:55:41] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 434584 MB (3% inode=99%): [10:02:30] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:02:31] RECOVERY - MySQL slave status on es1004 is OK: OK: [13:25:40] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:25:41] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [14:25:53] New patchset: Catrope; "Fix Nagios job queue check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1766 [14:33:42] New review: Dzahn; "works on spence." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1766 [14:33:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1766 [14:45:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [14:45:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [15:24:56] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [15:24:57] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2288 [15:34:37] New patchset: Dzahn; "job_queue: tweak retry check interval" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1767 [15:35:37] New review: Dzahn; "keep the regular interval at 15 minutes, but if it fails once (SOFT), keep re-checking every 5 minut..." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1767 [15:35:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1767 [15:36:06] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (77620) [15:36:07] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (77620) [15:40:29] robh: good morning [15:41:19] morning =] [15:41:53] i have 8 shiny new servers here w/ no ticket [15:44:02] can you read me the rt# off the packing slip? [15:45:15] 2125 [15:45:50] ahhhhh, ok [15:46:52] cmjohnson1: ok, assigned that to you to receive it in [15:46:55] and making a racking ticket now [15:48:38] cmjohnson1: so B4, sdtpa looks like the place to put these. I am not physically there, but do you see any issues? (I checked power, but please check it on the unit as well so we are both doing it) [15:49:35] !log torrus deadlocked, kicking [15:49:36] Logged the message, RobH [15:49:47] robh: i shouldn't have any issues with power on B4 [15:50:05] cool, i think thats where we need to put those, not a lot of other options =] [15:50:22] i will make a ticket for it now, two of them have higher priority since they are allocated for ben to use for swift [15:50:24] any plans for c3? [15:50:42] New patchset: Hashar; "class to install Apache Maven" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [15:50:43] c3 is mostly empty, i think i wanna keep it open for potential larger machines [15:50:46] if we need more dbs, etc... [15:50:55] b4 is already a lot of misc servers [15:50:56] New patchset: Hashar; "Add Apache Maven to gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [15:50:59] k..snds good! [15:52:28] argh [15:52:30] i hate naming servers. [15:52:34] what to call them.... [15:52:39] two are swift front ends [15:52:55] but they are media store front ends... msfe... [15:52:58] swiftfe... [15:53:03] i dislike using software names in the server names. [15:53:45] ms-fe1, ms-fe2..... [15:53:52] hrmm... [15:53:57] ms501 ? [15:54:06] these are front ends [15:54:07] ms1-ms3 , ms1001-ms1003 are backends [15:54:09] so they dont actually store. [15:54:18] Why not make the frontends ms501-503 and 1501-1503 [15:54:38] hmm, then 'ms' is a misnomer, right [15:54:42] yep [15:54:47] ms-fe media store front end... [15:54:52] that seems to make sense to me [15:54:52] New review: Dzahn; "typo: ensure => lastest; != latest" [operations/puppet] (production); V: -1 C: -1; - https://gerrit.wikimedia.org/r/1768 [15:55:21] well, the truth is no matter what, some folks will hate the name =P [15:56:30] heh, bens gonna hate the name [15:56:32] http://rt.wikimedia.org/Ticket/Display.html?id=2200 [15:56:36] cmjohnson1: all yers ^ [15:57:06] thx....only one issue that is not really an issue...no ability for redundant power [15:57:27] yea, we knew that could happen ordering them [15:57:35] so the secondary power supply you should just slightly unseat [15:57:43] so it still blocks airflow, but doesnt detect in system [15:57:47] so it wont error for it not having power [15:57:56] RobH: I don't blame you, there's no good name for these thinsg [15:58:18] !log torrus back, took forever to recompile [15:58:19] Logged the message, RobH [15:58:40] Why not name them trucker handles? 
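
The retry tweak merged above as change 1767 is only described in Dzahn's truncated review comment. Purely for context, here is a minimal sketch of the Nagios service directives that express that behaviour: check every 15 minutes while OK, but once a failure puts the service in a SOFT state, re-check every 5 minutes until it recovers or goes HARD. This is not the actual patchset; the template, host and command names are assumptions, and the intervals are in Nagios time units (60 seconds by default).

    define service {
        use                     generic-service   ; assumed template name
        host_name               spence            ; the check alerts for spence in the log
        service_description     check_job_queue
        check_command           check_job_queue   ; assumed command name
        normal_check_interval   15   ; poll every 15 minutes while the service is OK
        retry_check_interval    5    ; after a SOFT failure, re-check every 5 minutes
        max_check_attempts      3    ; SOFT failures allowed before the state goes HARD
    }

With values like these, a single transient job-queue spike stays a SOFT problem, while a backlog that survives two more retries goes HARD and notifies.
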
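Similarly, change 1768 ("class to install Apache Maven") gets a -1 above for the typo "ensure => lastest". A rough sketch of what such a class presumably boils down to, with the typo fixed; the package and node names are guesses, not the real manifest.

    # Not the actual contents of https://gerrit.wikimedia.org/r/1768.
    # "lastest" is not a value Puppet's package type accepts; it has to be
    # "latest" (or "installed"/"present", or an explicit version string).
    class maven {
        package { 'maven2':        # package name is an assumption (Maven 2 era)
            ensure => latest,
        }
    }

    # Change 1769 ("Add Apache Maven to gallium") would then presumably be a
    # one-liner in the node definition:
    node 'gallium.wikimedia.org' {  # node name is an assumption
        include maven
    }
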
hooch, snowman(smokey and the bandit), etc [15:59:10] lol [15:59:25] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [15:59:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [15:59:30] cmjohnson1: making me lugh is mean, i have a cold and i just nearly passed out coughing [15:59:31] ;p [16:00:22] sorry bout that! [16:01:01] you could name em after country n western singers [16:01:08] then I could kill you in your sleep for doing that [16:01:21] http://home.earthlink.net/~roygbvgw/namsfd.html#funny%20hillbilly%20names [16:04:07] * apergos is installing macos on their macbook and cursing every minute of it [16:04:30] awesome [16:04:56] yeah real awesome [16:05:18] * apergos curses alternatively: juniper, apple, matshita, j schiller and fedora [16:05:19] if you're going to use a macbook at least run windows7 on it! [16:05:21] oh and lawyers [16:07:52] Why are you installing MacOS? [16:12:41] cause the stupid juniper courses have a platform that won't work with linux [16:12:45] nor with wine [16:14:51] robh: can you look at this when you get a chance and get back to me http://rt.wikimedia.org/Ticket/Display.html?id=2193 [16:15:21] checkin [16:16:32] cmjohnson1: updated, the output looks like the hdd died, and that server is under warranty until next month =] [16:16:43] so good timing for it to die now, rahter than in 60 days =] [16:17:01] right! okay, i will call on that this week and get a new HDD [16:17:03] chekcing the ilom too [16:17:25] bah, no lom errors, only the OS error for bad sectors [16:17:34] so the drive isnt fully dead (the motor works) but its dying [16:17:41] dell will prolly ask about that [16:17:50] (i went into the drac and did racadm getsel) [16:18:03] if its a full hdd death, it shows there, bad sectors really only show in os logs not drac logs [16:18:06] (just fyi) [16:52:37] * jeremyb wonders if someone could kill the extra nagios-wm [16:53:05] is it puppetized or otherwise in version control? [16:53:18] * jeremyb would put in a lock file ;-) [16:53:20] I believe it is [16:55:41] oooh, looks like c series is in (above) [16:58:50] apergos: who is matshita? [17:00:28] dvd drive manufacturer [17:00:34] RobH: ya'll know that swift has some stuff built in to detect dying spindles? [17:01:01] i was told by ben yea, it directly handles all disks [17:01:29] jeremyb: the new c series sint ordered, there is a bit of confusion [17:01:32] so its not in ;] [17:01:45] RobH: i thought that was the shinyness! [17:01:59] or maybe it was just that it's really good to watch the swift logs... i can't remember how it presents [17:02:05] we got in two swift frontends, and some other high performance misc servers for future use [17:02:15] but the c series is a storage brick, so its not in yet [17:02:25] a real ms not an msfe [17:05:10] yep [17:10:17] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:10:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:17:53] * jeremyb starts a petition to shoot nagios-wm in the head [17:18:16] * apergos signs that petition [17:18:21] right after the onefor shooting solaris [17:18:30] for defenestration of dataset1 [17:18:36] and a few others... 
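
The dying-disk triage RobH walks cmjohnson1 through above (RT 2193) is worth condensing: a totally dead drive shows up in the DRAC's system event log, while bad sectors on a drive whose motor still spins usually only appear in the OS logs. A sketch of that check, not a verbatim session; the SMART step is an addition not mentioned in the log, and /dev/sda is a placeholder.

    # On the DRAC (racadm getsel is the command RobH mentions): a fully dead
    # drive or a controller fault lands in the system event log.
    racadm getsel

    # On the host itself: bad sectors normally only surface in the kernel/OS
    # logs, not in the SEL.
    dmesg | grep -i 'i/o error'
    grep -i 'bad sector\|medium error' /var/log/syslog

    # Extra step (not in the log): ask the drive directly via SMART.
    smartctl -H -A /dev/sda
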
[17:18:54] hah, had to relookup defenestration [17:19:02] ;-) [17:19:31] i remember reading about a human defenestration on enwp in the eastern bloc [17:20:41] Jeff_Green: storage3 has been repaired http://rt.wikimedia.org/Ticket/Display.html?id=2161 [17:20:47] i'm thinking [[Jan Masaryk]] [17:21:18] cmjohnson1 - thks! [17:21:19] cmjohnson1: great, thank you! [17:21:32] apergos: you know i meant the dupe bot, right? [17:21:41] placing bets as to the # of days before it drops yet another drive [17:21:42] yes I do [17:21:57] but puppet and spence could both go into the queue of shoot now ask questions later [17:22:05] heh [17:22:07] and so could nagios the way it's set up now [17:24:13] New patchset: Jgreen; "puppetizing fundraising jenkins maintenance cron (oh the irony) scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:32:34] New patchset: Jgreen; "puppetizing fundraising jenkins maintenance cron (oh the irony) scripts typofix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:33:25] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1770 [17:33:26] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1770 [17:36:35] notpeter: you around today/yet? [17:37:09] notpeter: i.e. want to talk search in #-labs? [17:39:40] sure [17:40:59] del [17:41:01] ergh [17:49:17] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [17:49:18] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [17:57:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [17:57:09] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [18:03:36] huh, i am remembering to eat before 3pm... [18:17:48] New patchset: Jgreen; "fundraising mail config for aluminium/grosley" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:25:54] New patchset: Dzahn; "give sudo access to khorn on grosley/aluminium per RT 2196" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1772 [18:26:27] New patchset: Jgreen; "fundraising mail config for aluminium/grosley (typofix)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:28:11] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1771 [18:28:12] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1771 [18:28:27] today is day of amended commits . . . :-( [18:30:57] "no comment" is fun [18:31:34] unfortunately you also see that when in fact there have been comments (if they are just inline comments) [18:32:07] New review: Dzahn; "approved by woosters" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1772 [18:32:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1772 [18:40:16] argh [18:40:21] why wont juniper lemme register for a class [18:40:34] what's breaking rob ? [18:40:41] didn't we go over this? because you're on linux [18:40:54] i click register on the class page, nothing happens [18:40:56] i run os x [18:44:30] you can call up their support - 1-888-314-5822 (choose the customer care option) [18:44:48] register from linux :-P [18:44:53] I was able to make that work at least... 
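
The fundraising Jenkins maintenance crons puppetized above (change 1770, amended twice) are not shown in the log. Purely as illustration, "puppetizing a maintenance cron" usually comes down to a cron resource along these lines; every name, path and time here is made up.

    # Generic illustration only, not the contents of
    # https://gerrit.wikimedia.org/r/1770.
    cron { 'fundraising_jenkins_maintenance':
        ensure  => present,
        user    => 'jenkins',                                # assumed user
        command => '/usr/local/bin/jenkins-maintenance.sh',  # assumed script path
        hour    => 3,
        minute  => 15,
    }
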
[18:45:09] * apergos now installing 9 more mac os updates... [18:45:12] someday it will be done [18:45:19] it may be who i chose, when i click register in chrome it pulls up the training center [18:45:24] rather than stay in juniper pages. [18:45:26] sigh [18:46:58] oh in chrome [18:47:01] try ff [18:47:12] yea, its owrking now [18:47:14] Fx* :P [18:47:23] but pulls up the vendor, not juniper when i pull up class from link to register [18:47:39] hrm, i wonder if that's new with all their third party trainers ? [18:48:06] yea, seems to be, i just tried [18:48:07] another one [18:48:28] https://en.wikipedia.org/wiki/Firefox#cite_ref-25 [18:54:56] yay, got it [18:55:11] so junipers page on their schedule was wrong, but i signed up for the entire certification course in a two day [18:55:14] feb 20,21 [18:55:16] cool [18:55:17] :) [18:55:27] thanks for helping me pick courses earlier =] [18:55:42] no prob [18:56:53] so the credits have to be booked before feb [18:57:08] so i can possibly register for a more advanced course when i finish with this [18:57:13] yep, but can be used whenever [18:57:18] LeslieCarr: is that right, its just book by feb, book for whenever [18:57:19] cool [18:58:06] not sure if anything more advanced would be helpful yet, but will know when finished with the basic courses [20:08:28] New patchset: Hashar; "Add Apache Maven to gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [20:08:43] New patchset: Hashar; "class to install Apache Maven" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [20:19:37] robh: power disto on ps1-b4 is pretty balanced...i cannot touch anything on AC phase so this looks to be as good as it gets...take a look and lemme know [20:20:49] looks like z is under the others, but otherwise close =] [20:21:13] lookin at torrus, my proxy connection to internal vlan isnt working for some reason [20:22:00] hacking at it now to fix it, but i think yer all set [20:22:59] I can move one or two to z but there is not much room left for more growth on that rack [20:24:03] cmjohnson1: ok, i am on strip now [20:24:09] yea if ya look its 11, 10, 8 [20:24:15] Z has to come up if possible [20:24:17] right [20:24:41] but if it cannot, its not the end of the world, its technically close enough to not be in an alarm state [20:24:45] just not as nice [20:25:25] right....z is pretty full and i wanted to keep the cables close but I can move it around. 
[20:25:35] please do a bit [20:25:37] New patchset: Pyoungmeister; "adding searchidx.cfg to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:27:34] New patchset: Pyoungmeister; "adding searchidx.cfg to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:28:00] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1773 [20:28:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1773 [20:37:30] New patchset: Hashar; "integration: make homepage URLs relative" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1774 [20:43:06] RECOVERY - ps1-d3-pmtpa-infeed-load-tower-A-phase-Z on ps1-d3-pmtpa is OK: ps1-d3-pmtpa-infeed-load-tower-A-phase-Z OK - 1188 [20:43:07] RECOVERY - ps1-d3-pmtpa-infeed-load-tower-A-phase-Z on ps1-d3-pmtpa is OK: ps1-d3-pmtpa-infeed-load-tower-A-phase-Z OK - 1188 [20:49:36] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:49:37] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:04] notpeter: ping? [20:53:16] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:17] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [20:53:30] that looks bad... [20:53:42] yeah [20:53:58] notpeter: when you're done firefighting pls stop by #-labs again :) [20:54:39] kk [20:55:12] jeremyb: it wouldn't look that bad if there was corrent amount of nagios-wm :P [20:55:28] http://ganglia.wikimedia.org/ [20:55:30] correct* [20:55:34] yes [20:55:35] :) [20:55:45] search, scalers, pdf, api out [20:56:23] hi, who can help me setup a simple filter to track NARA related images on the image squids for a couple of days? [20:56:32] mid outage at the moment [20:56:39] so noone at the moment ;] [20:56:49] oh, that's all [20:57:01] but nagios isn't showing anything broken... [20:57:08] hrm [20:57:10] I trust ganglia more than nag [20:57:11] drdee: in the mean time you want to debug ssh in #-labs? [20:57:25] oh, yeah [20:57:28] but that's a data point [20:57:41] another datapoint is RobH's upload failed [20:57:43] that's a weird collection of things [20:57:51] is an lvs box down? [20:57:57] commonist uses api i would think [20:58:01] I expect so [20:58:18] I guess [20:58:19] grrr [20:58:24] jeremyb: sure, how can I help? [20:58:32] ok, I am not exactly sure how to fix the api issue [20:58:33] which lvs are thos on? [20:58:33] no response now from ganglia. [20:58:51] drdee: i just poked you in #wikimedia-labs [20:59:07] sorry didn't see it [20:59:14] np :) [20:59:27] apergos: lvs3, on it now [20:59:31] ok [21:00:09] well, that has the api pool, maybe wrong... [21:00:32] lvs4 is the active one [21:01:33] now even ganglia's broken [21:01:41] cuz everone went on it. [21:01:48] so fragile [21:01:55] =[ [21:02:27] so the lvs isnt even pushing connections to the api cluster [21:02:30] they are all at 0 [21:02:53] Max concurrent service checks (64) has been reached. (nagios) [21:02:56] so that's probably that [21:03:03] RobH: I can try checking the lvs ? [21:03:11] the lvs shows the api servers up and pooled [21:03:24] I dunno why its not actually sending them traffic, cuz pybal thinks they are up [21:03:43] restart it [21:03:49] lvs3 has the loopback for api.svc.pmtpa.wmnet, but lvs4 doesn't [21:04:08] binasher: it needs the loopback to function? 
[21:04:15] yes [21:04:25] did puppet just run and strip it out or something? =/ [21:05:17] loopback address are present on lvs3 [21:05:17] i see the same loopback interfaces on both, i must not be chekcing right place [21:05:22] hmm pybal got restarted [21:05:39] where do you list the loopbacks for it? (notpeter?) [21:05:47] ip addr [21:06:06] ahhh [21:06:15] i see what you guys are talking about, wtf made it go away =P [21:06:29] i have not run anything, but would puppet add it back if run? [21:07:32] (I got the timeout in commonist and figured it was the app, not the cluster, heh) [21:08:16] !log ran ip addr add 10.2.1.22/32 label "lo:LVS" dev lo on lvs4 [21:08:17] Logged the message, Master [21:08:50] !log that fixed it. but how did that happen? [21:08:51] Logged the message, Master [21:09:04] API is reported back up [21:09:18] that.. was crazy. [21:09:27] that was muy crazy [21:09:32] wtf removed it? [21:10:51] puppet running also prolly would have fixed it, cuz its included in its puppet config [21:11:02] if i am reading it right that is. [21:11:19] stuff's been out for more than 25 mins acccordin to ganglia though [21:11:49] tougher outage to catch right away [21:11:52] need to go meet a friend for lunch, will help hunt down what happened if its still a mystery when i get back [21:11:59] less folks bitch about api downtime [21:12:02] yeah just wondering if puppet really would have got it [21:12:15] me too, i didnt wanna run it while asher was working on it though [21:12:26] binasher: pretty sure its still gonna be a mystery, have a good lunch ;] [21:12:48] apergos: what was exact time of it? [21:12:58] I can't load ganglia atm... [21:13:09] around 20:40 utc [21:13:40] i think spence is overloaded right now [21:14:03] that wil be anotehr reason nagios didn't have much to say [21:14:15] gotta split that off onto its own host [21:14:36] Jan 3 19:48:04 lvs4 puppet-agent[20518]: (/Stage[main]/Lvs::Balancer/File[/etc/pybal/pybal.conf]/content) content changed [21:14:40] that's odd [21:14:49] ganglia tended to break sometimes when tons of folks hit in the past, let alone now with spence being far too burdened [21:16:03] those ips arent in the pybal.conf through [21:16:24] they are listed in the main site.pp, but they may be assigned via pybal.conf..... still diggin [21:16:52] who's on spence mgmt right now ? [21:16:56] hrmm, nope, site.pp also has the info to tag the interface with the info [21:17:03] LeslieCarr: not I, you getting com2 error? [21:17:12] yep, saying already in use [21:17:19] ιτ λιεσ [21:17:21] er [21:17:22] it lies [21:17:23] if someone stayed on it till timeout, its just an error [21:17:34] they hit timeout, are forced out, but the port doesnt get freed up [21:17:39] known issue on drac/5 [21:17:41] lame [21:17:44] only fix is racadm racreset [21:17:46] any good way to free the port ? [21:17:48] and wait for drac to come back [21:17:52] okay [21:17:58] (it wont affect actual server, just the lights out manager) [21:18:11] !log resetting DRAC 5 on spence for management connectivity [21:18:12] Logged the message, Mistress of the network gear. [21:18:45] hrmm, they havent confirmed my training dates, they said it takes up to 24 hours... lame. 
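
The fix binasher logs above comes down to re-binding the service address on the loopback of the active LVS host. Condensed into a check-and-repair pair of commands, using the address from the log (10.2.1.22 is api.svc.pmtpa.wmnet in this incident); whether a Puppet run would have restored it by itself is exactly the open question in the log.

    # Is the service IP actually bound to lo on the active balancer (lvs4)?
    ip addr show dev lo | grep 10.2.1.22

    # If not, re-add it, as was done at 21:08:
    ip addr add 10.2.1.22/32 label "lo:LVS" dev lo
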
[21:19:01] oh, nvm, new pybal.conf is pushed out each run [21:19:05] yeah you'll get an email pretty soon though [21:19:17] i guess the training folks manually review them [21:19:23] and ensure its legit and they are giving the class [21:19:53] so basically i signed up on the training vendor site, and in payment (linked to that site when hitting register on juniper site) they ask for the juniper credit code like normal [21:19:56] and… spence is giving me nada [21:20:01] so seems legit, just has human interaction [21:20:05] LeslieCarr: on the serial console? [21:20:05] cmjohnson1: are you onsite ? [21:20:09] yep RobH [21:20:14] we can reboot it remotely [21:20:15] lesliecarr yes [21:20:18] if needed, lemme look [21:20:37] LeslieCarr: can you kick off port for a moment? [21:20:59] for the record, spence is an r300 [21:21:02] which sucks. [21:21:19] luckily, we just purchased 6 more shiny high performence servers for sdtpa, chris is working on racking them today =] [21:21:24] yay [21:21:39] RobH what's the disconnect for this one again ? [21:21:47] since it's diff than the other ones [21:21:51] ctrol + \ [21:21:55] nm [21:21:56] got it [21:21:57] hehe [21:22:01] off [21:22:13] connect: com2 port is currently in use [21:22:16] heh, have to reset it again [21:22:23] resetting it now [21:22:36] dumb drac5 [21:23:20] LeslieCarr: you have done power cycles on dells right? [21:23:34] if not, you can do this instead of me, i just wanted to check it out first [21:23:42] yeah, racadm servaction hardreset ? [21:23:57] racadm serveraction powercycle, i confirm its just dead [21:24:11] so i am going to powercycle it now (since you have done it) [21:24:18] cool :) [21:24:20] !log spence is unresponsive to ssh and serial console, rebooting [21:24:21] Logged the message, RobH [21:25:02] lets see if it comes back =] [21:25:18] fingers crossed… [21:27:44] hrmm, booting [21:27:49] in os load now [21:28:13] !log nagios and ganglia down due to spence reboot, system still coming back online [21:28:14] Logged the message, RobH [21:28:23] 'restoring RRDs' [21:28:29] this always takes awhile =[ [21:30:35] poor spence. [21:30:41] -rw-r--r-- 1 root root 2499 2012-01-03 20:33 boot.log [21:30:44] its still struggling online [21:30:47] lvs4 just fell over [21:30:54] and the lo didn't back up properly... [21:31:01] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s) [21:31:02] ? [21:31:05] *come back up [21:31:14] there is a bootlog from today [21:31:24] from about an hour ago [21:31:30] hrmm [21:31:44] that would explain it, but why didnt it keep the lo info [21:31:45] beds were shat [21:31:56] oh, yeah, I mean, that's lame too [21:32:28] but that's how it lost it in the first place [21:33:17] i don't see that loopback in /etc/network/interfaces in either lvs [21:33:21] which would explain the loss ? [21:33:35] oh well it's added to lvs4 (is that new?) [21:33:37] but not lvs3 [21:34:00] oh no it's not, nevermind me, not on either [21:35:58] LeslieCarr: yea it confused the shit out of me too ;] [21:36:27] spence is stuck on Starting Ganglia Monitor Meta-Daemon: gmetad. [21:36:42] i think it went right back to being overloaded already [21:37:00] nagios is back up, as apache is working, but ganglia is borked [21:37:11] server is also still insanely slow [21:37:21] still doesnt ssh for me. [21:37:42] kill ganglia for now ? not as important as nagios ? 
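
The DRAC 5 recovery dance spread across the conversation above, collected in one place. The commands are the ones given in the log; the management hostname is a placeholder.

    # SSH to the box's DRAC; HOST.mgmt is a placeholder, not a real address.
    ssh root@HOST.mgmt

    # "connect com2" refusing with "port is currently in use" after a stale
    # session timed out is the known DRAC 5 bug; per the log the only fix is
    # resetting the controller itself (this does not touch the running server):
    racadm racreset
    # ...wait for the DRAC to come back, then reconnect...

    # If the server itself is wedged (no ssh, dead serial console), hard
    # power-cycle it:
    racadm serveraction powercycle
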
[21:38:05] cannot boot it to a command prompt in normal mode so far [21:38:07] and i guess that "get ganglia puppetized and running on another server" ticket is suddenly a do now [21:38:36] my serial console went unresponsive as well [21:40:00] .... i am not sure rebooting it again is going to fix anything. [21:40:09] it gets overloaded right away [21:40:11] single user mode? [21:40:41] racresetting again, com2 in use =P [21:41:05] its kind of amusing that our monitoring server was able to send out the outage page, then it died [21:41:12] atleast it felt obligated to warn us ;] [21:41:57] !log resetting spence and dropping to serial to try to fix it [21:41:58] Logged the message, RobH [21:42:42] heh, 4gb memory. [21:42:49] this server is slow. [21:43:02] single cpu dual core 3ghz, 4gb ram [21:43:13] robh: anything I can do from here? [21:43:25] cmjohnson1: not yet [21:43:46] as long as the serial console works, we are good, its booting again, damn it it didnt take my grub interrupt =P [21:44:34] ok, goign to let it try to normal boot once more, if it fails again I try to do single user again... [21:44:44] if my serial commands dont work, may need you cmjohnson1 to hook up console [21:47:55] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s) [21:48:31] !log spence back online, ganglia and nagios confirmed operational [21:48:32] Logged the message, RobH [21:48:38] it made it that time, odddddd [21:49:09] !log ganglia graphs will have missing data for past 30 to 40 minutes [21:49:10] Logged the message, RobH [21:49:31] cmjohnson1: we wont need the crash cart for spence after all, for now atleast ;] [21:49:40] Reedy: have a time? [21:49:51] It's 21:49 [21:49:52] alright...i will wheel it back to it's corner [21:49:54] :) [21:49:59] I meant, if you aren't busy [21:50:10] just on a phone call [21:50:40] ok, folks from -ops wanted to change the channel in downtime notice on wmf sites from #wikipedia to #wikimedia-downtime [21:50:56] can you do that or who can? [21:51:05] where? [21:51:44] when there is error in web server there is message that we are having some problems etc... visit #wikipedia on freenode for more... [21:51:54] something like that [21:51:59] they wanted to change the channel [21:52:18] because #wikipedia is getting spammed everytime when something is down [21:52:44] you know what I talk about? [21:53:26] the channel people should drop in on is #wikimedia-tech [21:53:30] i thought it read #wikimedia-tech, not wikipedia? [21:53:36] petan: ^ [21:53:39] they say it doesn't [21:53:49] in a 500 [21:53:59] i'm trying to find a 500... [21:53:59] another problem was that -tech is not controlled by any irc operators (only few people with +o there) [21:54:27] you don't have to be op to set the topic [21:54:28] (shhhh) [21:54:37] + they like that channel to be rather silent so that they can see what's going on, if 300 people come there I don't know if anyone would know what is going on [21:54:53] that's another problem [21:55:05] we expect people to show up in -tech, to ask qs and to get updated [21:55:07] when there is downtime we need to set +t [21:55:08] we have locked the topic in the past, it only results in folks never updating it. 
[21:55:09] we don't do our work in there [21:55:16] it's for updates [21:55:19] because there are too many trolling people messing with topic in that time [21:55:27] we really haven't had that happen [21:55:36] because there is #wikipedia in that message [21:55:46] if there was tech it probably would happen [21:55:49] no, I mean when the channel is full of folks coming in and asking [21:55:56] any ideas how to get a 500? [21:56:00] on cluster [21:56:00] we don't have people randomly resetting the topic [21:56:19] believe me, if that channel was in that message you would have many people doing that there [21:56:28] https://bugzilla.wikimedia.org/29599 don't work anymore :-P [21:56:36] we'll deal with it then [21:56:46] bah, closed wrong window [21:56:53] right, so you definitely want to have -tech there? [21:56:54] welcome [21:56:57] jeremyb: nothing comes to mind but my minds says it's almost midnight too and I forgot to eat dinner >_< [21:57:10] yes, wikimedia-tech [21:57:13] we dont want folks coming in operations when there is an outage ;] [21:57:24] ok, so can you change it from current name? [21:57:25] apergos: i just had lunch aka breakfast at 3:50pm! [21:57:28] nice job [22:02:44] i'm going to work on the ganglia server again and try to get the new one working... [22:02:53] first starting in labs [22:02:56] good luck [22:34:11] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [22:38:25] LeslieCarr: from way earlier, the vip loopbacks are created on lvs servers via the /etc/network/if-up.d/wikimedia-lvs-realserver script. everything under if-up.d/ is executed every time an interface is brought up, though that one is a no-op unless $IFACE = lo [22:42:47] RobH: [22:42:51] spence is flipped out again [22:46:05] i cannot even login to spence =P [22:50:28] !log stopping and then starting apache2 on spence to try and lower load [22:50:29] Logged the message, Mistress of the network gear. [23:34:23] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:34:24] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:42:13] * jeremyb stabs spence. you were multiply rebooted. why do you have 2 nagios-wm still?! [23:46:53] does anyone have an idea why dhcp-server on brewster won't actually hand out a lease to the new nickel, even though it is receiving the request and is running ? [23:50:20] * jeremyb pokes notpeter [23:50:35] LeslieCarr: what's the log say? [23:50:50] jeremyb: nothing - it's like it didn't see the request [23:50:57] althoguh the tcpdump says otherwise [23:51:10] tcpdump both sides? [23:51:14] yeah [23:51:29] the server side dumped from the router and brewster dumped on itself [23:51:47] why not just dump right on nickel? [23:52:27] because it needs an os to do that [23:52:38] and i'm trying to pxe boot it [23:52:46] which log are you looking at? [23:52:47] oh [23:52:50] brand new machine [23:53:03] well, try from a livecd and see what happens and then try the PXE again [23:53:05] /var/log/messages on brewster [23:53:11] or you're remote... [23:53:19] sadly can't livecd - it's in the dc, i'm in the office in sf [23:53:27] for remote connection you need a working network :P [23:53:43] ah [23:54:03] LeslieCarr: shut/no shut? [23:54:15] the port ? 
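
binasher's explanation above is also the key to how lvs4 lost the VIP without /etc/network/interfaces ever mentioning it: the address only comes back when loopback is brought up through ifupdown. A hypothetical sketch of a hook in the spirit of wikimedia-lvs-realserver, not the real script; the single VIP listed is just the address from today's outage.

    #!/bin/sh
    # Hypothetical stand-in for /etc/network/if-up.d/wikimedia-lvs-realserver.
    # ifupdown runs every executable in if-up.d/ whenever any interface comes
    # up, passing the interface name in $IFACE, so bail out unless it is lo.
    [ "$IFACE" = "lo" ] || exit 0

    # Bind the LVS service address(es) to lo so the box accepts traffic for
    # the VIP; a /32 on loopback is never announced or ARPed on the wire.
    # 10.2.1.22 is api.svc.pmtpa.wmnet in today's incident; a real script
    # would carry the full list of service IPs.
    for VIP in 10.2.1.22; do
        ip addr add "${VIP}/32" label "lo:LVS" dev lo 2>/dev/null || true
    done

That is consistent with what was pieced together earlier: lvs4 had rebooted (the fresh boot.log) and lo "didn't come back up properly", so the hook never re-added the address until it was done by hand.
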
[23:54:23] yeah on juniper [23:54:29] it's up and in the proper vlan [23:54:34] idk the juniper syntax [23:54:59] since the dhcp server is actually seeing the request on its port [23:55:02] yeah but just do `shutdown` wait 5 secs and then `no shutdown` [23:55:37] (but make sure you have it right for juniper... don't want to take the whole switch down!) [23:55:48] haha yeah, if there's one thing i know, it's juniper syntax [23:55:53] * jeremyb reads http://debianclusters.org/index.php/Troubleshooting_DHCP [23:56:25] can you take some other machine down and try to PXE it? [23:56:46] ahha [23:56:53] var/log/syslog had the info (thanks for the link) [23:57:00] it thinks it has no free leases [23:57:08] which is strange [23:57:19] yeah, i was wondering if it was the wrong log [23:59:01] * jeremyb pokes notpeter [23:59:35] ahha interesting, figured it out, for some reason it's not getting the dns entry