[00:00:02] rdns?
[00:00:17] hey I need to go, really tired but if you found it out jeremyb ping me or something, or just install it yourself
[00:00:27] yeah
[00:00:30] nacht
[00:00:37] nacht? :)
[00:01:14] good luck Leslie ;)
[00:01:36] btw brewster is debian? wow
[00:01:49] I thought we have ubuntu almost on all boxes
[00:01:55] ubuntu is debian!
[00:01:58] tell me in czech?
[00:01:59] ;)
[00:02:05] I have debian
[00:02:18] I don't really like ubuntu for some reasons :D
[00:07:05] New patchset: Lcarr; "Adding in gangliaweb class and putting on nickel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1775
[00:07:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1775
[00:07:25] can i get a review plz ?
[00:11:52] New review: Petrb; "looks good, but someone else must approve :|" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1775
[00:12:03] :o
[00:12:57] it's so short! i expected something longer
[00:13:05] (i even fetched to use git locally)
[00:14:47] hehe i have a big change planned but i just need to get ganglia working
[00:15:18] what happens when you get to an element that no one can spell or pronounce?
[00:15:37] we start naming after cats
[00:15:51] alias it to its element code
[00:16:00] ssh Ni
[00:16:02] haha
[00:16:08] ssh C
[00:51:22] LeslieCarr: here?
[00:51:27] yep
[00:51:53] could you look at an RT ticket for me ?
[00:52:12] and point me to the right person for it: https://rt.wikimedia.org/Ticket/Display.html?id=2190
[00:52:24] (or take care of it :) )
[00:52:35] looking
[00:52:57] hrm... maybe just kick ryan
[00:53:04] ryan's not in today
[00:53:06] :(
[00:53:16] well, I tried :P
[00:53:34] maybe put tim's approval in the ticket itself ?
[00:54:00] i do have the ability to add the updated package, but want to make sure it has been tested first
[00:54:07] sure
[00:54:17] I'll ask Tim to look at it
[00:54:31] cool :)
[01:01:37] Reedy: https://www.mediawiki.org/wiki/Special:LinkSearch
[01:01:45] getLanguage isn't in 1.18 Reedy
[01:02:03] johnduhart: i think that's being discussed in #wikimedia-tech ?
[01:02:17] ah
[01:02:23] god damn duplicate channels
[01:02:59] agreedish :)
[01:03:39] johnduhart, fsck duplication
[01:03:51] how many get lang functions do we have in phase31?
[01:03:57] phase3!?
[01:04:25] getLang was deprecated in 1.19
[01:04:28] I know
[01:04:31] getLanguage is the new one
[01:04:34] heh
[01:04:40] I was attempting to push getLanguage into 1.18wmf1 for ease
[01:04:49] irony
[01:05:00] Exactly
[01:05:10] Probably means it's possibly broken in 1.18 also
[01:05:19] so keep the getLanguage function and just make it point to getLang for 1.18
[01:05:20] (i attempted to do the same thing)
[01:05:25] Saw that as well
[01:09:44] Seems to be working now
[01:09:54] where's the log bot though
[01:10:16] oh, again, duplicate channels
[01:10:27] Thank you Reedy
[01:26:40] New patchset: Lcarr; "Creating ganglia frontend class, gmetad.conf added" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1776
[01:26:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1776
[01:32:00] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1776
[01:34:14] New review: Lcarr; "adding in because accidentally put this in requirements" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1775
[01:34:15] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1776
[01:34:15] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1775
[02:56:21] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[03:04:51] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Jan 4 03:04:41 UTC 2012
[04:23:09] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:23:39] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:37:44] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[04:42:24] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2638*
[04:43:51] i don't have the slightest clue how to interpret that alert :(
[04:45:21] i suppose enwiki probably has decent coverage of phases
[04:47:44] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=ps1-b5-sdtpa seems to just have the 3 phases monitored. does it matter if they are monitored in relation to each other?
[04:50:04] what units is it reporting and what are the specs for the device? e.g. ideal range
[05:01:54] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2350
[07:20:21] New patchset: tstarling; "Updated IP address for upload-lb on ms6.esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1777
[07:20:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1777
[07:20:51] New review: tstarling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1777
[07:20:52] Change merged: tstarling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1777
[09:43:36] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:57:56] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 441484 MB (3% inode=99%):
[09:59:46] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 435564 MB (3% inode=99%):
[10:08:36] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:16:26] jeremyb: the wiki on search-test is looking fine. only it has loose html tags - can you install tidy to clean these up - e.g. bottom of /w/index.php/Italic
[11:37:41] Morning ops folks
[11:37:49] Could someone merge and deploy https://gerrit.wikimedia.org/r/#change,1691 please?
[11:38:07] It looks like logrotate isn't just not rotating the log file, it's actually destroying data now
[11:42:42] why is there not an olddir?
[11:42:55] ie where did archive go?
[11:43:21] It never existed
[11:43:24] well, I just created it
[11:43:40] I didn't know logrotate would fail that badly when given a nonexistent olddir
[11:43:50] At first it was just not rotating the log file at all
[11:43:50] there sure used to be one
[11:43:56] and we want it
[11:43:59] Now it's actually truncating it
[11:44:08] I created the archive dir just now, it didn't use to exist
[11:44:13] /var/log/aft/archive that is
[11:44:22] Maybe it existed on locke, but not on emery
[11:44:38] /var/log/aft/archive?
[11:44:44] Also, lack of an olddir line will just cause /var/log/aft to contain all the old aft files
[11:44:47] the diff I'm looking at is for /home/wikipedia/logs/archive
[11:45:03] Oh, crap
[11:45:07] Different diff altogether
[11:45:14] Then where is my commit removing that other olddir
[11:45:29] Oh, wait
[11:45:37] Maybe that was it and I touched the wrong file?
[11:45:53] Yup that's it
[11:45:55] OK that's embarrassing
[11:46:16] no worries
[11:46:45] * RoanKattouw amends
[11:47:11] is emery tight on space (i.e. do we not want archives)?
[11:48:07] /dev/md0 1.4T 455G 845G 35% /
[11:48:25] hmm
[11:48:49] all right, they can always figure it out later if they want to keep em longer
[11:49:12] New patchset: Catrope; "Logrotate doesn't work with a missing olddir." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1691
[11:50:28] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1691
[11:50:35] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1691
[11:50:36] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1691
[11:50:51] I always forget and hit the "publish comments" button :-P
[11:51:55] Well thanks for merging that
[11:52:10] just a sec
[11:52:48] now it's merged (i.e. deployed)
[11:53:38] Thanks
[11:53:41] and now it's time to cook lunch (2pm!)
[11:53:42] yw
[11:53:54] need anything instantly? otherwise I am afk for a little bit
[11:54:33] No, go eat
[11:54:41] k biab
[11:54:45] This was just to make sure logrotate doesn't screw me over again tomorrow (at 06:36 UTC)
[11:54:47] *06:33
[12:12:59] that would be a sucky time for sdrewage
[12:13:02] *screwage
[12:13:15] I have to see what kind of greens these are before I can figure out how to cook them :-/
[12:16:16] Meh, you've made me hungry too
[12:28:39] excellent
[12:28:44] * apergos rubs hands gleefully
[12:28:50] my evil plan is working...
[12:28:57] (food is cooking now)
[14:39:33] apergos: ping
[14:39:45] hexmode: pong (with a bit o english spin on it)
[14:39:51] heh
[14:40:51] what's up?
[14:40:55] apergos: I was wanting to use labs to set up a test server for 1.19 deploy for next week. I know labs is Ryan, but I think I want to use a dump or three. do you know if there is an easy way to do that in labs?
[14:41:08] I have no idea
[14:41:16] unfortunately that is absolutely ryan
[14:41:26] I don't even have the buttons over there to set someone up
[14:41:32] well, it *should* be in there... and I'll tell ryan :)
[14:41:43] ok
[14:41:45] sorry, but you were here and he wasn't ;)
[14:42:05] if you are thinking of importing a dump to create a local mirror of some project, I can certainly give you pointers if you get stuck
[14:42:15] but if you mean something else I probably can't do much for ya
[14:43:04] apergos: pointers? you have any off the top of your head? otherwise I'll wait till I get stuck ;)
[14:43:41] if it's a smallish wiki you can just use importDump and rebuildall and be a happy camper, don't forget to read in the subsidiary tables anyways, I think rebuild doesn't do them all
[14:44:20] if it's large then you have to play with do you want mwdumper, where is a copy that's not busted and works with the current schema or can you build a new one and why won't ant build it correctly, etc
[14:44:33] (but I used some perl script after updating it a bit and it worked fine, I think it's on meta)
[14:45:08] large -> you want to use a script to convert the xml to sql, then shove it in via mysql
[14:45:12] then dump in the other tables
[14:45:18] then rebuild say rc
[14:45:36] now I have never loaded up the private tables (i.e. user data)
[14:45:43] I am assuming we don't want to do that here either
[14:47:36] no, no private tables
[14:47:51] Just want to have a place for people to test
[14:48:22] could you clarify this: then rebuild say rc
[14:48:36] recent changes?
[14:48:40] yes
[14:48:47] when you import stuff the rc table is empty
[14:48:57] there's a maintenance script
[14:49:00] k
[14:49:21] "rebuildrecentchanges.php" :-)
[15:35:07] hexmode: hey
[15:35:16] hey
[15:35:20] I saw you wanted to create a clone on labs
[15:35:28] yes
[15:35:41] I am doing one at the moment for Oren (several clones of simple wiki)
[15:35:49] so maybe I could help you a bit...
[15:35:55] petan|work: awesome
[15:36:03] I am gonna eavesdrop a bit
[15:36:12] I want to test the 1.19 release
[15:36:17] starting next week
[15:36:32] petan|work: so what do you need me to do?
[15:36:39] anyway the problem is that Ryan wanted to create dedicated server for database, so I need to talk to him before we start creating large databases
[15:36:58] yes, you want to replicate something small for now
[15:37:00] in first place we need Ryan :)
[15:37:36] because I need to sort out this db stuff with him, I wrote a small proposal but he didn't explain it enough so I don't know how it is going to be working
[15:38:04] I guess there will be separate hw for maria db where all projects will have db's
[15:38:24] hexmode: what kind of clone you want to create? how big?
[15:38:30] you could install a mariadb instance on localhost though
[15:38:50] yes we already have many sql server instances even one maria
[15:38:50] on a labs instance
[15:39:03] but Ryan wanted to replace them with dedicated server instead of vm's
[15:39:19] because of poor performance of mysql server in vm
[15:39:29] i see, just saying you can still test that way until that has been done
[15:39:39] petan|work: for now I want to have something that I can point people at and say "Test this to see if we have any obvious problems that will affect deployment"
[15:39:51] petan|work: simple would be fine for a first run
[15:39:53] yes I know that's why I need to know how big is that db going to be :)
[15:40:13] because atm we exceeded already disk space on labs, we alloced more space than we have :D
[15:40:20] allocate
[15:40:27] * d
[15:41:15] so in case it's a really big db (more than 80gb) it would be probably needed to be sorted out with Ryan
[15:41:29] but later I would like to try hiwiki and, ideally, enwiki
[15:41:35] hexmode: simple full or latest rev?
[15:41:55] petan|work: lets try latest first
[15:41:56] ok, for enwiki we definitely need Ryan, I will check size of hiwiki
[15:42:11] right and which mediawiki version? head I guess
[15:42:16] 1.19?
[15:42:48] petan|work: 1.19 will be available starting next week, but yes, head for now
[15:42:59] I have a clone of simple wiki latest rev on one instance, but it's 1.18
[15:43:08] ok
[15:43:11] 1.19 branch, that is
[15:44:22] hiwiki look ok, so hiwiki and simple wiki latest rev, right?
[15:44:50] is it both just a temporary thing or you want to make some permanent project for that?
[15:44:56] petan|work: I really need to look over release-notes, and https://www.mediawiki.org/wiki/MediaWiki_1.19 to get a better idea, but yes, for now
[15:45:27] petan|work: I'd like to be able to do this every time we start to prep for deployment
[15:45:43] so, it will be repeated with different code base
[15:45:46] ok, I will check if there is a suitable project now, if not we will need to create one
[15:46:07] for that I need someone with admin in labsconsole, like Sara, I guess Ryan isn't available now...
[15:46:09] petan|work: tyvm!
[15:48:08] mutante: what is testlabs about?
[15:48:16] that probably isn't a place for this
[15:48:37] if not we need to create a new project because there is no other project for this
[15:49:16] I know that Leslie and mutante are members of testlabs but I got no idea what is going on there, apart of that it has many instances with full storage :D
[15:49:55] petan|work: any idea how long this will take? I just want to know when I should be ready
[15:50:15] ok hexmode if it isn't urgent we will need to wait for Ryan he said he will be here today
[15:50:50] I did it wrong way and it took 2 days, if I do it correct way it would take less, I hope
[15:50:58] petan|work: want me to create a project?
[15:51:03] apergos is expert on that :)
[15:51:12] petan|work: testlabs is just like "main" or something, afaik
[15:51:22] anything not part of a special project ?
[15:51:52] mutante: if you can do that, create a project called "deploymentprep" or something like that
[15:52:09] or maybe deployment-preparation, whatever hexmode likes :)
[15:53:41] which wiki user should be admin for it?
[15:54:25] give it to me and hexmode (Petrb)
[15:55:22] !log added project deployment-prep for hexmode and petan
[15:55:23] Logged the message, Master
[15:55:29] :)
[15:55:29] hexmode: when it's created you can insert all people who need to access shell to project, it's easy
[15:55:47] there is a guide on labsconsole for that
[15:55:53] url?
[15:56:00] can you join -labs?
[15:56:07] sure 1s
[15:56:37] Failed to add hexmode to deployment-prep.
[15:56:42] hexmode: do you have a Labs account yet?
[15:56:55] yeah, good idea, lets continue there
[15:57:11] and that was the wrong log .too :p
[15:58:29] :-D
[16:00:45] np,it's wiki:)
[16:04:01] yup!
[17:03:55] robh: can you set up mgmt interfaces for new ms-fe servers (when you get a chance)
[17:04:40] racktables updated?
[17:04:43] i can do right now if so
[17:04:53] yes
[17:05:00] cool, doing now
[17:05:15] 23-30 in b4
[17:06:23] whats 'spacer'?
[17:07:08] cmjohnson1: ?
[17:07:42] cuz we dont want to put intentional space between servers in the rack
[17:07:50] and we wouldnt put an object in to designate a space
[17:07:52] there is something in the rack now, It is not a piece of equipment
[17:07:58] ?
[17:08:00] there is a metal object
[17:08:07] like a casing for something
[17:08:07] its a shelf?
[17:08:16] no
[17:08:31] i will take a pic and attach
[17:08:35] i dislike having open spaces in the middle of the rack, cool
[17:10:32] offtopic, i am really digging keepassx
[17:16:51] robh: http://rt.wikimedia.org/Ticket/Display.html?id=2200
[17:17:11] huh.. i didnt put that in
[17:17:12] pull it
[17:17:22] we dont need to move the servers, a couple of spaces wont hurt much
[17:17:31] just may rack new servers in there in the future is all
[17:17:36] so that looks like a shelf
[17:17:46] andrew put it in when he reorganized.
[17:18:07] and i will kill the racktables entry for it
[17:18:35] k..np...i have to tweak a few things anyway
[17:18:50] yea its not emergency or anything, its just useless =]
[17:18:57] and taking up a space we can use in the future
[17:19:46] we will need to replace the pdu for more than 5 or 6 servers
[17:20:44] replace or use Y cables yea
[17:20:54] it sucks, cuz now that company offers 84 port models
[17:20:58] which we use in eqiad
[17:21:16] so you can really rig a 47U rack with redundant power
[17:21:46] its half on one side, half the other, so there we would have gotten 42 ports, which would be good
[17:22:49] cuz you can see each circuit is at 12ish
[17:22:52] 12XX ish that is
[17:23:10] we can technically go to 1440 and thats our soft limit (80%) circuit capacity
[17:24:24] right...so plenty of power left, i forgot about y-cables...i don't care for using them, seems like a poor work around...one accidental pull and 2 machines go down
[17:27:28] its a horrible work around
[17:27:43] it pains me to see it done =P but its less painful than paying for power and not using it ;]
[17:27:59] any new racks in sdtpa will have the newer, far more port versions
[17:28:07] unfortunately retrofitting means full downtime
[17:29:32] cmjohnson1: added ips in the ticket for you =]
[17:32:48] robh: thx ping you when i have the new ms-fe's done
[17:42:05] !log powercycling knsq11
[17:42:06] Logged the message, Master
[17:45:35] cmjohnson1_: I am going to go ahead and create a network ticket for these servers. can you tell me what port msfe1 is in? (I assume they are in ascending order
[17:49:26] i would login to switch and check, but for some reason its not letting me
[17:50:07] basically we want to label all the ports on the switch with the server names when possible, makes tracking issues easier. so i am dropping a ticket for leslie to do that, plus tag vlans on ms-fe1 and ms-fe2
[17:50:15] in the software, label that is.
[17:55:55] !log knsq11 is broken. boots into installer, then "Dazed and confused" at hardware detection (NMI received for unknown reason 21 on CPU 0). -> RT 2206
[17:55:56] Logged the message, Master
[18:01:09] robh: what port for mgmt or network?
[18:01:15] network
[18:01:26] normally i run mgmt same way but its not labeled or setup of course
[18:01:53] don't know yet..have a question....on the network switch ports 1/2/17 are unused do you want me to use those first and then go to the remainder?
[18:01:54] on the mgmt network, only the primary mgmt switch(s) are managed (like msw1)
[18:02:03] huh.....
[18:02:13] LeslieCarr:
[18:02:25] I don't want to make decisions about your network, you about?
[18:02:28] hey
[18:02:31] ports 0-23 are in use? except 1/2/17 on asw-b4
[18:02:31] lemme read
[18:02:37] chris is wiring 8 new servers in b4
[18:02:44] and it has three open ports in the lower ranges
[18:02:55] we normally plug them in in ascending order, so we can use those ports, but i wanted to check with you
[18:03:03] i assume best to use all low ports first.
[18:03:26] if thats the case, i am making a ticket for you with all the port# and labels
[18:03:31] use them according to the standard when possible but it's more important to get them all plugged in
[18:03:35] as well as what vlan each need tagged
[18:03:45] haha i was just about to ask if you could then make sure to say which server in which port :)
[18:03:49] cool
[18:03:55] you're so on top of it :)
[18:04:00] so use lower ports? (seems industry standard to me)
[18:04:11] yeah
[18:04:15] okay and i will make note and let you both know what is where
[18:04:17] ACKNOWLEDGEMENT - Host knsq11 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2206 - hardware failure
[18:04:18] cmjohnson1_: you heard the lady ;]
[18:04:35] cmjohnson1_: cool, i have a ticket open and filled out, just lacking the port#s =]
[18:04:39] got it!
[18:05:00] cool
[18:05:34] which is the good bot: /home/wikipedia/bin/ircecho vs. /usr/ircecho/bin/ircecho
[18:05:54] no clue =[
[18:08:28] LeslieCarr: I haz a networking request
[18:08:36] okay ?
[18:08:56] 42
[18:09:01] ok, settled.
[18:09:11] when I try to pxe boot searchidx1001 it doesn't look like the dhcp request gets through, can you take a look at the interface?
[18:09:20] !log duplicate nagios-wm instances on spence (/home/wikipedia/bin/ircecho vs. /usr/ircecho/bin/ircecho) killed them both, restarted with init.d/ircecho
[18:09:20] Logged the message, Master
[18:09:24] ip is 10.65.7.100
[18:09:45] okay
[18:09:49] thank you !
[18:11:26] interesting i haven't seen any flapping, are you on the console for searchidx1001 ?
[18:12:17] not at the moment
[18:12:20] this was yesterday
[18:12:51] cool, mind if i play with it ?
[18:13:29] go for it
[18:13:32] what's up with db19,db41,db43 ? can we remove them from monitoring?
[18:13:33] arghghhh
[18:13:48] cmjohnson1_: so i closed my ticket window, so since you are getting the info you can make the ticket, i pming you with details
[18:13:48] trying to lower the number of criticals a bit again
[18:15:11] RobH - hrm so why would searchidx1001 be in port 24 instead of port 33 ? ( https://racktables.wikimedia.org/index.php?page=object&object_id=1110 )
[18:16:06] the memcached servers use SFP cables to masw-2-a5-eqiad
[18:16:10] msw even
[18:16:19] notpeter: dimm a1 is bad on this computer
[18:16:22] mc1XXX servers are badass networking wise.
[18:16:51] ah interesting
[18:16:53] so search1001 is the first server into asw-a5
[18:16:53] LeslieCarr: ah
[18:17:02] LeslieCarr: thanks!
[18:17:02] if you guys drop a ticket into eqiad for it
[18:17:07] i will call and get it replaced
[18:17:12] but please include output showing its bad
[18:17:19] or i just have to replicate your work ;]
[18:17:42] notpeter that doesn't have anything to do with the pxe boot but also worth noting :)
[18:17:42] tomorrow I will go in, and migrate the bad dimm to another slot to ensure its the dimm and not the socket
[18:18:08] LeslieCarr: do you see activity on the port?
[18:18:26] i saw it reset then i am not sure if the machine even went into pxe booting
[18:18:32] monitoring traffic on port this time around
[18:18:55] might not actually get to the pxe boot. which would explain why I didn't see it in the dhcp logs :/
[18:19:06] you dont see it get to it on serial?
[18:19:09] apergos: is .19 db compatible with .18?
[18:19:21] so when I run import dump of a dump from simple
[18:19:57] ACKNOWLEDGEMENT - Host srv191 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2193 - failed hdd
[18:20:07] RobH I resized the term window which just made everything into a mess of jumble letters
[18:20:58] https://rt.wikimedia.org/Ticket/Display.html?id=2208
[18:21:15] LeslieCarr: add more reasonable output to that?
[18:22:42] !rt is http://rt.wikimedia.org/Ticket/Display.html?id=$1
[18:22:42] Key was added!
[18:22:54] !rt 2208
[18:22:54] http://rt.wikimedia.org/Ticket/Display.html?id=2208
[18:22:58] :)
[18:23:00] off top of head anyone remember the f1 key combo for dell's mgmt ?
[18:23:16] ?
[18:23:19] esc 1
[18:23:20] bios f2
[18:23:20] I think
[18:23:28] boot menu f11
[18:23:30] pxe f12
[18:23:36] but you can tell drac to do it so you dont have to
[18:24:00] two commands, dont forget the first or it changes the option for each boot
[18:24:01] racadm config -g cfgServerInfo -o cfgServerBootOnce 1
[18:24:01] racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE
[18:24:11] you can replace PXE with BIOS
[18:24:26] that makes the next reboot, and only the next reboot due to the first command, go to whatever you specify
[18:24:28] LeslieCarr: esc + , if greater than 10, esc shift - 10
[18:24:31] or do it rob's way
[18:24:43] its a hell of a lot easier than trying to catch it during post
[18:24:58] i actually have it in a text doc
[18:25:02] so i can paste all 4 lines
[18:25:06] racadm config -g cfgServerInfo -o cfgServerBootOnce 1
[18:25:07] racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE
[18:25:07] racadm serveraction powercycle
[18:25:07] console com2
[18:25:16] and also use them in a for loop
[18:25:35] eg: for db in db{48..49}; do echo -e "racadm config -g cfgServerInfo -o cfgServerBootOnce 1\nracadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE\nracadm serveraction powercycle" | ssh root@$db.mgmt.pmtpa.wmnet; done
[18:26:08] its why i detest working on the old SM machines where you must catch them in the boot ;]
[18:26:12] thanks
[18:26:41] notpeter it wanted f1 to continue
[18:27:53] notpeter - brewster is now receiving the request but claims no free leases… i am guessing something with dns
[18:28:48] i don't see dns either, i think that's your problem
[18:29:56] LeslieCarr: cool thank you
[18:39:34] !pxe is http://wikitech.wikimedia.org/view/Dell_PowerEdge_R410#PXE_booting
[18:39:34] Key was added!
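The boot-once sequence pasted at 18:25, and the for loop that follows it, can be sketched as a small helper. This is a minimal sketch, not the script anyone in the channel actually used: the function names are hypothetical, the racadm lines are copied from the log, and the default domain is taken from the for-loop example. It only builds the command text and the ssh argument lists; it does not connect to anything.

```python
from typing import Iterable, List, Tuple


def boot_once_commands(boot_device: str = "PXE") -> List[str]:
    """Return the racadm lines quoted in the log, in the order they were pasted.

    The BootOnce line comes first so that the boot-device override applies
    to the next reboot only; the log says PXE can be swapped for BIOS.
    """
    return [
        "racadm config -g cfgServerInfo -o cfgServerBootOnce 1",
        f"racadm config -g cfgServerInfo -o cfgServerFirstBootDevice {boot_device}",
        "racadm serveraction powercycle",
    ]


def mgmt_hosts(prefix: str, first: int, last: int,
               domain: str = "mgmt.pmtpa.wmnet") -> List[str]:
    """Expand a numbered range such as db48..db49 into management hostnames."""
    return [f"{prefix}{number}.{domain}" for number in range(first, last + 1)]


def ssh_plan(hosts: Iterable[str], boot_device: str = "PXE") -> List[Tuple[List[str], str]]:
    """Pair each ssh argument list with the script that would be piped to it."""
    script = "\n".join(boot_once_commands(boot_device))
    return [(["ssh", f"root@{host}"], script) for host in hosts]


if __name__ == "__main__":
    # Dry run: print what would be sent, without contacting any host.
    for argv, script in ssh_plan(mgmt_hosts("db", 48, 49)):
        print(" ".join(argv))
        print(script)
```

Keeping the plan as data makes it easy to review which hosts would be power-cycled before anything is sent.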
[18:41:44] !log powercycling srv199
[18:41:45] Logged the message, Master
[18:48:08] notpeter so you will have to kill the negative cache on that attempt
[18:48:14] when you finish updating dns
[18:48:44] on ns0/dobson, if ya dunno, rec_control wipe-cache fqdn
[18:49:03] or even when you add dns the dhcp for the servers you tried to netboot will have the same error
[18:50:33] is anyone currently editing the zone files?
[18:50:40] there are uncommitted changes....
[18:50:58] oh shit
[18:51:03] i forgot to push my dns changes
[18:51:07] they look safe
[18:51:08] mount.nfs: DNS resolution failed for ms5.pmtpa.wmnet: Temporary failure in name resolution
[18:51:11] mount.nfs: DNS resolution failed for ms7.pmtpa.wmnet: Temporary failure in name resolution
[18:51:12] but wanted to check
[18:51:12] cmjohnson1_: if you tried to test the mgmt it wouldnt have worked
[18:51:14] is that related?
[18:51:23] i just saw that on srv199, in the second you said that
[18:51:23] shouldn't be
[18:51:28] mutante: not my changes, i just added mgmt for new servers
[18:51:42] and i didnt even check them in =P
[18:51:45] i got sidetracked
[18:51:53] oh, ok, guess ms5 and 7 are just outdated
[18:51:54] notpeter: so you will push my changes with yers then please?
[18:52:06] RobH: cool. no problem
[18:52:09] thx
[18:52:18] just wanted to make sure someone else wasn't actively editing
[18:52:19] i got distracted halfway through it =P
[18:52:46] exited it and svn diff'd to paste into chris's ticket
[18:52:50] and then didnt do anything else, heh
[18:53:28] btw DHCP, srv199 came back up but tried to get an IP via DHCP and failed, now its up with just "lo" and "lo:LVS"
[18:54:02] it has dns, so something is wrong with it
[18:54:19] was it possibly offline and had a mainboard replaced?
[18:54:24] eth0 Network is down
[18:54:33] if mainboard was swapped, could be chipset change, but dells dont normally do that ever.
[18:54:34] DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 4
[18:54:34] send_packet: Network is down
[18:55:02] i would reboot and ensure the network is on in bios
[18:55:11] then if it is, ask LeslieCarr to see if the port is up on the switch
[18:55:25] the link is down: 2: eth0: mtu 1500 qdisc noop state DOWN qlen 1000
[18:55:25] robh: test which mgmt?
[18:55:40] cmjohnson1_: i gave you the IPs for the new boxes, and you are programming them right?
[18:55:45] yes
[18:55:48] if you finished before now, your test would have failed
[18:55:54] cuz i forgot to push the changes i gave you
[18:55:59] notpeter is pushing them all now
[18:56:14] oh..ok...i set all up then test...so thx for telling me (JIC)
[18:58:01] is srv199 where you are right now?:)
[18:58:19] it looks like it might be the cable
[18:59:27] mutante: srv199 is here
[18:59:34] I will check cables
[19:00:22] cmjohnson1_: thanks, its not that urgent though, if it really is the cable it happened like 15 days ago
[19:01:12] 15 days ago I was in Colorado so, it would've had to go bad on its own...but I will check
[19:03:33] oh.. Embedded Gb NIC1: Enabled with PXE ..but:
[19:03:43] MAC Address: Not Present
[19:03:50] is binasher around?
[19:03:56] cmjohnson1_: nevermind i think its the NIC then
[19:04:11] cmjohnson1_: i can create an RT for that of course
[19:04:43] mutante: thats messed up man
[19:04:48] i have never seen that.
[19:04:52] will "MAC Address: Not Present" be enough to make Dell believe its broken?
[19:05:16] i dont think there is anywhere else in bios to disable the nic
[19:05:19] other than the one you pasted
[19:05:29] so i would think so, but if not they will tell chris what else needs to be tested
[19:05:43] no matter what, that mac should show up, even when disabled if I recall correctly
[19:05:58] i'll make a ticket,it doesnt need high priority
[19:07:14] !log srv199 boots but without eth0, NIC1 is Enabled in BIOS but MAC Address "Not Present" - creating hardware ticket
[19:07:15] Logged the message, Master
[19:08:05] robh: the mac will not show up if the nic is disabled
[19:08:17] good to know
[19:08:18] but it is also "Enabled with PXE"
[19:08:30] mutante: did you change it to that
[19:08:34] or was it already on that?
[19:08:40] was like that
[19:08:42] if you changed, you may have to save and reboot to see
[19:08:43] ahh, ok.
[19:08:49] then its prolly fubar
[19:09:14] though i have not seen it do that before, doesn't mean its not doing it now =]
[19:12:38] RobH: actually, you want to read over my dns changes and push it out?
[19:12:41] I like more eyes...
[19:12:47] sure
[19:13:29] notpeter: it looks good, did you commit or shall i?
[19:13:33] New patchset: Jgreen; "puppetizing impression log collection scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1778
[19:13:53] ahh, its not, commiting and pushing
[19:15:04] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1778
[19:15:05] !log updating dns for mgmt of ms-fe1/2 and other new servers in tampa, as well as search boxen in eqiad
[19:15:05] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1778
[19:15:06] Logged the message, RobH
[19:15:42] werd up. thanks!
[19:16:17] !log dns update successful and none of them fell over [19:16:18] Logged the message, RobH [19:19:10] !rt 2209 | cmjohnson1_ [19:19:10] cmjohnson1_: http://rt.wikimedia.org/Ticket/Display.html?id=2209 [19:19:17] (with screenshot ;) [19:20:44] mutante: thx, i will let you know what i find [19:22:07] robh: are you still working on dns updates [19:22:10] cmjohnson1_: cool, no rushing ,its just one of the srvs :) [19:22:14] they are done now [19:22:18] should be working [19:22:45] try ssh into WMF3641.mgmt [19:22:50] not working for me [19:23:08] hrmmmmm, something is up [19:23:09] checking [19:23:14] nevermind [19:23:16] its up [19:23:31] New patchset: Lcarr; "Fixing ganglia web installation on nickel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1779 [19:23:38] cool [19:23:45] can i get a review plz ? https://gerrit.wikimedia.org/r/1779 [19:25:48] ACKNOWLEDGEMENT - Host srv199 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2209 NIC failure [19:25:57] LeslieCarr: seems ok to me, reviewing now [19:26:04] thanks [19:27:12] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1779 [19:27:35] New review: Dzahn; "instead of require "generic::webserver::php5-mysql" and $ssl=true, you can also:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1779 [19:29:09] LeslieCarr: ah,ok, you want "generic::webserver::php5-mysql", that is probably different again, but "generic::webserver::php5" already has the "ssl true/false" [19:29:52] thanks for that :) [19:30:29] i wan it for the just php5 to make sure we get the ssl [19:30:33] updating now [19:31:44] do you really need both, an "include" and install_certificate to install the cert? 
[19:32:08] New patchset: Lcarr; "Fixing ganglia web installation on nickel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1779 [19:32:22] the other machines have it so i assumed it was required [19:32:31] other machines with the star cert [19:33:46] ok, i think i just used install_certificate but thats maybe something i am missing [19:37:07] look good to you ? [19:38:10] ah, i see you are doing this: generic::apache::no-default-site nice, i didnt see that, i ended up with this: apache_site { no_default: name => '000-default', ensure => absent } [19:38:15] yes [19:39:07] i would merge. but Rob still reviewing? [19:39:42] !log restarting dhcpd on brewster [19:39:43] Logged the message, and now dispaching a T1000 to your position to terminate you. [19:39:50] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1779 [19:40:11] i didnt notice the varient [19:40:18] and approved it [19:40:46] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1779 [19:40:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1779 [19:41:33] robh: when you get a chance please look into this http://rt.wikimedia.org/Ticket/Display.html?id=2210 [19:42:03] cool, you think you have enough until thursday? 
[19:42:13] LeslieCarr: its merge on sockpuppet ..now [19:42:22] i think i have plenty of spares in eqiad, can send you some rather than buy more [19:42:31] if i dont, i order then [19:42:36] thanks :) [19:42:49] yes..definitely [19:42:50] yw [19:42:57] thx [19:43:11] i was redoing the cabling in my house this weekend and realized i really need a labeler [19:45:52] TekGun Ethernet Cable Labeler -> ;) http://www.youtube.com/watch?v=-ou4yQrrPKE [19:46:01] it prints on the cable directly [19:49:34] uh heh, the comments dont sound overly excited though .heh [19:49:34] woah, my mind is blown - also - what horrible cgi -- why don't they just show us a picture of the actual product ? [19:50:38] hehe handy pen attachment [19:50:41] that reminds me of equipment we misused when I was 16 working at a pickle factory for a week [19:52:38] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [19:52:53] brady. [19:53:03] if you wanna spend a lot of money, but never have to reapply a label. [19:53:04] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1774 [19:53:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1774 [19:53:05] i should clarify for the filthy minded who shall remain nameless . . . it was a labeler along a conveyor which spat date stamps at anything that went past [19:53:12] its like 300 bucks for the printer, then each cart is 50 [19:53:30] RobH: thanks, i was trying to remember the word "brady":) [19:54:05] http://www.bradyid.com/bradyid/pfpv/Printers-and-Label-Makers~Portable-Label-Printers~IDXPERT%E2%84%A2-Labeling-Printers.html [19:54:27] we have the xpert-key model in all three datacenters [19:54:33] atleast i think mark has one [19:54:39] mutante: does he? you were there last [19:54:58] it rocks. [19:55:27] RobH: yes, i used it quite a bit, and labeled servers in Amsterdam. 
the big yellow box :) [19:55:35] yep [19:55:46] now for servers we use a specific small label, no cutting [19:56:00] but i'm not sure if the same labels are also good for cables [19:56:08] it has different ones [19:56:13] self laminating for labels [19:56:16] seems like you want one kind of labeler / labels for server cases, and another "technology" for cables [19:57:46] same labeler [19:57:46] like those clear labels that wrap around the cable [19:57:49] different cartridges [19:57:54] I am still stuck on the "pickle factory" I mean c'mon does anyone else think that is hilarious..the jokes that are running through my mind [19:57:55] thats what we use [19:57:57] was looking for link [19:58:08] mutante: yah but tedious when you have a lot of labelling to do [19:58:09] cmjohnson1_: none of mine were polite enough for use in channel [19:58:21] so the self laminating ones we use are nice [19:58:33] you can run off a batch and its done on a ribbon, peel off and wrap on cable end [19:58:35] cmjohnson1_: yes it is! hence the clarification :-) [19:58:35] i like the self laminating labels we use for cables [19:58:39] the clear part overlaps the printed [19:58:47] Jeff_Green: sigh yes, had to label hundreds of cables like that at former job [19:59:20] like this [19:59:21] http://www.bradyid.com/bradyid/pdpv/XC-750-427.html [19:59:33] if you have ever seen the intro to the 70's tv show Laverne and Shirley, it was just like that [19:59:51] yeah, looks very similar [19:59:58] Jeff_Green: http://www.youtube.com/watch?v=mRmKzxhMzwo [20:00:04] cool, but 1 roll 55 Dollars? :p [20:00:05] thats what you get for making me remember that [20:00:13] hahahah [20:00:19] yes we did the dance on the way to the factory too [20:00:22] hehe, i am totally listening to it. [20:00:37] it's clashing with my lou reed, dammit [20:00:56] argh, thaqt near killed me [20:01:04] i made myself laugh, and im getting over a cold [20:01:10] coughing fit =P [20:01:11] hahahah. 
sorry [20:01:19] heh [20:02:14] can someone log into nickel and tell me wtf all the spew in /var/log/messages means plz [20:02:15] ? [20:02:40] .....interesting. [20:02:46] looks like filesystem crash to me [20:02:59] kernel is upset [20:03:19] yeah [20:03:47] dpkg is somehow involved [20:04:31] LeslieCarr: what is this machine going to do, just asking cuz 8gb ram and no swap [20:04:41] prolly fine but thought i would mention it [20:04:46] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1768 [20:04:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [20:04:47] unrelated to the problem though [20:04:53] ganglia web collector [20:05:00] gah [20:05:18] hrmm, i have no idea what kind of memory ganglia would actually use on its own [20:05:24] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1769 [20:05:25] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [20:05:54] hrm, okay, will check out partman for a better profile [20:06:27] but again, i dont think thats the issue for this [20:06:33] yeah [20:06:35] jeff pointed it out, it goes to call dpkg and dies [20:06:36] ganglia runs on spence now? that's only 4GB [20:06:54] true [20:06:56] i thought filesystem partly b/c I saw mention of inode in the barf [20:07:00] and spence is dying all the time :( bad spence! [20:07:08] syslog shows the dpkg call [20:07:21] Jan 4 19:55:01 nickel CRON[16746]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) [20:07:22] Jan 4 19:55:25 nickel kernel: [64519.706262] INFO: task dpkg:16736 blocked for more than 120 seconds. [20:07:22] Jan 4 19:55:25 nickel kernel: [64519.712666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[20:07:22] Jan 4 19:55:25 nickel kernel: [64519.720497] dpkg D 00000000ffffffff 0 16736 16725 0x00000000 [20:07:24] then death [20:07:33] yeah, I dunno. maybe dpkg is hung up on a disk operation due to a filesystem issue? [20:07:50] or maybe it has nothing to do with disk at all [20:08:25] lets make a new nagios in labs (to resolve "completely puppetize nagios" and then move it away from spence) [20:08:33] no disk failures in dracmgmt but thats only for really big fails. [20:08:38] leslie--spence dies but it also has a GB of RAM free currently, so I'd imagine 8GB is going to be plenty [20:09:01] hrm [20:10:21] i think the ganglia web collector host should be one of the newer ssd equipped cache or misc servers [20:12:43] binasher: why ssds? the rrds usually live on a ramdisk, so disk speed is less important. [20:13:12] or are you suggesting we switch to ssd for the rrd storage instead of ramdisk? [20:13:27] New review: Dzahn; "URLs changed on gallium and work fine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1774 [20:14:26] New review: Dzahn; "installed maven on gallium without problems" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1769 [20:14:54] leslie: process apt-get (16725) is horked. I would probably kill it and try again [20:15:23] maplebed: rsync'ing from tmpfs to a slow disk in order to be able to run whatsoever is a neat hack but still fucking lame if you have resources better than servers with single sata drives and care about data loss. 
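(Annotation, not part of the original conversation.) The kernel lines pasted at 20:07 follow the usual hung-task pattern, "INFO: task NAME:PID blocked for more than N seconds". A minimal sketch for pulling such entries out of a syslog excerpt, assuming that message format; the function name `find_hung_tasks` and the regular expression are illustrative and not something used in the channel:

```python
import re

# Assumed format, taken from the pasted line:
#   "INFO: task dpkg:16736 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (task name, pid, seconds blocked) for every hung-task line."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


# The two syslog lines quoted in the log above.
sample = [
    "Jan 4 19:55:01 nickel CRON[16746]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)",
    "Jan 4 19:55:25 nickel kernel: [64519.706262] INFO: task dpkg:16736 blocked for more than 120 seconds.",
]

if __name__ == "__main__":
    print(find_hung_tasks(sample))
```

Only the process name, pid and timeout are extracted here; working out why the task is blocked (disk, filesystem, RAID) still needs the rest of the kernel trace.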
[20:15:42] New patchset: Lcarr; "Adding warning + nickel into netboot.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1780 [20:15:56] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1780 [20:16:02] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1780 [20:16:03] strace looks like it's just spinning away on pselect6 [20:16:11] New review: Dzahn; "no problems, pulled in these:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1768 [20:17:04] binasher: aka - we should just put the rrds on ssd instead of in ram. fine by me if it's fast enough. Is it? rrds require an insane number of iops. [20:17:55] maybe we should stop using rrd's.. [20:18:00] ++ ! [20:18:29] ganglia is a valuable tool. I would rather not stop using it. [20:18:30] ganglia has no other backend does it? [20:19:30] well i need to redo the partitions anyways, i am going to reinstall this server (and see if the partman config works) [20:20:07] oh random thought--this is software RAID? [20:20:11] yes [20:20:12] it is [20:20:18] the python port of gmetad can apparently use other storage engines [20:20:42] did the raid have a chance to fully build before dpkg started? [20:20:54] it should have [20:21:09] how much time would it need ? [20:21:30] hi binasher, it's diederik [20:21:32] depends on the flavor and size, but it can be up to hours for huge filesystems [20:21:41] i can't remember how to check offhand, looking [20:21:55] drdee: hey [20:23:51] binasher: so i committed the glam filter to trunk/mediawiki/udp2log (i think) what is the next step? [20:27:25] RobH and cmjohnson1_: I'm afraid I need to ask you to undo some of this morning's work. The two ms-fe hosts must be in separate racks/etc. for proper failover. [20:27:54] hrmm [20:28:06] I'm sorry I didn't say so in the ticket. 
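(Annotation, not part of the original conversation.) On the 20:20 question of whether the software RAID had finished building: on Linux md arrays the build state is normally visible in /proc/mdstat, where an array that is still syncing shows a progress line such as "resync = 12.6%". A rough sketch of reading that, assuming the standard mdstat layout; the sample text below is made up for illustration and is not output from nickel, and `sync_progress` is a hypothetical helper name:

```python
import re

# An array stanza starts with its name at column 0, e.g. "md0 : active raid1 ...".
ARRAY = re.compile(r"^(md\S*)\s*:")
# Progress lines look like "[==>....]  resync = 12.6% (...) finish=... speed=...".
PROGRESS = re.compile(r"(resync|recovery|reshape|check)\s*=\s*([\d.]+)%")


def sync_progress(mdstat_text):
    """Map array name -> (action, percent done) for arrays still syncing."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY.match(line)
        if array:
            current = array.group(1)
            continue
        step = PROGRESS.search(line)
        if step and current is not None:
            progress[current] = (step.group(1), float(step.group(2)))
    return progress


# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks [2/2] [UU]
      [==>..................]  resync = 12.6% (123456789/976630336) finish=127.5min speed=33440K/sec

md1 : active raid1 sdb2[1] sda2[0]
      1048512 blocks [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(sync_progress(SAMPLE))
```

On a live host the same function could be fed `open("/proc/mdstat").read()`; an empty result would suggest no array is mid-build, which is the condition being asked about before dpkg ran.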
[20:28:06] gotta find someplace =P [20:28:24] cmjohnson1_: you can just move ms-fe1 since it is on bottom [20:28:38] looks nicer to have one gap instead of two, easier to fit future servers as well if they are a 2u [20:28:43] clearly, only one of them needs to be moved. [20:28:49] lemme see where to put it [20:29:14] maplebed: the puppetization of the udp2log stuff you did is live, right? [20:29:15] heh.... i think the best place to put it in d1 in pmtpa. [20:29:21] but still checking [20:29:26] okay...what about b3 [20:29:38] b3 is purely search so far [20:29:46] where d1 is already mixed use and dual power [20:29:46] binasher: my memory says it's live on one host but not both. but I recommend double checking. [20:29:53] is b3 dual power? [20:29:57] (I dont recall) [20:30:01] negs [20:30:22] yea, lets leave just search in that rack for now, since it has a half rack left [20:30:29] we may want to put a large group of clustered servers there [20:30:31] maplebed: thanks [20:30:43] so then the rack would be search and one other thing, not mixed use, keeping mixed use racks to a minimum whenever possible [20:30:51] easier to balance power in clustered racks and all [20:31:02] make sense? [20:31:02] mutante: your mac problem has gotten to be something bigger...boot process is hanging [20:31:14] yep..makes sense [20:31:15] if the nic is bad, its on the mainboard [20:31:19] chances are the entire mainboard is fubar [20:31:32] i would expect other issues to crop up on that system [20:31:52] cmjohnson1_: cool, so this means you need to move it, update racktables, and drop new network ticket, or reopen the old one with new details [20:31:57] cmjohnson1_: it hangs for a while, until it gives up getting an IP ..but then continues.. right.. 
thats how it was for me [20:32:01] i will update ticket with move info just so its on record [20:33:22] mutante when i first plugged the crash cart into it...it was still where you left it...i exited...it was looking for an ip and gave up...i rebooted to go into bios and it is hanging after mpt boot rom [20:33:45] robh: thanks...i will need to amend network ticket to lesliecarr as well [20:33:47] drdee: just reviewed the filter, it looks potentially ok, though i'm not sure if python will allow for a very high sampling rate. could you open an rt request for the install that says how long it should be in place for? [20:33:53] yep [20:36:14] cmjohnson1_: uhm..oh well, NIC is embedded anyways, so no big difference which other part of the board is broken..hm:) [20:37:07] mutante: i am not done with it yet...just giving you my once over results [20:37:30] binasher: i will do that [20:39:54] cmjohnson1_: thank you. i would also expect other problems with that board, just like RobH said.. not sure what you need to make Dell believe its broken..but besides that i dont think it matters [20:40:12] the mac being not there is enough reason to return the mainboard [20:40:19] no other testing is really needed unless dell says [20:40:31] but they shouldnt, it not showing the mac when its enabled is borked. [20:45:13] I have to call dell this week about 191 so i will get their opinion on 199 as well. [20:49:10] binasher: ticket was created and you have been cc'ed [20:49:26] New patchset: Ryan Lane; "Adding in a pre-login banner for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1781 [20:50:07] drdee: thanks, i'll take the ticket. 
about to go to lunch, but i have some more questions about how the logs will be used [20:50:31] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1781 [20:50:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1781 [20:51:48] binasher: sure, happy to answer them :) [20:52:05] PROBLEM - HTTP on nickel is CRITICAL: Connection refused [20:56:05] PROBLEM - SSH on nickel is CRITICAL: Connection refused [20:56:54] New patchset: Jgreen; "refactoring banner impression log handling scripts for easier maintenance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1783 [20:57:38] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1783 [20:57:39] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1783 [22:07:42] RECOVERY - SSH on nickel is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:24:21] lesliecarr: not sure if you got my earlier email about port configurations but i had to make some changes [22:28:18] New patchset: tstarling; "Updated collector location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1785 [22:28:50] New review: tstarling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1785 [22:28:50] Change merged: tstarling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1785 [22:41:02] New patchset: Asher; "install percona-nagios-checks in the right place, add nrpe template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1786 [22:45:30] New patchset: Asher; "install percona-nagios-checks in the right place, add nrpe template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1786 [22:46:15] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1786 [22:46:15] 
Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1786 [23:00:50] PROBLEM - NTP on nickel is CRITICAL: NTP CRITICAL: No response from NTP server [23:16:51] PROBLEM - mobile traffic loggers on cp1042 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [23:31:36] PROBLEM - mobile traffic loggers on cp1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [23:36:26] RECOVERY - HTTP on nickel is OK: HTTP OK HTTP/1.1 200 OK - 455 bytes in 0.053 seconds [23:49:16] RECOVERY - NTP on nickel is OK: NTP OK: Offset -0.09375977516 secs [23:51:16] RECOVERY - mobile traffic loggers on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa