[00:00:12] New patchset: Lcarr; "Make test puppet repo act like production (pull from git)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2096 [00:01:15] lvs4 is still monitoring them [00:01:28] mark: 7th time's the charm ? [00:17:24] New patchset: Bhartshorne; "put in a default so things don't break on new servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2103 [00:17:44] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2103 [00:17:45] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2103 [00:19:21] LeslieCarr: cool if I merge your changes? [00:19:21] your change is unmerged on sockpuppet maplebed - is it safe to merge ? [00:19:25] haha [00:19:26] jinx [00:19:27] sure [00:19:28] :) [00:19:30] New patchset: Asher; "snapshot db26" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2104 [00:19:53] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2104 [00:19:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2104 [00:19:59] merged [00:30:24] maplebed: seems fine [00:30:38] you can do a second invocation for sda2 and sdb2 with partition_nr => 2 [00:31:02] won't they fail because sda and asb are 'os' disks? [00:31:17] oh yeah [00:31:20] they make a new partition table [00:31:22] yeah that won't work [00:33:19] mark: https://rt.wikimedia.org/Ticket/Display.html?id=2328 are for the two enwiki db servers [00:33:24] one quote per datacenter [00:34:01] two per data center, right [00:34:06] yep [00:34:10] well, nothing stopping me from making the final two partitions by hand. [00:34:19] i'm making a major prod puppet change, so if it breaks, yell at me :) [00:34:37] mark: RobH: http://pastebin.com/uY9ztij3 [00:35:17] nice [00:35:22] though less impressive than with 48 drives ;-) [00:35:28] lol [00:35:40] cheaper than the 48 disk version. [00:35:45] mark: Is there anything else you want done to this host before we order 4 more? [00:35:47] those thors were pricy [00:35:59] or woosters or anybody else? [00:36:01] if rob and you are happy with them [00:36:08] then I'm comfortable buying a few more [00:36:10] I am happy with them for this project [00:36:21] RobH: you're satisfied with the ipmi stuff as a method for management? [00:36:22] i would want to hack at the c series before moving it to anything else mind you [00:36:26] we'll get to know them better in the next few weeks [00:36:40] maplebed: yep, it does what we need it to do just fine [00:36:46] ok, sounds like we're set to order. [00:36:48] and when i finish polishing script it will be easy for everyone [00:36:54] suggest we place the order asap, maplebed [00:36:59] RobH: maybe you should order 5 instead of 4 (4 for me and 1 for you to continue playing with)? [00:36:59] ok, i am going to ask dell to update quote for better pricing since we are ordering a lot of shit [00:37:21] nah, too expensive a system for testbed [00:37:24] i think we are good to go [00:37:38] any testing i do for scripting works against R series just fine [00:37:50] well, if you need to play with them more before moving on to anything else, but you won't order one to play with, how do you propose ever moving them into something else? [00:37:59] and we need to test swift's fault tolerance anyway, right? ;-) [00:38:07] rob plays with production [00:38:12] damn right [00:38:13] riiight... 
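For context on the "final two partitions by hand" mentioned above: the puppet invocation with partition_nr => 2 can't be reused for sda2/sdb2 because sda and sdb are the OS disks and the define rewrites the partition table, so the leftover space gets prepared manually. A minimal sketch of that manual step, assuming the usual swift conventions (XFS, mounts under /srv/swift-storage); the start offset, mkfs options, and mount point are illustrative and not taken from the log:

    for disk in sda sdb; do
        # carve the remaining space into a second partition (the 30GB start offset is hypothetical)
        parted -s "/dev/$disk" mkpart primary xfs 30GB 100%
        mkfs.xfs -f -i size=512 "/dev/${disk}2"           # XFS with 512-byte inodes is the common swift recommendation
        mkdir -p "/srv/swift-storage/${disk}2"
        mount -o noatime,nodiratime "/dev/${disk}2" "/srv/swift-storage/${disk}2"
    done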
[00:38:18] no fun to play in a sandbox. [00:38:38] ok. RobH do you need anything from me (rt or whathaveyou) or are you good to go? [00:38:47] the question of them being worth using is more of price versus ease of use [00:38:58] if the system is fine, then i am set, will ask for updated quote [00:41:26] maplebed: so you want 4 additional hosts [00:41:30] in addition to the one you have now [00:41:32] correct? [00:42:15] asked for 4 [00:47:49] mark: err: /Stage[main]/Puppetmaster::Gitclone/Git::Clone[operations/software]/Exec[git clone operations/software]/returns: change from notrun to 0 failed: git clone https://gerrit.wikimedia.org/r/p/operations/puppet returned 128 instead of one of [0] at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:756 ? [00:47:52] know what's up with that ? [00:49:42] yes leslie [00:49:45] you didn't get the origin right ;-p [00:49:49] New patchset: Lcarr; "fixing software repo for puppetmaster in prod" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2105 [00:50:00] * LeslieCarr throws a can over the cube wall [00:50:34] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2105 [00:50:35] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2105 [00:53:44] PROBLEM - DPKG on ms-be1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:54:07] RobH: yes, that's correct. total of 5 - ms-be1 + 4 more new ones. [00:55:35] PROBLEM - Disk space on ms-be1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:56:02] maplebed: cool, quote update requested [00:56:17] Jeff_Green: https://rt.wikimedia.org/Ticket/Display.html?id=2323 is for tampa right? [00:57:11] #2323: procure a pair of hosts to do lvs/pybal for the new payments cluster [00:57:21] i assume yes, but i rather be certain ;] [00:57:58] low-perf misc server cluster would be fine [00:58:18] New patchset: Lcarr; "trying moving the git clone to last" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2106 [00:58:27] pretty sure we dont have any spare of those in tampa, so will have to order [00:58:29] RobH: thats eqiad actually [00:58:34] oh, then we have a ton of them [00:58:35] yay! [00:58:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2106 [00:58:41] Jeff_Green: need public IP? [00:58:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2106 [00:58:46] i assume yes, lvs [00:58:48] er [00:58:52] these go in the extra secure rack [00:58:56] mark: so that means i have to snag from row b for both for now. [00:58:58] so you can't allocate [00:59:03] ahh, have to wait for row C [00:59:04] row C ;) [00:59:04] ? [00:59:11] RobH: oh we fixed the glitch :) [00:59:13] and that rack will be handled differently [00:59:16] wasnt sure if we would do now or wait, i will update the ticket and tie to the row C [00:59:16] we have public ip's in row a now [00:59:19] yeah exactly, secure rack [00:59:28] not relevant to this case, but good to know for the future [00:59:35] Jeff_Green: So that order is in for those racks, I expect a two week leadtime to get them in, perhaps 3. 
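On the git clone failure above: the Git::Clone[operations/software] resource was pointed at the operations/puppet origin, which is why the exec bails out with exit code 128, and change 2105 fixes the origin. Roughly speaking the intended result is the equivalent of the following, where the destination path is a guess based on the /var/lib/git layout visible in the error message rather than something stated in the log:

    # clone the software repo from its own origin instead of the puppet one
    git clone https://gerrit.wikimedia.org/r/p/operations/software /var/lib/git/operations/software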
[00:59:39] i'm going to have to build out a pxeboot situation there [00:59:41] LeslieCarr: indeed =] [00:59:58] you're going to have to build out EVERYTHING in there ;) [00:59:59] robh that's totally fine [01:00:03] mark: yes [01:00:03] it'll be an autonomous island [01:00:10] no air is allowed in [01:00:14] ;) [01:00:14] PROBLEM - RAID on ms-be1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:00:15] it needs a name [01:00:20] greenland [01:00:23] ha [01:00:24] mark: so i can just take two exisiting ones and move them then? [01:00:28] or we wanna order more? [01:00:31] mark: that's funny on multiple levels [01:00:34] RobH: either way [01:00:47] will hold off, try to use existing once new row is in [01:00:52] and if they get used before then, will order more [01:01:00] tying to the row C expansion and such [01:01:00] where green is me, the color of US bills, island, etc. [01:01:04] :) [01:01:12] and it's cccccold there [01:01:39] it'll need it's own logo that looks Disney-ish too [01:02:57] we can actually put other servers in that rack right [01:03:03] otherwise unrelated to fundraising [01:03:05] sure [01:03:14] so we could build our normal infrastructure, as long as there's room for your stuff, plus it's secure [01:03:14] just has to lock shut and the like [01:03:18] indeed [01:03:26] so would be easiest to toss our shit in top of that rack [01:03:28] so we'll put a normal EX4200 in etc [01:03:30] i was thinking it might make sense to move aluminium, and fundraising dbs etc there [01:03:31] yeah [01:03:31] and just set aside the bottom half for payments [01:03:37] indeed [01:03:37] Jeff_Green: indeed [01:03:45] jinx. [01:03:46] is there redundant power? [01:03:50] everywhere [01:03:50] yep [01:03:57] all eqiad is redundant, and tampa will slowly migrate to be [01:04:06] can we do two switches? [01:04:11] ehm [01:04:15] not for our normal production stuff [01:04:30] why do you need that? [01:04:34] it'd be swell not to lose everything if we lose one power circuit or one switch, just because we chose to put it all in the same rack [01:04:38] if a switch dies i would imagine [01:04:41] yeah [01:04:43] then let's not put it all in one rack [01:04:49] =/ [01:04:52] ok, agreed [01:05:10] i would like to avoid cross rack within the same datacenter [01:05:21] what do you mean by cross rack? [01:05:35] dont all the payments stuff have to plug into some jujiper device we are ordering [01:05:43] that will all be in one rack [01:05:57] payments stuff in other racks wont need to plug into that? [01:06:08] unless I misunderstood jeff [01:06:15] that one secure rack will be redundant [01:06:17] so there will be 2 firewalls, one per power circuit [01:06:22] yeah [01:06:36] but stuff NOT in the secure realm, will be on our normal production switches [01:06:42] and of those, we'll only have one switch per rack [01:06:52] and if we end up using switches within the payments cluster there will be two little switches on each power circuit [01:06:54] if we place payments stuff in two racks, the second rack will not need to hit those firewalls with direct plugged connection? [01:06:54] so if THAT needs to be more redundant, let's not put THAT in one rack [01:07:01] yeah--that's what I was referring to [01:07:01] Jeff_Green: that's fine [01:07:04] ok [01:07:14] * RobH is confused [01:07:15] RobH: payments stuff will be in one rack [01:07:33] fundraising stuff will remain spread out across racks? 
[01:07:35] there will be no cross rack connections anywhere [01:07:37] (excepting payments) [01:07:39] RobH: think of it this way [01:07:41] ah, yes [01:07:41] err [01:07:56] payments is the walled garden that is internally redundant on that rack [01:08:13] payments will plug into the firewalls direct, and not need an access switch, correct? [01:08:24] and the rest of fundraising is essentially aluminium, db's, activemq box, maybe some log processing stuff [01:08:35] if they plug into a switch, you are back to a non-redundant spof in that access swtich. [01:09:05] robh if we do switches within the payments environment there need to be a pair but they can be tiny [01:09:17] so two 24 port junipers [01:09:22] or less [01:09:56] (cuz this means we should be adding those to the juniper order) [01:10:00] yeah, i mean we're talking about ports for: 4 payments servers, 2 lvs boxes, 1 shell/bastion, 1 logger box [01:10:12] the SRXes can do that just fine [01:10:14] no switches needed [01:10:23] they are 8 port? [01:10:27] 16 [01:10:27] yeah I'm totally on board for doing direct to the SRXs [01:10:29] cool. [01:10:33] it would be pretty funny to have more RU of network equipment than servers :) [01:10:34] ok, i am on same page [01:10:56] LeslieCarr: that's what we ended up with at CL isn't it? [01:10:58] :-P [01:11:09] haha very true :) [01:11:23] * mark wonders when perl will sneak in [01:11:45] mark: did you see CL donated $100K to perl foundation? [01:11:50] i did [01:11:54] hence my comment :D [01:11:59] and it's already there, like a disease [01:12:07] all my scripting is in perl :-P [01:12:07] yeah I saw that [01:12:17] not much longer ;-P [01:12:36] mark: so we wanna have sdtpa expand row d to redundnat power right? [01:12:42] did you want to email and ask for quote on that? [01:12:54] I can write it in perl and know it'll work, or attempt it in python (to be cool) and know it'll break until I get python chops [01:13:05] d3 will be empty when we decom those servers, which would be a good spot for the gluster servers [01:13:07] well, half of them [01:13:09] that's fine, we can fix it then ;) [01:13:10] split for redundancy [01:13:25] RobH: do the ciscos have redundant power? [01:13:29] yes [01:13:34] if i recall correctly. [01:13:39] lemme login to one to confirm [01:13:41] prolly do [01:15:31] 99.99% they do [01:15:55] indeed, they do [01:16:57] I love it when I make the $1 test donation to make sure I didn't just break payments, and still get the "You are amazing..." email from civicrm-Sue [01:17:22] take that self esteem boost and run with it [01:17:24] seems like for $1 for a US donor a "gee thanks" would do [01:17:25] it cost you a buck ;] [01:17:44] RobH: someday when I donate big it'll be in $1 increments [01:18:16] It looks like the same email, but she's actually being sarcastic. [01:18:37] ah thinking of it that way restores my sense of universal balance [01:19:45] mark speaking of perl-bashing I've had the pleasure(?) of hacking a CGI for OTRS reporting over the past couple days, complete with half-using their API and half avoiding it like the plague [01:20:00] Jeff_Green: have you ever donated not in english? you will get a translated thank you email :-) [01:20:01] API is the wrong word [01:20:27] I haven't--that sounds like a good too. [01:20:30] err good test too [01:20:35] heh [01:20:52] just make sure its a bigger-ish language, i didn't have time to add things like wolof [01:21:03] hahahah [01:21:17] is esperanto supported? [01:21:28] klingon? 
[01:21:33] I might understand the forms enough to function in French or German [01:21:35] we closed that wiki [01:21:50] must support all open wikis! [01:21:52] RobH: no one has translated that one [01:21:56] https://meta.wikimedia.org/wiki/Category:Translation/Fundraising_2011/Thank_You_Mail [01:24:09] alright--i'm out. have a good evening folks [01:24:16] night [01:35:16] New patchset: Pyoungmeister; "copy/paste error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2107 [01:35:49] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2107 [01:35:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2107 [01:57:48] New patchset: Asher; "snapshot db32" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2108 [01:58:14] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2108 [01:58:14] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2108 [02:03:52] PROBLEM - Frontend Squid HTTP on cp1002 is CRITICAL: Connection refused [02:06:52] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [02:09:32] PROBLEM - Memcached on ms-fe1 is CRITICAL: Connection refused [02:10:52] PROBLEM - Memcached on ms-fe2 is CRITICAL: Connection refused [02:23:38] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1710s [02:32:58] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 2316s [02:35:59] New patchset: Catrope; "Fix puppet restart for udp2log-aft" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2109 [02:43:08] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:44:18] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:52:58] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.206 second response time [03:53:56] New patchset: Catrope; "Fix puppet restart for udp2log-aft" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2109 [03:54:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2109 [04:17:48] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:24:18] RECOVERY - Disk space on es1004 is OK: DISK OK [04:40:18] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [06:04:16] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [06:20:06] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [08:13:15] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours [10:01:50] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 442696 MB (3% inode=99%): [10:06:47] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 412796 MB (3% inode=99%): [10:28:30] New patchset: Hashar; "integration site now mobile aware" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2110 [10:28:48] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2110 [11:13:14] RECOVERY - MySQL slave status on es1004 is OK: OK: [12:47:03] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [12:47:03] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [15:55:31] PROBLEM - check_minfraud3 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:31] RECOVERY - check_minfraud3 on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 8644 bytes in 0.223 second response time [16:14:21] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [16:30:21] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [16:40:31] PROBLEM - check_minfraud3 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:45:31] RECOVERY - check_minfraud3 on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 8644 bytes in 0.225 second response time [17:39:55] robh: regarding srv199, i think the best option is to turn off sata port a in bios. [17:40:09] !rt 2293 [17:40:09] https://rt.wikimedia.org/Ticket/Display.html?id=2293 [17:40:12] uhh [17:40:24] wouldnt that then disable using that drive? [17:41:08] its covered under warranty until 2012-02-06 [17:41:50] basically if its under warranty [17:41:53] i dont want to just disable something [17:41:57] i want it fixed ;] [17:42:24] is the hard disk plugged into sata port a, or into a controller card? [17:42:30] cmjohnson: 6? [17:42:33] ^ even [17:42:34] heh [17:42:41] into the controller card....nothing is on sata port a [17:42:49] in the other 1950's it is not used as well. [17:42:59] and looks like it was disabled in bios [17:43:08] ahh, ok [17:43:22] as long as its not the primary, then the error is more than likely it expects to see it, but doesnt [17:43:29] so yea, disable the port in bios and we should be ok [17:43:48] then make sure it boots without that, need it shutdown? [17:44:00] yes please [17:45:15] !log shutting down srv199 for bios tinkering by chris [17:45:18] Logged the message, RobH [17:45:28] cmjohnson: done [17:49:41] PROBLEM - Host srv199 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:32] robh i am done...booting now [17:53:45] had to go get keyboard from crash cart upstairs [17:54:18] cool [17:55:58] robh: can you powercycle it and see if boots correctly now [17:56:29] it should show it on initial boot with crash cart [17:56:36] com2 in use, resetting drac [17:56:44] RECOVERY - Host srv199 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [17:57:05] cmjohnson: if you have the crash cart connected, you should be able to confirm the message is gone during post [17:57:11] since it requires someone to hit f1 to boot [17:57:36] ok then it should be fine..becuase it went through post and is at os login [17:57:44] cool, then its ok [18:03:34] PROBLEM - Apache HTTP on srv199 is CRITICAL: Connection refused [18:04:14] !log forcing puppet run on srv199 [18:04:15] Logged the message, RobH [18:06:26] yea, nagios will clear up [18:09:00] stuff on noc.wikimedia.org/conf/ is in SVN... is the repo public? [18:13:54] RECOVERY - Apache HTTP on srv199 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [18:22:54] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours [18:27:19] hello ops :) [18:27:27] can someone could potentially merge https://gerrit.wikimedia.org/r/#change,2110 ? 
:) [18:27:42] that is just some CSS / html tweaks for the continuous integration website [18:27:52] no need to deploy it right now, but a merge would be great [18:32:57] maplebed: are we buying eqiad storage nodes as well? [18:33:05] yes but not right now. [18:33:13] wanna try in tampa first? [18:49:22] does anyone know the story with the cronspammy "lsusb | grep..." cronjob on spence? [18:49:57] I don't but I'd bet it has something to do with a USB SMS gateway. [18:50:28] i'm gonna fix it--looks like the barf is just a path issue [18:51:06] Jeff_Green: yes, that's me [18:51:11] I can fix that [18:51:18] i'm in there anyway, should I just do it? [18:51:35] sure [18:51:42] LeslieCarr: So I fixed the udp2log init script issue. At first I thought it was a bug in the init-scripts library, but as it turns out it was really just Ryan's fault [18:51:49] haha [18:51:54] of course it was ;) [18:51:59] ok. perhaps cronspam will become smsspam :-P [18:52:27] i dunno, i was enjoying having the longest stretch ever of not getting paged [18:53:09] Jeff_Green: weeee [18:53:14] the spool should be clean [18:54:20] New review: Lcarr; "good, looks like it will now actually look for the right pidfile :)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2109 [18:54:50] Jeff_Green: but it's good to know that our sms sending doober needs to be restarted so often... [18:55:21] sounds like quality hardware yah [18:55:29] LeslieCarr: For laughs, read patchset 1 on that change (including the commit-msg), that's when I was still convinced /lib/lsb/init-functions was at fault [18:56:17] :) [18:57:56] New patchset: Mark Bergsma; "Added eqiad service IPs for lvs realservers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2112 [18:58:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2112 [18:58:20] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2112 [18:58:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2112 [19:06:46] New patchset: Mark Bergsma; "Copied text-squid role class into role/cache.pp, renamed to role::cache::squid::text" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2113 [19:07:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2113 [19:07:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2113 [19:07:26] Jeff_Green: ugh. I'll redirect stderr and out for that thing [19:08:23] Jeff_Green: that second | was needed I'llreadd it [19:11:12] New patchset: Mark Bergsma; "Fix variable name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2114 [19:11:29] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2114 [19:11:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2114 [19:14:23] New patchset: Mark Bergsma; "Work around Puppet bug" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2115 [19:14:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2115 [19:15:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2115 [19:15:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2115 [19:21:00] New patchset: Asher; "path to support legacy mysql installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2116 [19:21:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2116 [19:21:19] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2116 [19:21:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2116 [19:23:07] New patchset: RobH; "upated added new simple shell script for ipmi mgmt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2084 [19:23:14] notpeter: sorry, was on phone with the heating contractor. cool re. /dev/null [19:23:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2084 [19:24:49] !log disregard any flapping by mw1001, its my script testbed [19:24:50] Logged the message, RobH [19:25:17] ok, someone review my script ;] [19:27:03] New patchset: Asher; "provide socket path for older installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2117 [19:27:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2117 [19:27:23] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2117 [19:27:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2117 [19:29:11] mark: so you dont think the mgmt script should just be on all bastion hosts? [19:29:22] bast1001, fenari, etc.. [19:30:07] well I'd prefer it if ops started handling sensitive stuff and passwords and shit on a host where not every third party contractor or volunteer logs in ;) [19:30:10] cuz i can toss the file to be included as part of misc::bastionhost [19:30:18] that would be very lazy of you ;) [19:30:29] but those hosts can access ipmi anyhow [19:30:36] with a quick ipmi tunnel through the bastion anyhow [19:30:47] but they don't have the password [19:30:49] (plus fenari already has it installed ;) [19:31:00] the software isn't sensitive, the password is [19:31:06] right, but the script doesnt have the password, though i would assume folks will set it in shell environment [19:31:17] but if they dont, it will simply prompt per command run [19:31:19] yes [19:31:21] but if you install that on fenari [19:31:24] people will use it there [19:31:54] true, but they already have the ability to do all the stuff the script does, by manually running the commands [19:32:09] so the arguement is to get all ops to use an ops bastion? [19:32:16] yes [19:32:30] So should I install bast1002 or something for ops only use? [19:32:37] then create a ops bastion misc group for this stuff? [19:32:45] or something with a nicer name :P [19:32:52] (i would assume i should pull ipmitool OFF fenari) [19:32:55] New patchset: Asher; "upgrading mysql on db37" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2118 [19:33:12] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2118 [19:33:12] i prefer bastion1002... [19:33:18] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2118 [19:33:18] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2118 [19:33:23] but we named bast1001 already ;] [19:33:24] can we call it ops-fortress-of-solitude [19:33:25] ? [19:33:32] bastion1002 sounds like it's exactly the same as bastion1001 [19:33:34] which it's not [19:33:48] so, clearly, my name suggestion is the better one ;) [19:33:50] call it basta [19:34:30] =/ [19:34:44] cp1001 is similar to cp1048 but one is varnish and the other squid [19:34:48] so i dont see the issue [19:34:57] yes but you don't need to log in to those [19:35:12] opsbastion1001 [19:35:12] I don't want to have to think like "do i need number 37 or 42 for this task" [19:35:47] pillowfort.wikimedia.org? [19:35:59] I think bast1001 is a stupid name for a bastion host already, let's not make it worse ;) [19:36:17] bastion1001 would have been better ;] [19:36:22] the_lions_den.wikimedia.org [19:36:31] err dashes i guess [19:38:00] how about helpful cnames instead? [19:38:14] i dont understand how all of you cannot memorize long random strings of characters. [19:38:21] how can any of you be sysadmins? ;p [19:38:50] have to do ipmi, log into host 'wmf1047' [19:38:56] name everything by asset tag! [19:39:10] * RobH is only partially sarcastic [19:41:37] actually RobH I have seen that work when you have a good machine database, often that does automatic cnames [19:42:08] opsbastion is what i am leaning towards. [19:42:12] unless someone gives me a better name. [19:42:25] just whatever the misc server you use is named now? [19:42:32] just use element name then? [19:42:35] why is there the sudden urge to renmae everything? [19:42:54] so it will also get nfs yes? [19:43:04] it can yeah [19:43:05] setup similar to fenari with far less access, no noc, nfs mount [19:43:17] basically same as bast1001 with even less access [19:43:18] really this doesn't need to be a bastion host [19:43:20] it can be any host [19:43:26] we can login via the bastion hosts [19:43:30] then ssh into this management host [19:43:37] kinda annoying to know 'login to bastion, then onto this host to do lights out mgmt' [19:43:39] i dislike the idea [19:43:47] I dislike using a new server for this too [19:43:52] it can be any host which we already have [19:43:57] something like the puppet server [19:44:12] it's not really necessary to spend a few thousand dollar on [19:44:19] making a misc::mgmthost manifest [19:44:29] then we can just include on puppetmasters i guess [19:44:33] yeah [19:51:04] New patchset: RobH; "added in misc::mgmt to include ipmitool and ipmi script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2119 [19:51:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2119 [19:52:05] someone who isnt me wanna check my two ipmi related changes? 
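Not the actual script in change 2084, just a sketch of the pattern being discussed: the wrapper itself carries no password, it takes one from the shell environment when set and prompts per run otherwise, then drives ipmitool at the host's lights-out (.mgmt) interface. The variable name, IPMI user, default action, and the eqiad domain suffix below are placeholders:

    #!/bin/bash
    # usage: ipmi-mgmt <hostname> [ipmitool chassis arguments...]
    host="$1"; shift
    action="${*:-power status}"
    if [ -z "$IPMI_PASSWORD" ]; then
        read -r -s -p "IPMI password: " IPMI_PASSWORD; echo
    fi
    # lanplus against the management interface; $action is left unquoted so "power status" splits into two words
    ipmitool -I lanplus -H "${host}.mgmt.eqiad.wmnet" -U root -P "$IPMI_PASSWORD" chassis $action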
[19:58:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2119 [19:58:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2084 [20:07:36] New patchset: RobH; "upated added new simple shell script for ipmi mgmt updated Change-Id: I33e6afa9b9d34e8bead610f7a2d4cb713065b88b" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2084 [20:07:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2084 [20:09:22] New patchset: RobH; "added in misc::mgmt to include ipmitool and ipmi script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2119 [20:09:54] Change abandoned: RobH; "combined to another patch by mistake, abandoning this one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2119 [20:17:46] New review: Demon; "(no comment)" [test/mediawiki/core] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1841 [20:17:47] Change merged: Demon; [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/1841 [20:21:06] New patchset: Jgreen; "adding mysql::packages to storage3's config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2120 [20:21:24] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2120 [20:21:44] grrr. [20:23:30] New patchset: Jgreen; "adding mysql::packages to storage3's config, plus a comma so puppet stops chundering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2120 [20:23:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2120 [20:23:49] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2120 [20:23:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2120 [21:04:45] New patchset: Bhartshorne; "moving tampa swift cluster from test to prod configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2121 [21:05:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2121 [21:06:03] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2121 [21:06:04] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2121 [21:12:10] !log upgraded storage3 mysqld from 5.1.47 to mysql-at-facebook-r3753 [21:12:12] Logged the message, Master [21:32:57] New patchset: Lcarr; "moving all of the misc:: and generic:: webserver classes to own class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2122 [21:39:05] hey, would anyone like to check this out ? 
i got fed up with webserver classes having different names and renamed them all to the same file [21:39:24] mark: https://gerrit.wikimedia.org/r/#change,2084,patchset=3 ;] [21:48:06] New review: RobH; "Self review is the best kind of review" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2084 [21:48:07] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2084 [21:49:23] LeslieCarr: behold http://d1hsxkpnft2izn.cloudfront.net/image.aspx/media/images/_web-assets/mens/Ingenuity/longtail_HC_box.png-402x479 [21:49:51] is that real ? [21:49:54] because that's awesome [21:49:55] It is! [21:50:24] the same company sells a line of jeans for men called 'ballroom' :) [21:51:13] http://www.duluthtrading.com/store/mens/duluth-ingenuity/mens-ballroom-jeans/features/AD_ballroom.aspx [21:51:27] Generally products targeted for plumbers. [21:51:56] New patchset: RobH; "tagged sockpuppet into misc::ipmimgmthost role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2123 [21:52:29] oh man, that ad is great. [21:52:41] New review: RobH; "seems fine, just adding a misc role to sockpuppet" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2123 [21:55:02] oh my god [21:55:11] i approve of this company [21:55:49] * jeremyb glares @ the jeans [21:56:22] Change abandoned: RobH; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2123 [21:57:42] far and away my biggest problem with jeans is that the pockets grow holes (so, e.g. coins fall out and down your pants!) when the rest of the garment is in perfect working order [21:58:07] makes me want to proactively sew an extra layer on the pockets before something breaks [21:58:12] jeremyb: Those same jeans have reinforced pockets so that you can carry pliers and screwdrivers and such. [21:58:35] * andrewbogott bursts with Minnesota pride [21:58:48] SF now or where? [22:00:19] have you met [[garrison Keillor]]? [22:00:23] maplebed: having trouble getting LVS working? [22:00:46] hrm, i don't have that problem. i do have a major problem with women's jeans that i can't fit anything in their pockets [22:00:57] like a lot of them i can't get my wallet in them, and my wallet is small [22:00:58] :( [22:00:59] jeremyb: He's from Saint Paul, those guys are jerks. [22:01:24] eww, hate prarie home companion [22:01:47] mark: haven't been trying. [22:02:05] mark: I've been formatting ms-be1 with its last two partitions and putting together the cluster. [22:02:11] my favorite thing was once i was on my bike, stuck behind a convertible blasting PHC… i started ranting about how much i hate it and then the guy turned it off [22:02:37] hehe [22:02:55] I like it and hate it both at once. The show seems to get more condescending as it ages. [22:03:17] orly [22:03:32] Also, weirdly, Keillor actually likes rock and roll, he only plays all that weird bluegrass and irish folk because he knows his Public Radio demo. [22:04:45] * jeremyb pictures jorm on phc [22:05:24] it's just not funny [22:05:26] or interesting [22:05:31] * andrewbogott realizes he should not slander celtic music in a room full of sysadmins [22:05:32] and it's played so often on the weekends [22:06:41] LeslieCarr obviously doesn't have enough stations to choose from [22:06:52] In the 80's their house band was great (with Butch Thompson) and most of my find memories are from... very very long ago. 
[22:07:32] when i want to rent a car it's usually the weekends… and there's hardly anything good playing [22:07:43] And they used to have Peter Ostrouschko on a lot, and he gets a lifetime pass for being on Blood On The Tracks [22:08:07] ...and now I'll stop dropping names of Minnesotan musicians who no one has heard of! [22:09:40] New patchset: RobH; "added in ipmi mgmt host misc to sockpuppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2124 [22:09:57] New patchset: Asher; "new mysql monitoring, test on two dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2125 [22:10:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2124 [22:10:17] New review: RobH; "added ipmi mgmt host entry for sockpuppet" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2124 [22:10:18] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2124 [22:12:06] err: /Stage[main]/Puppetmaster::Gitclone/File[/var/lib/git/operations/private/.git/hooks/post-merge]/ensure: change from absent to file failed: Could not set 'file on ensure: No such file or directory - /var/lib/git/operations/private/.git/hooks/post-merge.puppettmp_6467 at /var/lib/git/operations/puppet/manifests/puppetmaster.pp:139 [22:12:16] so that shows on sockpuppet during a puppet run [22:12:36] mark: (any ideas what that really means?) [22:12:39] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2122 [22:12:40] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2122 [22:12:49] i didnt change anything in the private repo [22:12:58] so not sure wtf is causing sockpuppet to vomit on puppet run [22:13:23] different error now, disregard... wtf [22:14:04] !log poking at puppet change breaking things on sockpuppet puppet runs [22:14:05] Logged the message, RobH [22:14:14] https://en.wikipedia.org/wiki/Garrison_Keillor#In_popular_culture [22:14:43] New patchset: Asher; "new mysql monitoring, test on two dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2125 [22:15:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2125 [22:15:02] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2125 [22:15:03] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2125 [22:15:38] RobH: that's due to leslie's work yesterday [22:15:46] ok, now i have another error, heh [22:15:51] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class misc::ipmimgmthost for sockpuppet.pmtpa.wmnet at /var/lib/git/operations/puppet/manifests/site.pp:1775 on node sockpuppet.pmtpa.wmnet [22:15:57] yet site.pp imports misc/* [22:16:12] oh [22:16:13] and not a typo on the sockpuppet include line [22:16:15] ipmimgmthostwtfbbq11111 [22:16:26] sorry robh that specific line is a known issue [22:16:35] please rename that to misc::management::ipmi or so [22:16:36] yea it stopped bitching about that after two runs [22:16:42] cool [22:17:49] New patchset: RobH; "removing my change to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2126 [22:17:59] LeslieCarr: i'm going to merge your last change on sockpuppet if that's ok [22:18:07] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2126 [22:18:08] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2126 [22:18:15] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2126 [22:18:16] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2126 [22:18:55] sure [22:19:03] thanks binasher [22:19:09] binasher: you merging my revoking that lind in site.pp as well i think [22:19:32] lemme know when its live so i can update sockpuppet puppet run please [22:20:21] New patchset: Asher; "fix typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2127 [22:20:36] mark: you mean like class misc::management::ipmi ? [22:20:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2127 [22:20:42] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2127 [22:20:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2127 [22:21:09] RobH: yeah [22:21:17] in misc/management.pp [22:21:18] RobH: your change should be live [22:21:23] thx [22:23:03] New patchset: RobH; "renaming to more easily read misc::management::ipmi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2128 [22:23:36] mark: care to glance at https://gerrit.wikimedia.org/r/#change,2128 ? [22:24:18] sec [22:24:37] please rename that file to misc/management.pp [22:24:40] as is the class name [22:25:14] New patchset: Asher; "update class name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2129 [22:26:05] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2129 [22:26:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2129 [22:27:04] New patchset: RobH; "renaming to more easily read misc::management::ipmi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2128 [22:27:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2128 [22:28:35] New patchset: Asher; "typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2130 [22:28:53] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2130 [22:28:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2130 [22:28:58] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2130 [22:28:59] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2130 [22:29:57] mark: https://gerrit.wikimedia.org/r/#change,2128 like better? [22:30:36] also remove the "host" in the description [22:30:40] it already appends "server" [22:30:52] in the system_role [22:30:55] otherwise, good [22:33:02] New patchset: RobH; "renaming to more easily read misc::management::ipmi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2128 [22:33:19] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2128 [22:36:11] New patchset: RobH; "renaming to more easily read misc::management::ipmi added in sockpuppet to role of ipmi mgmt Change-Id: I828cf708396493e413580839bc6fc1fde5314d4f" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2128 [22:36:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2128 [22:36:49] Ok, so now thats all set, updated site.pp, need someone other than me to review [22:36:52] https://gerrit.wikimedia.org/r/#change,2128,patchset=3 [22:36:58] ack, bad link [22:37:09] https://gerrit.wikimedia.org/r/#change,2128 [22:37:31] !g 2128 [22:37:31] https://gerrit.wikimedia.org/r/2128 [22:37:36] RobH: ---^^ [22:37:48] damn wmbot is fancy these days [22:39:02] Also, git-review allows you to download it with git review -d 2128 [22:39:06] (another shameless plug) [22:40:09] * RobH is going to give it another few minutes before he self reviews. [22:41:52] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2128 [22:41:59] PROBLEM - RAID on srv193 is CRITICAL: Connection refused by host [22:42:10] New review: RobH; "self review is like self help, doomed to fail" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2128 [22:42:11] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2128 [22:42:29] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: Connection refused by host [22:42:29] PROBLEM - Disk space on cp1043 is CRITICAL: Connection refused by host [22:42:49] PROBLEM - Disk space on es2 is CRITICAL: Connection refused by host [22:42:49] PROBLEM - RAID on es2 is CRITICAL: Connection refused by host [22:43:29] PROBLEM - Disk space on snapshot3 is CRITICAL: Connection refused by host [22:43:39] PROBLEM - DPKG on srv193 is CRITICAL: Connection refused by host [22:44:09] PROBLEM - RAID on cp1043 is CRITICAL: Connection refused by host [22:44:31] New patchset: Bhartshorne; "adding LVS addresses to ms-fe boxen. Removing ms-be as ganglia aggregators." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2131 [22:44:39] PROBLEM - MySQL disk space on es2 is CRITICAL: Connection refused by host [22:44:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2131 [22:45:06] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2131 [22:45:07] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2131 [22:45:49] PROBLEM - Disk space on srv193 is CRITICAL: Connection refused by host [22:46:10] PROBLEM - DPKG on cp1041 is CRITICAL: Connection refused by host [22:46:19] PROBLEM - RAID on bast1001 is CRITICAL: Connection refused by host [22:46:19] PROBLEM - MySQL disk space on db1018 is CRITICAL: Connection refused by host [22:46:39] PROBLEM - DPKG on ms5 is CRITICAL: Connection refused by host [22:46:49] PROBLEM - RAID on ganglia1001 is CRITICAL: Connection refused by host [22:47:13] huh [22:47:18] ipmi stuff on fenari to mw1001 runs [22:47:19] PROBLEM - RAID on virt2 is CRITICAL: Connection refused by host [22:47:23] ipmi stuff on sockpuppet to mw1001 fails [22:47:39] PROBLEM - MySQL disk space on db22 is CRITICAL: Connection refused by host [22:47:53] mark or LeslieCarr is the mgmt network for eqiad not supposed to be reachable from sockpuppet? 
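A quick way to see where fenari and sockpuppet diverge for mw1001.mgmt.eqiad.wmnet, since name resolution works on both hosts but only fenari gets replies; these are generic diagnostics, not commands taken from the log:

    ping -c 2 mw1001.mgmt.eqiad.wmnet                       # works from fenari, fails from sockpuppet
    ip route get "$(dig +short mw1001.mgmt.eqiad.wmnet)"    # which interface/next hop each host picks for the mgmt subnet
    traceroute -n mw1001.mgmt.eqiad.wmnet                   # where the path stops from sockpuppet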
[22:47:59] PROBLEM - RAID on ms5 is CRITICAL: Connection refused by host [22:48:02] cuz i cannot even ping mw1001.mgmt from sockpuppet, but can from fenari [22:48:09] PROBLEM - jenkins_service_running on aluminium is CRITICAL: Connection refused by host [22:48:16] hence my mgmt stuffs on sockpuppet now arent useful =[ [22:48:24] RobH: my guess is that it is FQDN [22:48:34] use fqdn [22:48:35] i am pinging with the fqdn [22:48:38] i am. [22:48:39] PROBLEM - DPKG on db1007 is CRITICAL: Connection refused by host [22:48:39] PROBLEM - DPKG on cp1043 is CRITICAL: Connection refused by host [22:48:49] PROBLEM - Disk space on cp1041 is CRITICAL: Connection refused by host [22:48:49] PROBLEM - RAID on db22 is CRITICAL: Connection refused by host [22:49:09] PROBLEM - RAID on snapshot3 is CRITICAL: Connection refused by host [22:49:11] pinging mw1001.mgmt.eqiad.wmnet from sockpuppet fails [22:49:33] i can host and get IP from it, it knows what it is [22:49:39] just cannot route to it from sockpuppet it seems. [22:49:49] PROBLEM - Disk space on ms5 is CRITICAL: Connection refused by host [22:50:39] PROBLEM - Disk space on es3 is CRITICAL: Connection refused by host [22:50:39] PROBLEM - DPKG on virt2 is CRITICAL: Connection refused by host [22:50:39] PROBLEM - DPKG on db1008 is CRITICAL: Connection refused by host [22:50:49] PROBLEM - RAID on db1018 is CRITICAL: Connection refused by host [22:50:49] PROBLEM - RAID on db1008 is CRITICAL: Connection refused by host [22:50:59] PROBLEM - DPKG on ganglia1001 is CRITICAL: Connection refused by host [22:50:59] PROBLEM - MySQL disk space on db1020 is CRITICAL: Connection refused by host [22:50:59] PROBLEM - RAID on es1003 is CRITICAL: Connection refused by host [22:50:59] PROBLEM - Disk space on es1003 is CRITICAL: Connection refused by host [22:50:59] PROBLEM - DPKG on es2 is CRITICAL: Connection refused by host [22:51:00] PROBLEM - MySQL disk space on es3 is CRITICAL: Connection refused by host [22:51:49] PROBLEM - DPKG on snapshot3 is CRITICAL: Connection refused by host [22:51:49] PROBLEM - DPKG on snapshot1 is CRITICAL: Connection refused by host [22:51:59] RECOVERY - RAID on srv193 is OK: OK: no RAID installed [22:52:19] PROBLEM - DPKG on bast1001 is CRITICAL: Connection refused by host [22:52:29] PROBLEM - mobile traffic loggers on cp1041 is CRITICAL: Connection refused by host [22:52:29] PROBLEM - RAID on cp1041 is CRITICAL: Connection refused by host [22:52:29] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 2 processes with command name varnishncsa [22:52:39] PROBLEM - Disk space on virt2 is CRITICAL: Connection refused by host [22:52:39] PROBLEM - Disk space on virt4 is CRITICAL: Connection refused by host [22:52:39] PROBLEM - Disk space on db1008 is CRITICAL: Connection refused by host [22:52:39] PROBLEM - DPKG on db22 is CRITICAL: Connection refused by host [22:52:39] RECOVERY - Disk space on cp1043 is OK: DISK OK [22:52:49] PROBLEM - Disk space on db22 is CRITICAL: Connection refused by host [22:52:49] PROBLEM - MySQL disk space on db1007 is CRITICAL: Connection refused by host [22:52:49] PROBLEM - Disk space on srv223 is CRITICAL: Connection refused by host [22:52:49] RECOVERY - Disk space on es2 is OK: DISK OK [22:52:49] RECOVERY - RAID on es2 is OK: OK: State is Optimal, checked 2 logical device(s) [22:52:59] PROBLEM - Disk space on db26 is CRITICAL: Connection refused by host [22:52:59] PROBLEM - Disk space on ganglia1001 is CRITICAL: Connection refused by host [22:52:59] PROBLEM - Disk space on es4 is CRITICAL: Connection 
refused by host [22:53:09] PROBLEM - RAID on es4 is CRITICAL: Connection refused by host [22:53:09] PROBLEM - Disk space on db1020 is CRITICAL: Connection refused by host [22:53:29] PROBLEM - Disk space on db1007 is CRITICAL: Connection refused by host [22:53:49] RECOVERY - DPKG on srv193 is OK: All packages OK [22:53:49] RECOVERY - Disk space on snapshot3 is OK: DISK OK [22:53:49] PROBLEM - Disk space on snapshot1 is CRITICAL: Connection refused by host [22:53:49] PROBLEM - Disk space on bast1001 is CRITICAL: Connection refused by host [22:53:59] PROBLEM - DPKG on srv238 is CRITICAL: Connection refused by host [22:53:59] RECOVERY - Disk space on ms-fe2 is OK: DISK OK [22:53:59] PROBLEM - DPKG on srv190 is CRITICAL: Connection refused by host [22:54:09] PROBLEM - RAID on srv223 is CRITICAL: Connection refused by host [22:54:19] PROBLEM - Disk space on cp1042 is CRITICAL: Connection refused by host [22:54:19] PROBLEM - MySQL disk space on db1008 is CRITICAL: Connection refused by host [22:54:19] RECOVERY - RAID on cp1043 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [22:54:29] PROBLEM - DPKG on aluminium is CRITICAL: Connection refused by host [22:54:29] PROBLEM - RAID on aluminium is CRITICAL: Connection refused by host [22:54:29] PROBLEM - Disk space on srv276 is CRITICAL: Connection refused by host [22:54:39] PROBLEM - RAID on db1020 is CRITICAL: Connection refused by host [22:54:39] PROBLEM - Disk space on db1018 is CRITICAL: Connection refused by host [22:54:39] PROBLEM - RAID on srv276 is CRITICAL: Connection refused by host [22:54:49] RECOVERY - MySQL disk space on es2 is OK: DISK OK [22:54:59] PROBLEM - DPKG on srv276 is CRITICAL: Connection refused by host [22:55:09] RECOVERY - Memcached on ms-fe2 is OK: TCP OK - 0.001 second response time on port 11211 [22:55:09] PROBLEM - MySQL disk space on db26 is CRITICAL: Connection refused by host [22:55:49] PROBLEM - DPKG on db1018 is CRITICAL: Connection refused by host [22:55:59] RECOVERY - Disk space on srv193 is OK: DISK OK [22:56:09] PROBLEM - Disk space on srv238 is CRITICAL: Connection refused by host [22:56:19] PROBLEM - mobile traffic loggers on cp1042 is CRITICAL: Connection refused by host [22:56:19] PROBLEM - Disk space on srv190 is CRITICAL: Connection refused by host [22:56:19] RECOVERY - DPKG on cp1041 is OK: All packages OK [22:56:19] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [22:56:19] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [22:56:29] PROBLEM - Disk space on aluminium is CRITICAL: Connection refused by host [22:56:29] RECOVERY - RAID on bast1001 is OK: OK: no RAID installed [22:56:39] PROBLEM - RAID on db1007 is CRITICAL: Connection refused by host [22:56:39] PROBLEM - RAID on db1002 is CRITICAL: Connection refused by host [22:56:39] RECOVERY - MySQL disk space on db1018 is OK: DISK OK [22:56:39] PROBLEM - DPKG on db1020 is CRITICAL: Connection refused by host [22:56:49] PROBLEM - DPKG on db25 is CRITICAL: Connection refused by host [22:56:49] PROBLEM - RAID on db26 is CRITICAL: Connection refused by host [22:56:49] RECOVERY - DPKG on ms5 is OK: All packages OK [22:56:59] RECOVERY - RAID on ganglia1001 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:57:19] PROBLEM - RAID on db1034 is CRITICAL: Connection refused by host [22:57:39] PROBLEM - RAID on virt4 is CRITICAL: Connection refused by host [22:57:39] RECOVERY - RAID on virt2 is OK: OK: State is Optimal, checked 2 logical device(s) [22:57:59] PROBLEM - DPKG 
on srv223 is CRITICAL: Connection refused by host [22:57:59] RECOVERY - MySQL disk space on db22 is OK: DISK OK [22:57:59] PROBLEM - Disk space on db13 is CRITICAL: Connection refused by host [22:58:09] PROBLEM - DPKG on srv239 is CRITICAL: Connection refused by host [22:58:09] RECOVERY - RAID on ms5 is OK: OK: Active: 50, Working: 50, Failed: 0, Spare: 0 [22:58:19] RECOVERY - jenkins_service_running on aluminium is OK: PROCS OK: 3 processes with args jenkins [22:58:19] PROBLEM - Disk space on srv272 is CRITICAL: Connection refused by host [22:58:39] PROBLEM - Disk space on db53 is CRITICAL: Connection refused by host [22:58:39] PROBLEM - MySQL disk space on es4 is CRITICAL: Connection refused by host [22:58:49] PROBLEM - DPKG on db1002 is CRITICAL: Connection refused by host [22:58:49] RECOVERY - DPKG on cp1043 is OK: All packages OK [22:58:49] RECOVERY - DPKG on db1007 is OK: All packages OK [22:58:59] PROBLEM - DPKG on db26 is CRITICAL: Connection refused by host [22:58:59] RECOVERY - Disk space on cp1041 is OK: DISK OK [22:58:59] PROBLEM - RAID on snapshot1 is CRITICAL: Connection refused by host [22:58:59] PROBLEM - Disk space on db25 is CRITICAL: Connection refused by host [22:58:59] RECOVERY - RAID on db22 is OK: OK: 1 logical device(s) checked [22:59:09] PROBLEM - DPKG on db52 is CRITICAL: Connection refused by host [22:59:09] PROBLEM - MySQL disk space on db13 is CRITICAL: Connection refused by host [22:59:09] PROBLEM - RAID on db53 is CRITICAL: Connection refused by host [22:59:19] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [22:59:19] PROBLEM - Disk space on db1001 is CRITICAL: Connection refused by host [22:59:19] PROBLEM - DPKG on es3 is CRITICAL: Connection refused by host [22:59:29] PROBLEM - RAID on es3 is CRITICAL: Connection refused by host [22:59:29] PROBLEM - RAID on snapshot2 is CRITICAL: Connection refused by host [22:59:29] PROBLEM - Disk space on srv201 is CRITICAL: Connection refused by host [22:59:49] PROBLEM - RAID on srv238 is CRITICAL: Connection refused by host [22:59:59] RECOVERY - RAID on ms-fe2 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [22:59:59] PROBLEM - RAID on cp1042 is CRITICAL: Connection refused by host [23:00:09] RECOVERY - Disk space on ms5 is OK: DISK OK [23:00:29] PROBLEM - DPKG on mw3 is CRITICAL: Connection refused by host [23:00:39] PROBLEM - Disk space on srv239 is CRITICAL: Connection refused by host [23:00:49] RECOVERY - Disk space on es3 is OK: DISK OK [23:00:59] RECOVERY - DPKG on db1008 is OK: All packages OK [23:00:59] PROBLEM - DPKG on virt4 is CRITICAL: Connection refused by host [23:00:59] RECOVERY - RAID on db1008 is OK: OK: State is Optimal, checked 2 logical device(s) [23:00:59] RECOVERY - RAID on db1018 is OK: OK: State is Optimal, checked 2 logical device(s) [23:01:09] RECOVERY - DPKG on virt2 is OK: All packages OK [23:01:09] PROBLEM - DPKG on es4 is CRITICAL: Connection refused by host [23:01:09] PROBLEM - Disk space on db47 is CRITICAL: Connection refused by host [23:01:09] RECOVERY - DPKG on ganglia1001 is OK: All packages OK [23:01:09] PROBLEM - DPKG on db53 is CRITICAL: Connection refused by host [23:01:10] RECOVERY - MySQL disk space on db1020 is OK: DISK OK [23:01:10] RECOVERY - Disk space on es1003 is OK: DISK OK [23:01:11] RECOVERY - RAID on es1003 is OK: OK: State is Optimal, checked 2 logical device(s) [23:01:19] PROBLEM - Disk space on srv192 is CRITICAL: Connection refused by host [23:01:19] PROBLEM - MySQL disk space on db1034 is CRITICAL: Connection refused by host [23:01:29] RECOVERY - 
DPKG on es2 is OK: All packages OK [23:01:29] PROBLEM - Disk space on db52 is CRITICAL: Connection refused by host [23:01:29] RECOVERY - MySQL disk space on es3 is OK: DISK OK [23:01:29] PROBLEM - DPKG on db1017 is CRITICAL: Connection refused by host [23:01:29] PROBLEM - RAID on searchidx2 is CRITICAL: Connection refused by host [23:01:39] RECOVERY - DPKG on ms-fe2 is OK: All packages OK [23:01:59] PROBLEM - Disk space on mw3 is CRITICAL: Connection refused by host [23:01:59] RECOVERY - DPKG on snapshot3 is OK: All packages OK [23:01:59] RECOVERY - DPKG on snapshot1 is OK: All packages OK [23:02:09] PROBLEM - RAID on mw1115 is CRITICAL: Connection refused by host [23:02:10] PROBLEM - DPKG on snapshot2 is CRITICAL: Connection refused by host [23:02:19] PROBLEM - DPKG on srv192 is CRITICAL: Connection refused by host [23:02:29] PROBLEM - RAID on srv190 is CRITICAL: Connection refused by host [23:02:39] RECOVERY - mobile traffic loggers on cp1041 is OK: PROCS OK: 2 processes with command name varnishncsa [23:02:49] PROBLEM - DPKG on cp1042 is CRITICAL: Connection refused by host [23:02:49] PROBLEM - MySQL disk space on db1002 is CRITICAL: Connection refused by host [23:02:49] RECOVERY - Disk space on virt2 is OK: DISK OK [23:02:49] RECOVERY - Disk space on virt4 is OK: DISK OK [23:02:49] RECOVERY - Disk space on db1008 is OK: DISK OK [23:02:59] RECOVERY - DPKG on bast1001 is OK: All packages OK [23:02:59] PROBLEM - Disk space on db1002 is CRITICAL: Connection refused by host [23:02:59] RECOVERY - Disk space on db22 is OK: DISK OK [23:02:59] RECOVERY - MySQL disk space on db1007 is OK: DISK OK [23:03:09] RECOVERY - Disk space on srv223 is OK: DISK OK [23:03:09] RECOVERY - DPKG on db22 is OK: All packages OK [23:03:09] RECOVERY - RAID on cp1041 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:03:09] PROBLEM - MySQL disk space on db1001 is CRITICAL: Connection refused by host [23:03:19] RECOVERY - Disk space on db26 is OK: DISK OK [23:03:19] PROBLEM - MySQL disk space on db25 is CRITICAL: Connection refused by host [23:03:19] RECOVERY - Disk space on ganglia1001 is OK: DISK OK [23:03:19] RECOVERY - RAID on es4 is OK: OK: State is Optimal, checked 2 logical device(s) [23:03:29] RECOVERY - Disk space on es4 is OK: DISK OK [23:03:39] RECOVERY - Disk space on db1020 is OK: DISK OK [23:03:39] RECOVERY - Disk space on db1007 is OK: DISK OK [23:03:49] PROBLEM - Disk space on mw56 is CRITICAL: Connection refused by host [23:03:49] PROBLEM - DPKG on searchidx2 is CRITICAL: Connection refused by host [23:03:49] PROBLEM - Disk space on snapshot2 is CRITICAL: Connection refused by host [23:03:59] RECOVERY - Disk space on snapshot1 is OK: DISK OK [23:03:59] RECOVERY - Disk space on bast1001 is OK: DISK OK [23:03:59] PROBLEM - RAID on srv201 is CRITICAL: Connection refused by host [23:04:09] RECOVERY - DPKG on srv238 is OK: All packages OK [23:04:29] RECOVERY - DPKG on srv190 is OK: All packages OK [23:04:29] PROBLEM - RAID on srv272 is CRITICAL: Connection refused by host [23:04:39] RECOVERY - MySQL disk space on db1008 is OK: DISK OK [23:04:39] PROBLEM - RAID on db1001 is CRITICAL: Connection refused by host [23:04:39] RECOVERY - DPKG on aluminium is OK: All packages OK [23:04:39] RECOVERY - Disk space on srv276 is OK: DISK OK [23:04:39] RECOVERY - RAID on aluminium is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [23:04:49] PROBLEM - Disk space on db1004 is CRITICAL: Connection refused by host [23:04:49] PROBLEM - MySQL disk space on db1017 is CRITICAL: Connection refused by host [23:04:49] 
RECOVERY - Disk space on db1018 is OK: DISK OK [23:04:49] RECOVERY - RAID on db1020 is OK: OK: State is Optimal, checked 2 logical device(s) [23:04:59] PROBLEM - RAID on db25 is CRITICAL: Connection refused by host [23:04:59] RECOVERY - RAID on srv223 is OK: OK: no RAID installed [23:04:59] RECOVERY - RAID on srv276 is OK: OK: no RAID installed [23:04:59] PROBLEM - DPKG on db13 is CRITICAL: Connection refused by host [23:04:59] PROBLEM - DPKG on db1034 is CRITICAL: Connection refused by host [23:05:00] PROBLEM - MySQL disk space on db53 is CRITICAL: Connection refused by host [23:05:09] PROBLEM - Disk space on db1033 is CRITICAL: Connection refused by host [23:05:09] RECOVERY - DPKG on srv276 is OK: All packages OK [23:05:30] RECOVERY - MySQL disk space on db26 is OK: DISK OK [23:05:59] PROBLEM - DPKG on mw1115 is CRITICAL: Connection refused by host [23:05:59] PROBLEM - DPKG on srv201 is CRITICAL: Connection refused by host [23:05:59] PROBLEM - Disk space on searchidx2 is CRITICAL: Connection refused by host [23:05:59] RECOVERY - DPKG on db1018 is OK: All packages OK [23:06:09] PROBLEM - RAID on srv191 is CRITICAL: Connection refused by host [23:06:19] PROBLEM - DPKG on srv272 is CRITICAL: Connection refused by host [23:06:29] RECOVERY - Disk space on srv190 is OK: DISK OK [23:06:29] RECOVERY - mobile traffic loggers on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa [23:06:29] PROBLEM - MySQL disk space on db52 is CRITICAL: Connection refused by host [23:06:39] PROBLEM - DPKG on db1001 is CRITICAL: Connection refused by host [23:06:39] PROBLEM - DPKG on db1033 is CRITICAL: Connection refused by host [23:06:39] RECOVERY - Disk space on aluminium is OK: DISK OK [23:06:39] PROBLEM - Disk space on db1034 is CRITICAL: Connection refused by host [23:06:49] PROBLEM - MySQL disk space on db1004 is CRITICAL: Connection refused by host [23:06:49] PROBLEM - MySQL disk space on db1033 is CRITICAL: Connection refused by host [23:06:49] RECOVERY - DPKG on db1020 is OK: All packages OK [23:06:49] RECOVERY - RAID on db1007 is OK: OK: State is Optimal, checked 2 logical device(s) [23:06:49] RECOVERY - RAID on db1002 is OK: OK: State is Optimal, checked 2 logical device(s) [23:06:59] RECOVERY - DPKG on db25 is OK: All packages OK [23:06:59] RECOVERY - RAID on db26 is OK: OK: 1 logical device(s) checked [23:07:09] PROBLEM - RAID on db52 is CRITICAL: Connection refused by host [23:07:29] PROBLEM - RAID on mw3 is CRITICAL: Connection refused by host [23:07:29] PROBLEM - Disk space on mw1115 is CRITICAL: Connection refused by host [23:07:29] RECOVERY - RAID on db1034 is OK: OK: State is Optimal, checked 2 logical device(s) [23:07:39] PROBLEM - MySQL disk space on db47 is CRITICAL: Connection refused by host [23:07:49] RECOVERY - RAID on virt4 is OK: OK: State is Optimal, checked 2 logical device(s) [23:08:09] RECOVERY - DPKG on srv223 is OK: All packages OK [23:08:09] RECOVERY - Disk space on db13 is OK: DISK OK [23:08:19] RECOVERY - DPKG on srv239 is OK: All packages OK [23:08:29] RECOVERY - Disk space on srv272 is OK: DISK OK [23:08:49] RECOVERY - Disk space on db53 is OK: DISK OK [23:08:49] RECOVERY - MySQL disk space on es4 is OK: DISK OK [23:08:59] RECOVERY - DPKG on db1002 is OK: All packages OK [23:08:59] PROBLEM - Disk space on db1017 is CRITICAL: Connection refused by host [23:09:09] RECOVERY - DPKG on db26 is OK: All packages OK [23:09:09] PROBLEM - DPKG on db47 is CRITICAL: Connection refused by host [23:09:09] RECOVERY - RAID on snapshot1 is OK: OK: no RAID installed [23:09:09] 
RECOVERY - Disk space on db25 is OK: DISK OK [23:09:19] RECOVERY - DPKG on db52 is OK: All packages OK [23:09:19] RECOVERY - MySQL disk space on db13 is OK: DISK OK [23:09:19] RECOVERY - RAID on db53 is OK: OK: State is Optimal, checked 12 logical device(s) [23:09:29] RECOVERY - Disk space on db1001 is OK: DISK OK [23:09:29] RECOVERY - DPKG on es3 is OK: All packages OK [23:09:29] PROBLEM - RAID on db16 is CRITICAL: Connection refused by host [23:09:40] RECOVERY - Disk space on srv201 is OK: DISK OK [23:09:40] RECOVERY - RAID on snapshot2 is OK: OK: no RAID installed [23:09:40] RECOVERY - RAID on es3 is OK: OK: State is Optimal, checked 2 logical device(s) [23:09:40] RECOVERY - Disk space on srv238 is OK: DISK OK [23:09:49] RECOVERY - RAID on srv238 is OK: OK: no RAID installed [23:09:59] PROBLEM - DPKG on srv191 is CRITICAL: Connection refused by host [23:10:09] PROBLEM - mobile traffic loggers on cp1044 is CRITICAL: Connection refused by host [23:10:09] RECOVERY - RAID on cp1042 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:10:20] PROBLEM - RAID on mw56 is CRITICAL: Connection refused by host [23:10:22] PROBLEM - RAID on srv192 is CRITICAL: Connection refused by host [23:10:39] PROBLEM - Disk space on srv191 is CRITICAL: Connection refused by host [23:10:49] RECOVERY - DPKG on mw3 is OK: All packages OK [23:10:59] PROBLEM - RAID on db1017 is CRITICAL: Connection refused by host [23:11:00] RECOVERY - Disk space on srv239 is OK: DISK OK [23:11:11] PROBLEM - MySQL disk space on db1019 is CRITICAL: Connection refused by host [23:11:11] PROBLEM - RAID on es1001 is CRITICAL: Connection refused by host [23:11:19] RECOVERY - Disk space on db47 is OK: DISK OK [23:11:19] PROBLEM - DPKG on db11 is CRITICAL: Connection refused by host [23:11:19] RECOVERY - DPKG on es4 is OK: All packages OK [23:11:19] RECOVERY - DPKG on db53 is OK: All packages OK [23:11:19] RECOVERY - DPKG on virt4 is OK: All packages OK [23:11:20] PROBLEM - Disk space on es1001 is CRITICAL: Connection refused by host [23:11:29] PROBLEM - RAID on db1033 is CRITICAL: Connection refused by host [23:11:39] RECOVERY - Disk space on srv192 is OK: DISK OK [23:11:39] RECOVERY - MySQL disk space on db1034 is OK: DISK OK [23:11:39] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [23:11:52] PROBLEM - DPKG on mw56 is CRITICAL: Connection refused by host [23:11:52] RECOVERY - Disk space on db52 is OK: DISK OK [23:11:52] RECOVERY - DPKG on db1017 is OK: All packages OK [23:11:59] PROBLEM - Disk space on db1035 is CRITICAL: Connection refused by host [23:12:19] RECOVERY - DPKG on snapshot2 is OK: All packages OK [23:12:19] PROBLEM - RAID on nfs1 is CRITICAL: Connection refused by host [23:12:19] RECOVERY - Disk space on mw3 is OK: DISK OK [23:12:19] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed [23:12:29] PROBLEM - Disk space on nfs1 is CRITICAL: Connection refused by host [23:12:29] PROBLEM - DPKG on db1003 is CRITICAL: Connection refused by host [23:12:29] RECOVERY - RAID on srv190 is OK: OK: no RAID installed [23:12:39] PROBLEM - DPKG on nfs1 is CRITICAL: Connection refused by host [23:12:39] RECOVERY - DPKG on srv192 is OK: All packages OK [23:12:59] PROBLEM - DPKG on db1004 is CRITICAL: Connection refused by host [23:12:59] RECOVERY - DPKG on cp1042 is OK: All packages OK [23:13:16] PROBLEM - Disk space on db16 is CRITICAL: Connection refused by host [23:13:16] PROBLEM - Disk space on db1003 is CRITICAL: Connection refused by host [23:13:16] PROBLEM - RAID on db1004 is CRITICAL: 
Connection refused by host [23:13:16] RECOVERY - MySQL disk space on db1002 is OK: DISK OK [23:13:16] PROBLEM - DPKG on es1001 is CRITICAL: Connection refused by host [23:13:26] PROBLEM - RAID on db1003 is CRITICAL: Connection refused by host [23:13:36] PROBLEM - RAID on db11 is CRITICAL: Connection refused by host [23:14:06] RECOVERY - DPKG on mw56 is OK: All packages OK [23:14:56] PROBLEM - MySQL disk space on db1035 is CRITICAL: Connection refused by host [23:14:56] RECOVERY - RAID on db1033 is OK: OK: State is Optimal, checked 2 logical device(s) [23:15:06] RECOVERY - MySQL disk space on db47 is OK: DISK OK [23:15:06] RECOVERY - MySQL disk space on db52 is OK: DISK OK [23:15:06] PROBLEM - MySQL disk space on es1001 is CRITICAL: Connection refused by host [23:15:46] RECOVERY - Disk space on mw56 is OK: DISK OK [23:15:46] RECOVERY - MySQL disk space on db1001 is OK: DISK OK [23:15:46] RECOVERY - RAID on nfs1 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:16:06] RECOVERY - RAID on srv201 is OK: OK: no RAID installed [23:16:26] RECOVERY - Disk space on db1002 is OK: DISK OK [23:16:36] PROBLEM - RAID on cp1044 is CRITICAL: Connection refused by host [23:16:36] RECOVERY - DPKG on db1004 is OK: All packages OK [23:16:46] PROBLEM - RAID on db1019 is CRITICAL: Connection refused by host [23:16:46] PROBLEM - Disk space on db11 is CRITICAL: Connection refused by host [23:16:46] RECOVERY - Disk space on db1017 is OK: DISK OK [23:16:46] RECOVERY - Disk space on db16 is OK: DISK OK [23:16:46] RECOVERY - DPKG on db1033 is OK: All packages OK [23:16:56] RECOVERY - MySQL disk space on db53 is OK: DISK OK [23:16:56] RECOVERY - RAID on db1004 is OK: OK: State is Optimal, checked 2 logical device(s) [23:17:06] RECOVERY - MySQL disk space on db25 is OK: DISK OK [23:17:09] hey everyone - so it appears that a common thread on these borked machines is nagios-nrpe didn't exit correctly [23:17:26] RECOVERY - DPKG on mw1115 is OK: All packages OK [23:17:46] RECOVERY - Disk space on nfs1 is OK: DISK OK [23:17:56] RECOVERY - Disk space on searchidx2 is OK: DISK OK [23:18:06] RECOVERY - RAID on srv191 is OK: OK: no RAID installed [23:18:06] RECOVERY - DPKG on srv201 is OK: All packages OK [23:18:06] RECOVERY - RAID on srv272 is OK: OK: no RAID installed [23:18:36] RECOVERY - Disk space on cp1042 is OK: DISK OK [23:18:36] RECOVERY - Disk space on snapshot2 is OK: DISK OK [23:18:36] PROBLEM - MySQL disk space on db1003 is CRITICAL: Connection refused by host [23:18:36] RECOVERY - Disk space on db1004 is OK: DISK OK [23:18:36] RECOVERY - RAID on db1001 is OK: OK: State is Optimal, checked 2 logical device(s) [23:18:46] PROBLEM - MySQL disk space on db11 is CRITICAL: Connection refused by host [23:18:46] RECOVERY - DPKG on db13 is OK: All packages OK [23:19:16] RECOVERY - RAID on db1017 is OK: OK: State is Optimal, checked 2 logical device(s) [23:19:26] RECOVERY - DPKG on es1001 is OK: All packages OK [23:19:36] RECOVERY - RAID on mw3 is OK: OK: no RAID installed [23:19:56] RECOVERY - MySQL disk space on db1017 is OK: DISK OK [23:20:06] RECOVERY - DPKG on srv191 is OK: All packages OK [23:20:06] RECOVERY - Disk space on db1033 is OK: DISK OK [23:20:06] RECOVERY - RAID on db25 is OK: OK: 1 logical device(s) checked [23:20:16] RECOVERY - DPKG on srv272 is OK: All packages OK [23:20:26] RECOVERY - RAID on db52 is OK: OK: State is Optimal, checked 2 logical device(s) [23:20:46] RECOVERY - Disk space on db1034 is OK: DISK OK [23:20:46] RECOVERY - MySQL disk space on db1033 is OK: DISK OK [23:20:56] RECOVERY - 
DPKG on db47 is OK: All packages OK [23:20:56] RECOVERY - RAID on es1001 is OK: OK: State is Optimal, checked 2 logical device(s) [23:21:36] RECOVERY - Disk space on mw1115 is OK: DISK OK [23:21:36] RECOVERY - DPKG on db1034 is OK: All packages OK [23:21:46] RECOVERY - DPKG on nfs1 is OK: All packages OK [23:22:06] RECOVERY - Disk space on srv191 is OK: DISK OK [23:22:06] RECOVERY - RAID on srv192 is OK: OK: no RAID installed [23:22:16] RECOVERY - DPKG on searchidx2 is OK: All packages OK [23:22:36] RECOVERY - RAID on db1003 is OK: OK: State is Optimal, checked 2 logical device(s) [23:23:06] RECOVERY - MySQL disk space on db1004 is OK: DISK OK [23:23:06] RECOVERY - RAID on db16 is OK: OK: 1 logical device(s) checked [23:23:06] RECOVERY - RAID on db11 is OK: OK: 1 logical device(s) checked [23:23:16] RECOVERY - RAID on mw56 is OK: OK: no RAID installed [23:23:26] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa [23:24:06] RECOVERY - DPKG on db1001 is OK: All packages OK [23:24:56] RECOVERY - MySQL disk space on db1019 is OK: DISK OK [23:25:06] RECOVERY - Disk space on db1035 is OK: DISK OK [23:25:06] RECOVERY - Disk space on es1001 is OK: DISK OK [23:25:06] RECOVERY - DPKG on db11 is OK: All packages OK [23:25:06] RECOVERY - MySQL disk space on db1035 is OK: DISK OK [23:25:16] PROBLEM - DPKG on es1002 is CRITICAL: Connection refused by host [23:25:26] RECOVERY - MySQL disk space on es1001 is OK: DISK OK [23:26:06] RECOVERY - DPKG on db1003 is OK: All packages OK [23:26:40] !log Deploying modified squid configs of modified squid config generator to text.knams [23:26:42] Logged the message, Master [23:26:56] RECOVERY - Disk space on db11 is OK: DISK OK [23:26:56] RECOVERY - RAID on cp1044 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:26:56] RECOVERY - RAID on db1019 is OK: OK: State is Optimal, checked 2 logical device(s) [23:27:06] RECOVERY - Disk space on db1003 is OK: DISK OK [23:27:16] PROBLEM - RAID on es1002 is CRITICAL: Connection refused by host [23:27:26] PROBLEM - Disk space on es1002 is CRITICAL: Connection refused by host [23:28:42] !log Deployed squid configs to all squids [23:28:44] Logged the message, Master [23:28:56] RECOVERY - MySQL disk space on db11 is OK: DISK OK [23:29:06] RECOVERY - MySQL disk space on db1003 is OK: DISK OK [23:29:06] PROBLEM - MySQL disk space on es1002 is CRITICAL: Connection refused by host [23:29:08] the nrpe init script is screwy [23:29:13] very [23:34:20] New patchset: Bhartshorne; "correcting container name for eqiad test swift cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2132 [23:34:46] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2132 [23:34:46] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2132 [23:34:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2132 [23:35:36] RECOVERY - DPKG on es1002 is OK: All packages OK [23:37:16] PROBLEM - mysqld processes on db32 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld [23:37:26] RECOVERY - RAID on es1002 is OK: OK: State is Optimal, checked 2 logical device(s) [23:37:36] RECOVERY - Disk space on es1002 is OK: DISK OK [23:39:26] RECOVERY - MySQL disk space on es1002 is OK: DISK OK [23:40:44] New patchset: Asher; "fix mysqld check, since doesn't match mysqld_safe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2133 [23:41:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2133 [23:41:06] PROBLEM - mysqld processes on db36 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld [23:41:13] uhoh, pushing out nrpe_local.cfg again.. nrpe mayhem [23:41:35] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2133 [23:41:36] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2133 [23:45:31] heh.. /etc/nagios/nrpe.cfg tells nrpe to try to create a pidfile at /var/run/nrpe.pid, which it can't as the nagios user. while the init script tells start-stop-daemon to look at /var/run/nagios/nrpe.pid [23:45:48] ah... [23:45:51] hehehe [23:46:14] we don't push out either file via puppet, it's broken in the package [23:46:16] wow [23:46:36] wow that's impressive [23:47:15] This is why I want labs [23:47:36] RECOVERY - mysqld processes on db32 is OK: PROCS OK: 1 process with command name mysqld [23:47:43] There are quite a few puppet classes that are just functional enough to still appear to work, but just broken enough to not be able to actually spin up a new box [23:49:16] :) [23:51:14] New patchset: Asher; "override the incorrect pid_file def in the packaged config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2134 [23:51:30] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2134 [23:51:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2134 [23:51:36] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2134 [23:51:37] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2134 [23:52:26] PROBLEM - RAID on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:16] RECOVERY - Frontend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.200 seconds [23:54:36] PROBLEM - DPKG on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:15] PROBLEM - MySQL disk space on db30 is CRITICAL: Connection refused by host [23:58:05] PROBLEM - Disk space on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
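The pid_file mismatch called out at 23:45:31 (nrpe.cfg telling the daemon to write /var/run/nrpe.pid, which the nagios user cannot create, while the init script's start-stop-daemon looks for /var/run/nagios/nrpe.pid) is what the override in change 2134 addresses. As an illustrative sketch only, and not the actual contents of that change, a Puppet resource along the following lines would push the override through /etc/nagios/nrpe_local.cfg, assuming the packaged nrpe.cfg includes that file and that the later pid_file definition takes precedence:

```puppet
# Hypothetical sketch, not the real content of https://gerrit.wikimedia.org/r/2134.
# Point nrpe's pidfile at the path the init script's start-stop-daemon already
# watches, in a directory the nagios user can write to.
class nrpe::pidfile_override {
    file { '/etc/nagios/nrpe_local.cfg':
        owner   => 'root',
        group   => 'root',
        mode    => '0444',
        content => "pid_file=/var/run/nagios/nrpe.pid\n",
        notify  => Service['nagios-nrpe-server'],
    }

    service { 'nagios-nrpe-server':
        ensure => running,
    }
}
```

With the pidfile written where start-stop-daemon expects it, stops and restarts from the init script can find the running daemon again, which fits the "nagios-nrpe didn't exit correctly" pattern noted at 23:17:09 and the wave of "Connection refused by host" check results above.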