[00:23:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:25:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.337 seconds
[00:26:58] !log disabled lvm snapshots and puppet on db32 for revision sha1 alter
[00:27:02] Logged the message, Master
[00:28:26] !log started enwiki.revision alter on db32
[00:28:30] Logged the message, Master
[00:33:59] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 348 seconds
[00:34:26] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 376 seconds
[00:36:42] binasher: is nagios going to complain about db32 a lot now?
[00:37:00] during the enwiki.revision alter...
[00:37:19] i can disable notifications
[00:37:29] or try
[00:59:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:03:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.889 seconds
[01:35:42] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 69 MB (0% inode=61%): /var/lib/ureadahead/debugfs 69 MB (0% inode=61%):
[01:39:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:45:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.034 seconds
[01:46:30] RECOVERY - Disk space on srv219 is OK: DISK OK
[02:19:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:48] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[02:25:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds
[03:04:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Mar 22 03:03:51 UTC 2012
[03:34:21] PROBLEM - Disk space on stafford is CRITICAL: DISK CRITICAL - free space: /var/lib/puppet 758 MB (3% inode=92%):
[04:27:44] RECOVERY - Disk space on stafford is OK: DISK OK
[05:53:14] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[05:55:20] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[06:03:26] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[06:03:26] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[06:29:32] PROBLEM - Disk space on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:08] PROBLEM - RAID on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:56] PROBLEM - DPKG on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
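Context for the db32 !log entries at 00:26-00:28 above: the "revision sha1 alter" is a schema change adding a SHA-1 column to enwiki's revision table on that replica, which is why the replication-delay alerts follow. The actual statement is not shown in the log; below is a minimal sketch of that kind of change, assuming the MediaWiki rev_sha1 column definition:

    # Run on db32 only, with puppet and LVM snapshots disabled as logged above;
    # the slave is expected to lag while the table is rebuilt.
    mysql enwiki -e "ALTER TABLE revision ADD COLUMN rev_sha1 varbinary(32) NOT NULL DEFAULT '';"
    # Afterwards, watch replication catch back up:
    mysql -e "SHOW SLAVE STATUS\G" | grep -i seconds_behind_master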
[06:47:06] PROBLEM - Disk space on search1015 is CRITICAL: DISK CRITICAL - free space: /a 3220 MB (2% inode=99%):
[07:58:07] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 235 MB (3% inode=61%): /var/lib/ureadahead/debugfs 235 MB (3% inode=61%):
[07:58:07] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 255 MB (3% inode=61%): /var/lib/ureadahead/debugfs 255 MB (3% inode=61%):
[07:58:16] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 19 MB (0% inode=61%): /var/lib/ureadahead/debugfs 19 MB (0% inode=61%):
[08:02:19] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[08:02:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 251 MB (3% inode=61%): /var/lib/ureadahead/debugfs 251 MB (3% inode=61%):
[08:04:34] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 4 MB (0% inode=61%): /var/lib/ureadahead/debugfs 4 MB (0% inode=61%):
[08:10:34] RECOVERY - Disk space on srv224 is OK: DISK OK
[08:10:43] RECOVERY - Disk space on srv219 is OK: DISK OK
[08:10:52] RECOVERY - Disk space on srv222 is OK: DISK OK
[08:14:55] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 215 MB (3% inode=61%): /var/lib/ureadahead/debugfs 215 MB (3% inode=61%):
[08:15:13] RECOVERY - Disk space on srv221 is OK: DISK OK
[08:17:01] RECOVERY - Disk space on srv220 is OK: DISK OK
[08:17:01] RECOVERY - Disk space on srv223 is OK: DISK OK
[08:42:09] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours
[09:10:12] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[09:36:09] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:49:13] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[10:13:42] someone who know how gerrit and git works
[10:14:05] I am having troubles to access puppet repository we use on labs
[10:14:19] it seems that I am writing to production configs rather than labs
[10:14:31] how do I switch to labs branch
[10:14:38] ^demon: ^
[10:14:39] :P
[10:15:02] months ago I changed templates/nrpe_local
[10:15:09] that file seems to be completely different now
[10:15:25] I guess it's production one
[11:02:28] <^demon> Checkout test branch
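A note on the branch confusion above: a fresh clone of operations/puppet from gerrit leaves you on the production branch, while the labs configuration lives on the separate "test" branch that ^demon points to. A minimal sketch of switching; the remote name "origin" and the gerrit review ref are the usual defaults, assumed here rather than quoted from the log:

    # See what you have checked out and what branches the remote offers
    git branch
    git branch -r
    # Work against the labs/test branch instead of production
    git checkout -b test origin/test
    # Changes submitted for review then target that branch, e.g.
    git push origin HEAD:refs/for/test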
[14:45:35] !log db1020 still offline, requires firmware update on raid controller per rt 2621, will perform later today
[14:45:39] Logged the message, RobH
[14:51:17] New patchset: Pyoungmeister; "2 more lvs pools for search" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3380
[14:51:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3380
[14:51:45] mark: can you look at ^^ when you have a chance?
[15:08:43] !log shutting down search1015 & search1016 for hdd additions
[15:08:46] Logged the message, RobH
[15:12:26] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[15:13:20] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100%
[15:21:49] New review: Mark Bergsma; "Looks ok, except for the regexes in site.pp." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/3380
[15:34:10] Dell's design on the R410 line is pretty slick, even for non hot swap drives
[15:34:24] the cabled drives still have nice carriers that remove easily
[15:34:31] tell me about them :P
[15:34:45] not so great for 2.5" ssds :P
[15:39:07] better than the hot swaps for that ;]
[15:39:23] but yea, no enterprise grade servers seem to fit the 3.5 to 2.5 adapters easily.
[15:44:54] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.85 ms
[15:45:24] !log search 1015 and search1016 back up with added disks
[15:45:25] notpeter: ^
[15:45:27] Logged the message, RobH
[15:45:49] notpeter: one hopes that your search data lives in an LVM ;]
[15:46:00] though i suppose reinstall isnt a big deal
[15:48:25] RobH: definitely reinstall time
[15:48:26] thank you!
[15:48:30] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms
[15:50:34] cmjohnson1: i am stepping out to lunch, but i wanna confirm some stuff on order before i go
[15:50:39] actually, when i get back is fine
[15:50:46] you will be around today in about an hour or so right?
[15:50:57] robh: yeah..i am around...ping me when you get back
[15:50:59] cool
[15:51:12] back shortly
[15:51:16] damn..irc keeps kicking me out
[15:54:21] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[15:56:27] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[16:05:27] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[16:05:27] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[16:37:24] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 275 MB (3% inode=61%): /var/lib/ureadahead/debugfs 275 MB (3% inode=61%):
[16:48:41] cmjohnson1: Ok, I know we reviewed this but I did not write it down, so I forget.
[16:49:10] for row C in pmtpa, I recall you need the following, 1 mgmt switch, 1 access switch, and 2 of the fiber modules for install in asw-c1 and asw-c3
[16:49:19] as the ends of each row access switch get fiber module.
[16:49:43] or it may be you only have the 1 spare ex4200 right now
[16:49:45] i dont quite recall
[16:49:59] rob, i need 2 managment switches, 3 access switches and I have 1 spare 4200
[16:50:26] i would need the fiber modules for the spare 4200
[16:51:05] and I need more D rings...that ticket is stalled i believe
[16:51:29] RECOVERY - Disk space on srv221 is OK: DISK OK
[16:53:26] ahh, ok, so lets keep 1 spare ex4200
[16:53:36] so i need to order you all the access switch and modules for row c
[16:53:40] and a single mgmt switch
[16:53:48] cmjohnson1: I have that assigned to mark who have have pinged about it
[16:53:52] and just did so again
[16:54:50] cmjohnson1: lemme know when the power cables i ordered for you arrive so i can close that ticket, in fact I am just kicking it to you ;]
[16:54:55] yeah
[16:54:56] oh, it arleady is, nm
[16:55:09] get 3 switches, 2 sfp+ modules, extra PSUs
[16:55:22] hmm
[16:55:30] or maybe we can add row C to the row D stack
[16:55:38] otherwise that MX80 fills up so quickly
[16:55:48] so without the sfp+ modules
[16:55:53] and extra long stacking cables
[16:56:13] cmjohnson1: can you measure whether a thick stacking cable from C2 would reach into D2?
[16:56:20] I believe the longest length is 5m
[16:56:34] yes, i will measure
[16:56:40] i doubt that it will reach.
[16:56:52] considering the location of the access switch in d1
[16:56:54] we can stack with fiber too I guess
[16:57:18] so still need fiber modules for the end cabinets yes?
[16:57:23] dunno yet
[16:57:26] wait :P
[16:57:37] we could ...but 5m should be enough to make the trip...it is not that far but I will measure
[16:58:21] brb in 5
[16:59:01] I don't really want to spend another 2 10G MX80 ports on a 3-rack row
[16:59:21] mark: https://rt.wikimedia.org/Ticket/Display.html?id=2685 is for two mgmt switches, assigned to you.
[17:00:13] approved
[17:00:15] you can order those
[17:01:41] doesnt ct have to approve?
[17:01:48] no, it's within my approval limit
[17:01:56] (which is the same as CT's anyway...)
[17:02:04] the ex4200 ticket https://rt.wikimedia.org/Ticket/Display.html?id=2684 is assigned to you so you can tell me if you want fiber modules, if we answer here in irc i can steal it back.
[17:02:19] depends on what chris will tell us
[17:02:21] so the ct escalation is so he can get erik to sign off then
[17:02:23] if 5m can make it I would prefer
[17:02:26] yeah
[17:02:30] they cannot get longer than 5m eh?
[17:02:35] and of course it's good for him to be involved in big orders too
[17:02:42] but these small ones don't make a dent in the budget anyway ;)
[17:02:44] no
[17:03:32] you can stack with fiber as well, but then you need the uplink modules, and the speed is lower
[17:03:37] (10G vs 32 G)
[17:04:01] still better than going to the MX80
[17:04:44] will two 10g connections suffice if we put somethign like caching in there?
[17:04:49] or will we just need to not do that
[17:05:17] preferably not no ;)
[17:05:33] i want to put ciscos there but i have reservations about the power limitations in pmtpa.
[17:05:49] you'd need two racks then
[17:05:55] i figure labs use would be perfectly fine there network wise
[17:06:03] mark and robh: it will be really tight...with some measuring cushion I am just over 5m
[17:06:11] ok
[17:06:16] then its not gonna work, you need a good 2' slack
[17:06:24] if you want to plug and unplug them without unracking the switches
[17:06:33] and d1 access switch is halfway down the rack
[17:06:41] unless you measured from mid rack?
[17:06:44] and it needs to be a braid
[17:06:47] so it's not just to d1
[17:06:49] also one to d2
[17:06:51] etc
[17:06:58] indeed, i think the run is too long for a 5m
[17:07:03] yeah
[17:07:05] I would expect to put 10m fiber in for it.
[17:07:06] i'll check if 10m is available
[17:07:08] but I don't think so
[17:08:49] mgmt switches ordered.
[17:10:12] 5m is max
[17:10:13] :(
[17:10:17] that sucks
[17:10:22] well get 2 fiber modules then
[17:10:27] indeed, stealing ticket back
[17:11:39] RT times out now when i try to attach pdfs
[17:11:44] been happening for the past few days
[17:12:00] I was hoping it was my cable modem being slow, but as im on the wifi in the datacenter, on our network
[17:12:03] something is up with it.
[17:12:15] i dont have time to check it out now, going to drop an unassigned ticekt in core-ops
[17:14:44] mark: can we using stacking cables from c1 to d1? 5m would work
[17:15:06] cmjohnson1: well, for proper redundancy, we need to make a braid
[17:15:08] so from c2 to d1
[17:15:11] and from c1 to d2
[17:15:28] because otherwise if c1 or d1 switch dies, everything dies
[17:15:47] it's always setup as a ring
[17:15:58] (although that is difficult to see physically :)
[17:16:05] ahh i see
[17:17:06] cmjohnson1: https://rt.wikimedia.org/Ticket/Attachment/12933/9358/EX4200%20stack%20braid.png
[17:17:15] that's how the 8-rack long rows in eqiad are connected
[17:18:41] hrm
[17:18:46] we could actually buy one stacking cable
[17:18:51] and finish the ring with one fiber
[17:18:55] then that fiber would normally be unused
[17:18:58] only if the ring breaks
[17:19:07] how about that
[17:19:48] yeah I like that
[17:19:54] RobH: order one 5m stacking cable
[17:21:28] they come with the shorter ones right?
[17:21:36] sorry, nm, they come iwth the tttttiny ones
[17:21:37] they come with 50cm
[17:21:38] yeah
[17:21:43] so how will these join in the row?
[17:21:49] dont i need to order more than 1 stacking?
[17:21:57] no
[17:22:03] because we do one end with stacking cable
[17:22:05] and one end with fiber
[17:22:09] best of both worlds
[17:22:18] but how do the siwtches within the row join?
[17:22:26] oh, yeah, you should order those
[17:22:27] 3m cables
[17:22:31] =P
[17:22:34] i hate the 3m
[17:22:36] but for interconnecting the rows you need to order one 5m
[17:22:45] rack to rack its really tight.
[17:22:45] (or you get all 5m, whatever works)
[17:22:51] ok, i prefer that =]
[17:23:01] so 4 5m cables, 3 ex4200, two fiber modules
[17:23:18] yes
[17:23:19] as the 5m will only be able to join c3 to d3
[17:23:22] and 3 extra PSUs
[17:23:23] ok
[17:23:30] yep
[17:23:48] I should send some modules from esams
[17:23:55] all 6 switches in esams have the 2x XFP module
[17:23:58] and we're only using 2 of them
[17:23:59] c3 to d3 will not work for stacking cables
[17:24:05] cmjohnson1: that's fine
[17:24:08] we'll run a fiber for that
[17:25:17] those ex4200s are so damn heavy for switches
[17:25:22] so just need the 3 5m cables then.
[17:25:23] they're like servers
[17:25:25] ?
[17:25:30] RobH: yeah but get 4
[17:25:34] who knows, maybe we can make it
[17:25:39] otherwise we'll use it next time
[17:25:43] good to have
[17:25:43] doesnt hurt ot have a spare on site either.
[17:25:52] it can just sit with the spare ex4200
[17:26:02] yes
[17:26:12] (3) EX4200 with redundant power supplies
[17:26:13] (2) 4xGigE/2x10G SFP+ modules
[17:26:13] (4) 5m stacking cables
[17:26:18] EX4200-48T
[17:26:22] ahh, yes
[17:26:30] (3) EX4200-48T with redundant power supplies
[17:26:33] yep
[17:26:45] http://www.juniper.net/us/en/products-services/switching/ex-series/ex4200/#ordering
[17:28:59] oh, what warranty should i ask for
[17:29:05] i just sent the request without specifying that =P
[17:29:26] whatever we got previously
[17:29:34] we have a spare, so warranty return and only software support is fine
[17:29:39] ok, he will do that i assume, so we will check when it comes back
[17:29:50] i just sent you the latest price list
[17:30:45] food, bbl
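The braid discussion above is standard EX4200 Virtual Chassis practice: dedicated stacking cables form a ring, and where a cable cannot reach (here, closing the ring between rows C and D), a port on the SFP+/uplink module can be converted into a virtual-chassis port and run over fiber instead. A rough sketch of the Junos CLI involved; the pic-slot and port numbers are illustrative assumptions, not values from this log:

    # On each switch whose uplink-module port should close the ring,
    # convert that port into a virtual-chassis port (VCP):
    request virtual-chassis vc-port set pic-slot 1 port 0
    # Then confirm every member joined and the ring is complete:
    show virtual-chassis
    show virtual-chassis vc-port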
[17:37:26] Any update on the state of the ciscos?
[17:37:58] New patchset: Pyoungmeister; "2 more lvs pools for search" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3380
[17:38:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3380
[17:39:37] oh yes, that reminds me, thanks dschoon
[17:39:50] mark: would you like to allocate a subnet for the virt vlan in eqiad
[17:40:08] i was going to work on those for ryan last night, but realized we didnt ask you guys (you or leslie) about that yet
[17:40:46] also ryan is havin gissues getting it to dhcp, so i was going to try these here with a physical console. i think he also wants to have you or leslie check the network to see if you see dhcp requests from that port (virt5) when its attemptee
[17:40:49] attempted.
[17:41:20] <3
[17:41:42] yeah, my understanding is that ryan_lane can't get them to pixiboot
[17:41:47] brb
[17:42:02] correct. having issues getting them to pxe
[17:42:29] * [1]schoolcraftT finds the closest large object and gives [1]schoolcraftT a slap with it
[17:43:43] mark: https://gerrit.wikimedia.org/r/#change,3380 does that look better? if yes, I will go ahead and make dns changes as needed
[17:52:23] why does no one sell an m5 screw shorter than 0.8
[17:52:26] i need a .5
[17:53:24] oh well, the .8 will work, just may need to dremel off part of it
[18:04:56] RobH -- did i miss anything about the ciscos when i relogged?
[18:05:11] nope, asked mark about the subnet but I expect he is busy so i will have to drop a networking ticket
[18:05:25] that being siad, i will be on site here again tomorrow so if its not today its then
[18:05:26] coolio. just making sure
[18:05:30] its getting late in his time zone =]
[18:05:30] sweet.
[18:05:30] np
[18:05:32] thank you.
[18:05:46] we're all totally psyched about getting started on those machines
[18:05:49] quite welcome
[18:12:56] cmjohnson1: So i had to reorder a different screw for grounding the racks
[18:13:09] but now that i have the small qty from lowes (all they had) to size, i am order a ton more
[18:13:10] ok..the others didn't fit
[18:13:20] so i am going to hold off shipping you the SFPs until these come in
[18:13:27] also may wait for the labels, which i am ordering today
[18:13:30] the asset tags that is
[18:13:44] since honestly the asset tags are the ones you need soonest i imagine =]
[18:13:51] ok...not pressing
[18:14:00] also going to send you a bunch of hdd screws
[18:14:05] no...i am good for about another 50 items
[18:14:09] just getting low
[18:14:09] cuz im tired of scrambling for them, so ordered two bags of 100
[18:14:13] ok, cool
[18:14:33] so i expect to send all this stuff to you midweek next week
[18:14:40] basicalyl screws will be in tomorrrow, labels are the hold up.
[18:14:49] trying to stay ahead...also while you were out ...had to replace a hdd on brewster it was the only 1tb drive in the DC...not sure if you want to order a spare
[18:15:18] nah, cuz sometimes we need sas
[18:15:26] ok
[18:15:27] brewster was misc so thats sata
[18:15:40] so easier to just order as needed, plus hdd costs are still kinda high right now
[18:15:40] th
[18:15:52] the floods in asia messed up hdd production
[18:16:03] seems there are only half a dozen factories on earth making the drive motors
[18:16:06] and they all got flooded
[18:19:51] hrmm, ps1-a1-eqiad has failed to respond to observium once since being grounded.
[18:20:04] i think it may have been a combination of the mgmt switch and the ps
[18:20:17] gonna have to ground mgmt switches, which use same screw as servertechs atleast.
[18:20:30] (well, the juniper non mgmt do, gotta check the mgmt)
[18:27:38] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 212 MB (2% inode=61%): /var/lib/ureadahead/debugfs 212 MB (2% inode=61%):
[18:31:34] Ryan_Lane: so you may wanna have chris attempt netboot on the virt5
[18:31:45] since he can attach a physical console to see if it perhpas has output
[18:31:52] i cannot do here since the network isnt setup yet
[18:37:50] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 27 MB (0% inode=61%): /var/lib/ureadahead/debugfs 27 MB (0% inode=61%):
[18:38:08] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 203 MB (2% inode=61%): /var/lib/ureadahead/debugfs 203 MB (2% inode=61%):
[18:38:08] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 263 MB (3% inode=61%): /var/lib/ureadahead/debugfs 263 MB (3% inode=61%):
[18:39:21] New patchset: Asher; "fix digi malaysia ip range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3411
[18:39:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3411
[18:39:52] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3411
[18:39:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3411
[18:40:14] RECOVERY - Disk space on srv222 is OK: DISK OK
[18:40:14] RECOVERY - Disk space on srv220 is OK: DISK OK
[18:43:50] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours
[18:46:14] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 227 MB (3% inode=61%): /var/lib/ureadahead/debugfs 227 MB (3% inode=61%):
[18:48:38] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 163 MB (2% inode=61%): /var/lib/ureadahead/debugfs 163 MB (2% inode=61%):
[18:50:26] RECOVERY - Disk space on srv224 is OK: DISK OK
[18:50:44] RECOVERY - Disk space on srv219 is OK: DISK OK
[19:00:33] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3380
[19:11:53] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[19:37:36] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:48:30] New patchset: Hashar; "remove +x bits from files of /srv/org/mediawiki/integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3433
[19:48:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3433
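On the virt5 PXE trouble mentioned at 17:40 and 18:31 above: one quick check is whether the machine's DHCP request ever reaches the install/DHCP server while netboot is attempted. A minimal sketch; the interface name and the assumption that the install server (e.g. brewster) runs dhcpd are illustrative, not confirmed by this log:

    # On the DHCP/install server, watch for the DHCPDISCOVER while the box tries to PXE
    tcpdump -n -e -i eth0 port 67 or port 68
    # Or see whether dhcpd logged a discover/offer for it
    grep -i 'dhcpdiscover\|dhcpoffer' /var/log/syslog | tail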
[19:49:18] PROBLEM - Host ms1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:50:21] RECOVERY - Disk space on ms1002 is OK: DISK OK
[19:50:30] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[19:50:30] RECOVERY - Host ms1002 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[19:50:48] RECOVERY - RAID on ms1002 is OK: OK: State is Optimal, checked 2 logical device(s)
[19:51:42] RECOVERY - DPKG on ms1002 is OK: All packages OK
[19:58:54] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Thu Mar 22 19:58:20 UTC 2012
[19:59:12] RECOVERY - Host magnesium is UP: PING WARNING - Packet loss = 37%, RTA = 65.15 ms
[20:01:49] !log magnesium goign down and up again, troubleshooting the disks
[20:01:53] Logged the message, RobH
[20:23:24] !rebuilding search1015 and 1016 for disk shuffles
[20:25:34] !log rebuilding search1015 and 1016 for disk shuffles
[20:25:38] Logged the message, and now dispaching a T1000 to your position to terminate you.
[20:26:12] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[20:29:21] RECOVERY - Puppet freshness on magnesium is OK: puppet ran at Thu Mar 22 20:29:10 UTC 2012
[20:32:08] maplebed: what exactly am i supposed to ask dell
[20:32:16] they will ask us to run a hardware test suite which we did
[20:32:30] so its either grounding, which I am trying to solve presently, or its our software, which isnt their problem
[20:32:42] (c2100 issue)
[20:33:03] so they are gonna say 'if the hardware tests dont fail, what would you like us to do?'
[20:33:28] well, what I would say is "we got this new hardware, we know it's new, the hardware test doesn't show anything, but they're crashing. Exactly the same software running on X does'nt crash, so ... umm... find us a solution."
[20:33:34] now, we have had some odd stuff that we think is grounding, and I just ordered a bunch of stuff ot fix it, which when it comes in tomorrow, i will be dropping half in a box to go to tampa
[20:33:42] they wont find us one
[20:33:52] but i am happy to get them to tell us to go pound sand for you ;]
[20:34:05] the whole reason to buy from a company like dell instead of something like silicon mechanics is that we get to say shit like "it's just not working and it's your job to fix it."
[20:34:07] ;)
[20:34:16] whatever.
[20:34:23] they are going to tell you what i said, but i will ask.
[20:34:25] that's what I would say.
[20:34:52] you're probably right, that they'll tell us to stfu,
[20:35:15] but ... yeah, that's all I've got.
[20:35:24] I really have no clue why they are failing
[20:35:37] i am reallllly hoping grounding things properly will eliminate the issue
[20:35:38] =P
[20:36:55] so the logs show no crazy errors post crash?
[20:37:24] i will also pull up the firmware on them and compare to recent releases
[20:37:38] once they are up to date and crash again, then i call dell, cuz they will ask that first ;]
[20:37:47] i will try to get to that tomorrow
[20:38:01] (anyone else who wants to can also update firmware, as it can be done remotely ;)
[20:38:48] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[20:41:07] I haven't been able to find anything in the logs.
[20:41:30] I figure that since it's a new hardware platform, they might be more receptive to 'weird shit's going on' kind s of reports.
[20:42:44] yea, i will ask
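For the C2100 crash hunting above: when the OS logs are clean, the BMC's System Event Log and the firmware revisions are the usual next stop, and both can be pulled remotely over IPMI before calling Dell. A minimal sketch; the management hostname and password file are placeholders, not real values from this log:

    # Read the System Event Log - hardware faults often land here even when syslog is empty
    ipmitool -I lanplus -H mgmt-host.example -U root -f /path/to/passfile sel elist
    # Note the BMC/firmware revision to compare against Dell's current release
    ipmitool -I lanplus -H mgmt-host.example -U root -f /path/to/passfile mc info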
[20:43:09] PROBLEM - SSH on search1015 is CRITICAL: Connection refused
[20:43:18] PROBLEM - DPKG on search1015 is CRITICAL: Connection refused by host
[20:44:57] PROBLEM - RAID on search1015 is CRITICAL: Connection refused by host
[20:50:22] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 326 seconds
[20:50:31] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused
[20:50:40] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 345 seconds
[20:52:20] !log stopping puppet on brewster temporarily
[20:52:24] Logged the message, and now dispaching a T1000 to your position to terminate you.
[20:56:49] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[20:57:07] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[20:59:04] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100%
[20:59:58] RECOVERY - SSH on search1015 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[21:05:13] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[21:10:10] PROBLEM - Disk space on search1016 is CRITICAL: Connection refused by host
[21:10:37] PROBLEM - RAID on search1016 is CRITICAL: Connection refused by host
[21:10:55] PROBLEM - SSH on search1016 is CRITICAL: Connection refused
[21:11:22] PROBLEM - DPKG on search1016 is CRITICAL: Connection refused by host
[21:17:58] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused
[21:28:37] PROBLEM - NTP on search1015 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:29:58] RECOVERY - SSH on search1016 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[21:34:50] RobH: are you around?
[21:34:55] PROBLEM - NTP on search1016 is CRITICAL: NTP CRITICAL: Offset unknown
[21:36:04] hey ops. what's our process for requesting techblog access ? I need phil to get an account
[21:36:16] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[21:37:37] RECOVERY - Disk space on search1016 is OK: DISK OK
[21:37:55] RECOVERY - RAID on search1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[21:38:04] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[21:38:31] RECOVERY - DPKG on search1016 is OK: All packages OK
[21:38:33] bug filed http://rt.wikimedia.org/Ticket/Display.html?id=2688
[21:39:57] tfinc: did they make accounts?
[21:40:08] anyone can make their own account, then an admin just promotes them
[21:40:28] there is no techblog
[21:40:30] there is just the blog.
[21:41:13] RECOVERY - NTP on search1016 is OK: NTP OK: Offset 0.08279848099 secs
[21:41:41] tfinc: also, are these employees? What level of access do you want them to have, the abilty to create posts but an editor needs to publish, or editor where they can edit any blog posting?
[21:42:43] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[21:45:50] tfinc: http://codex.wordpress.org/Roles_and_Capabilities So I need to know if they are going to be contributor (can create drafts but not publish), author (can publish and edit their own posts to the site), editor (can publish,write, and edit ALL posts)
[21:47:22] Editor
[21:47:42] i updated the bug
[21:48:29] ahh, so did it
[21:48:29] i
[21:48:37] so pchange is an author
[21:48:41] he can already post and edit his own
[21:48:49] You want him promoted to be able to do that with other folks posts?
[21:49:05] (i dont like us handing out editor like candy, so thats your call, by default i only promote staff to author
[21:49:34] jdrobson needs to make his own account, we dont create them, we just promote them
[21:49:37] anyone can make their own account
[21:49:39] i'm fine with it
[21:49:49] ok, well, that means they can edit ANYONEs post
[21:50:00] so if you are on record thats fine, but if the other folks get pissed im sending them to you ;]
[21:50:14] ZOMG anyone can edit
[21:50:18] hehe
[21:51:01] yea, but edit actions on wordpress arent all that evident
[21:51:09] i mean, there is no diff of what that person did versus the original poster
[21:51:12] but promoted pchange
[21:51:15] pchang even
[21:51:18] thanks!
[21:51:24] the other dude needs to make an account before its promoted
[21:51:28] sure thing
[21:51:40] we have an insane numbe rof admins
[21:51:45] so you dont need to come to me for this ;]
[21:52:00] Ryan_Lane told me to do it
[21:52:05] so blame him
[21:52:11] Guillaume, Jay, Lianna, matt roth, philippe, ryan lane
[21:52:13] Ryan_Lane: you ass
[21:52:16] you could do this
[21:52:18] ;p
[21:52:23] stacey merrick
[21:52:26] ;0
[21:52:28] tilman bayer
[21:52:28] heh
[21:52:46] tfinc: i dont mind, you may notice that you end up waiting on me is all
[21:52:56] cuz when i am in the datacente,r like now, i tend to ignore non datacenter things.
[21:53:04] but for you i promoted him ;]
[21:53:51] tfinc: i have no idea who promoted all these folks to admin
[21:53:55] its kind of insane and i dont like it
[21:54:03] but meh.
[21:54:17] when one of them breaks it, and i have to fix it, then i shall rain down hell upon their head.
[21:54:32] tfinc: may wanna have him ping me when he makes the account
[21:54:43] cuz i may not notice the RT update, since im in the datacenter all this week and prolly all next.
[21:54:57] and i am not keeping up to the entire queue anymore this week or next =]
[21:55:27] there may also be a list yer supposed to email, i have no idea.
[21:55:44] Guillaume would be the blog interaction expert, i just update the software =]
[21:59:26] RobH: ;)
[22:00:22] i figure if i shoudnt be promoting these folks someone else will yell at me
[22:00:29] ya know, like happened on officewiki a long time ago
[22:00:31] and foundationwiki
[22:00:36] and pretty much everywhere =P
[22:02:11] * RobH stopped posting to the blog after he got yelled at for pushing someone from top posting spot
[22:02:40] there's a calendar for that now.
[22:03:55] tfinc: make sure you pass on that scheduling when to push a blog post live is important and should be scheduled with the other folks that post to the blog.
[22:04:06] to avoid getting yelled at like RobH.
[22:09:26] meh, there wasnt one when i was yelled at
[22:09:31] it was just after the blog merge
[22:09:49] maplebed: any idea where that stuff is documented? links?
[22:10:06] the office wiki.
[22:10:09] pipeline = rewrite healthcheck cache swauth proxy-server
[22:10:15] blah.
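The "pipeline = ..." line at 22:10:09 looks like a stray paste from a Swift proxy configuration: it is the middleware pipeline in proxy-server.conf, naming the WSGI filters a request passes through (URL rewriting, health check, memcache, swauth authentication) before reaching the proxy server itself. Roughly, the surrounding fragment would look like the sketch below; this is a generic Swift layout of that era, assumed rather than copied from the real config:

    [pipeline:main]
    pipeline = rewrite healthcheck cache swauth proxy-server
    # each name above has a matching [filter:<name>] section, plus [app:proxy-server]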
[22:10:38] !log pushing a new zone file to add 2 more search-related vips for eqiad
[22:10:41] Logged the message, and now dispaching a T1000 to your position to terminate you.
[22:12:54] notpeter: everyone should be that paranoid about dns
[22:12:58] you get +1 sir
[22:13:17] still haven't taken down the site!
[22:13:26] wait, ever?
[22:13:40] ever
[22:13:44] we thought I had once
[22:13:49] but my shit was a red herring
[22:14:20] taking down the site now is a tiny bit harder than it used to be
[22:14:23] not a lot harder mind you
[22:14:31] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123
[22:14:41] but the entire push to puppet, review in gerrit does give an abstraction layer
[22:15:07] i suppose we need to get our dns stuff in there eventually.
[22:15:23] I think that having it in svn is fine
[22:16:48] we are ditching svn eventually i thought
[22:16:58] the idea of review in gerrit doesnt suck.
[22:17:13] i mean, svn diff works just fine as well, but less of a public facing audit trail
[22:17:34] i dont know of anything in our zone files that couldnt be on gerrit public either
[22:17:47] yeah
[22:17:58] I think it would be fine to put it into our git repo
[22:18:15] but I don't think that it's a big enough deal to warrant the effort
[22:18:23] hhhmmmm
[22:18:34] although, having dns be easier to revert is good in my mind
[22:19:49] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[22:19:54] !log all 3 dns servers are responding to digs after reload
[22:19:57] Logged the message, and now dispaching a T1000 to your position to terminate you.
[22:20:07] yea its a low priority task.
[22:20:14] +1 to move it to git
[22:20:24] im gonna drop a ticket in coreops for it
[22:20:28] with no assignee
[22:20:33] cool
[22:21:29] it can be another 'puppetize this' task that i will never be left alone long enough to do ;]
[22:21:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3380
[22:21:50] did that have the tone of dripping sarcasm, cuz it was meant to
[22:22:31] put a puppet on it!
[22:24:07] recall when hearing puppet made you think of the muppets and happiness rather than ruby and misery?
[22:24:12] heh
[22:25:04] PROBLEM - SSH on search1015 is CRITICAL: Connection refused
[22:25:37] I'm actually ok with puppet. it's nice to be able to add a couple of lines of code when you need a new service
[22:26:16] oh god, what did I just say? have the ruby brainworms gotten to me?
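On the zone push logged at 22:10 and confirmed at 22:19 above: after reloading, the quick sanity check is to query each authoritative server directly for one of the new records and confirm they all answer. A minimal sketch; the nameserver hostnames are placeholders, and only the record name is taken from the alerts later in this log:

    # Ask each authoritative server directly for one of the new search VIP records
    for ns in ns0.example ns1.example ns2.example; do
        dig @"$ns" search-pool4.svc.eqiad.wmnet A +short
    done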
[22:34:49] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[22:35:25] notpeter: time to old yeller ya in the corn crib, there is not coming back now
[22:35:33] once you start liking puppet, you are lost \
[23:01:13] RECOVERY - Host db1020 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[23:05:34] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: Connection refused by host
[23:05:43] PROBLEM - SSH on db1020 is CRITICAL: Connection refused
[23:06:02] PROBLEM - DPKG on db1020 is CRITICAL: Connection refused by host
[23:06:10] PROBLEM - mysqld processes on db1020 is CRITICAL: Connection refused by host
[23:06:12] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: Connection refused by host
[23:06:28] PROBLEM - Disk space on db1020 is CRITICAL: Connection refused by host
[23:06:28] PROBLEM - MySQL Slave Running on db1020 is CRITICAL: Connection refused by host
[23:06:55] PROBLEM - Full LVS Snapshot on db1020 is CRITICAL: Connection refused by host
[23:07:13] PROBLEM - MySQL disk space on db1020 is CRITICAL: Connection refused by host
[23:07:13] PROBLEM - MySQL Idle Transactions on db1020 is CRITICAL: Connection refused by host
[23:07:31] PROBLEM - RAID on db1020 is CRITICAL: Connection refused by host
[23:07:31] PROBLEM - MySQL Recent Restart on db1020 is CRITICAL: Connection refused by host
[23:13:13] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[23:17:22] Ok, cleaning up eqiad, headed home.
[23:18:21] !log db1020 firmware still updating, will check on it later tonight. offline until then
[23:18:24] Logged the message, RobH
[23:32:25] PROBLEM - NTP on db1020 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:38:52] RECOVERY - SSH on search1015 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[23:49:52] RECOVERY - Disk space on search1015 is OK: DISK OK
[23:50:46] PROBLEM - Host search-pool4.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.14)
[23:50:55] RECOVERY - RAID on search1015 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[23:51:04] PROBLEM - Host search-pool4.svc.pmtpa.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.1.14)
[23:51:13] PROBLEM - Host search-pool1.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[23:51:22] PROBLEM - Host search-prefix.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.15)
[23:51:40] RECOVERY - DPKG on search1015 is OK: All packages OK
[23:51:49] PROBLEM - Host search-pool2.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[23:51:58] PROBLEM - Host search-prefix.svc.pmtpa.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.1.15)
[23:52:07] PROBLEM - Host search-pool3.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[23:54:06] damnit
[23:54:11] sorry people
[23:54:15] those aren't live yet
[23:54:55] notpeter: would you page everyone saying that?
[23:55:19] (for those who might be near sleep and yet concerned)
[23:56:09] maplebed: what's the script on spence?
[23:56:18] err... page-something?
[23:56:27] page-all, probably.
[23:56:30] page-all?> page_all?
[23:56:37] either.
[23:56:41] (they're symlinks)
[23:58:18] done
[23:59:08] recieved. thanks!
[23:59:24] yep! thank you for reminding me