[00:23:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:25:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.337 seconds
[00:26:58] !log disabled lvm snapshots and puppet on db32 for revision sha1 alter
[00:27:02] Logged the message, Master
[00:28:26] !log started enwiki.revision alter on db32
[00:28:30] Logged the message, Master
[00:33:59] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 348 seconds
[00:34:26] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 376 seconds
[00:36:42] binasher: is nagios going to complain about db32 a lot now?
[00:37:00] during the enwiki.revision alter...
[00:37:19] i can disable notifications
[00:37:29] or try
[00:59:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:03:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.889 seconds
[01:35:42] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 69 MB (0% inode=61%): /var/lib/ureadahead/debugfs 69 MB (0% inode=61%):
[01:39:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:45:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.034 seconds
[01:46:30] RECOVERY - Disk space on srv219 is OK: DISK OK
[02:19:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:48] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[02:25:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds
[03:04:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Mar 22 03:03:51 UTC 2012
[03:34:21] PROBLEM - Disk space on stafford is CRITICAL: DISK CRITICAL - free space: /var/lib/puppet 758 MB (3% inode=92%):
[04:27:44] RECOVERY - Disk space on stafford is OK: DISK OK
[05:53:14] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[05:55:20] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[06:03:26] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[06:03:26] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[06:29:32] PROBLEM - Disk space on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:08] PROBLEM - RAID on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:56] PROBLEM - DPKG on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
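Context for the db32 !log entries at 00:26-00:28 above: the "revision sha1 alter" is a schema change adding a SHA-1 column to enwiki's revision table on that replica, which is why the replication-delay alerts follow. The actual statement is not shown in the log; below is a minimal sketch of that kind of change, assuming the MediaWiki rev_sha1 column definition:

    # Run on db32 only, with puppet and LVM snapshots disabled as logged above;
    # the slave is expected to lag while the table is rebuilt.
    mysql enwiki -e "ALTER TABLE revision ADD COLUMN rev_sha1 varbinary(32) NOT NULL DEFAULT '';"
    # Afterwards, watch replication catch back up:
    mysql -e "SHOW SLAVE STATUS\G" | grep -i seconds_behind_master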
[06:47:06] PROBLEM - Disk space on search1015 is CRITICAL: DISK CRITICAL - free space: /a 3220 MB (2% inode=99%):
[07:58:07] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 235 MB (3% inode=61%): /var/lib/ureadahead/debugfs 235 MB (3% inode=61%):
[07:58:07] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 255 MB (3% inode=61%): /var/lib/ureadahead/debugfs 255 MB (3% inode=61%):
[07:58:16] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 19 MB (0% inode=61%): /var/lib/ureadahead/debugfs 19 MB (0% inode=61%):
[08:02:19] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[08:02:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 251 MB (3% inode=61%): /var/lib/ureadahead/debugfs 251 MB (3% inode=61%):
[08:04:34] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 4 MB (0% inode=61%): /var/lib/ureadahead/debugfs 4 MB (0% inode=61%):
[08:10:34] RECOVERY - Disk space on srv224 is OK: DISK OK
[08:10:43] RECOVERY - Disk space on srv219 is OK: DISK OK
[08:10:52] RECOVERY - Disk space on srv222 is OK: DISK OK
[08:14:55] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 215 MB (3% inode=61%): /var/lib/ureadahead/debugfs 215 MB (3% inode=61%):
[08:15:13] RECOVERY - Disk space on srv221 is OK: DISK OK
[08:17:01] RECOVERY - Disk space on srv220 is OK: DISK OK
[08:17:01] RECOVERY - Disk space on srv223 is OK: DISK OK
[08:42:09] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours
[09:10:12] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[09:36:09] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:49:13] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[10:13:42] someone who know how gerrit and git works
[10:14:05] I am having troubles to access puppet repository we use on labs
[10:14:19] it seems that I am writing to production configs rather than labs
[10:14:31] how do I switch to labs branch
[10:14:38] ^demon: ^
[10:14:39] :P
[10:15:02] months ago I changed templates/nrpe_local
[10:15:09] that file seems to be completely different now
[10:15:25] I guess it's production one
[11:02:28] <^demon> Checkout test branch
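A note on the branch confusion above: a fresh clone of operations/puppet from gerrit leaves you on the production branch, while the labs configuration lives on the separate "test" branch that ^demon points to. A minimal sketch of switching; the remote name "origin" and the gerrit review ref are the usual defaults, assumed here rather than quoted from the log:

    # See what you have checked out and what branches the remote offers
    git branch
    git branch -r
    # Work against the labs/test branch instead of production
    git checkout -b test origin/test
    # Changes submitted for review then target that branch, e.g.
    git push origin HEAD:refs/for/test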
[14:45:35] !log db1020 still offline, requires firmware update on raid controller per rt 2621, will perform later today
[14:45:39] Logged the message, RobH
[14:51:17] New patchset: Pyoungmeister; "2 more lvs pools for search" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3380
[14:51:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3380
[14:51:45] mark: can you look at ^^ when you have a chance?
[15:08:43] !log shutting down search1015 & search1016 for hdd additions
[15:08:46] Logged the message, RobH
[15:12:26] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[15:13:20] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100%
[15:21:49] New review: Mark Bergsma; "Looks ok, except for the regexes in site.pp." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/3380
[15:34:10] Dell's design on the R410 line is pretty slick, even for non hot swap drives
[15:34:24] the cabled drives still have nice carriers that remove easily
[15:34:31] tell me about them :P
[15:34:45] not so great for 2.5" ssds :P
[15:39:07] better than the hot swaps for that ;]
[15:39:23] but yea, no enterprise grade servers seem to fit the 3.5 to 2.5 adapters easily.
[15:44:54] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.85 ms
[15:45:24] !log search 1015 and search1016 back up with added disks
[15:45:25] notpeter: ^
[15:45:27] Logged the message, RobH
[15:45:49] notpeter: one hopes that your search data lives in an LVM ;]
[15:46:00] though i suppose reinstall isnt a big deal
[15:48:25] RobH: definitely reinstall time
[15:48:26] thank you!
[15:48:30] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms
[15:50:34] cmjohnson1: i am stepping out to lunch, but i wanna confirm some stuff on order before i go
[15:50:39] actually, when i get back is fine
[15:50:46] you will be around today in about an hour or so right?
[15:50:57] robh: yeah..i am around...ping me when you get back
[15:50:59] cool
[15:51:12] back shortly
[15:51:16] damn..irc keeps kicking me out
[15:54:21] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[15:56:27] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[16:05:27] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[16:05:27] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[16:37:24] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 275 MB (3% inode=61%): /var/lib/ureadahead/debugfs 275 MB (3% inode=61%):
[16:48:41] cmjohnson1: Ok, I know we reviewed this but I did not write it down, so I forget.
[16:49:10] for row C in pmtpa, I recall you need the following, 1 mgmt switch, 1 access switch, and 2 of the fiber modules for install in asw-c1 and asw-c3
[16:49:19] as the ends of each row access switch get fiber module.
[16:49:43] or it may be you only have the 1 spare ex4200 right now
[16:49:45] i dont quite recall
[16:49:59] rob, i need 2 managment switches, 3 access switches and I have 1 spare 4200
[16:50:26] i would need the fiber modules for the spare 4200
[16:51:05] and I need more D rings...that ticket is stalled i believe
[16:51:29] RECOVERY - Disk space on srv221 is OK: DISK OK
[16:53:26] ahh, ok, so lets keep 1 spare ex4200
[16:53:36] so i need to order you all the access switch and modules for row c
[16:53:40] and a single mgmt switch
[16:53:48] cmjohnson1: I have that assigned to mark who have have pinged about it
[16:53:52] and just did so again
[16:54:50] cmjohnson1: lemme know when the power cables i ordered for you arrive so i can close that ticket, in fact I am just kicking it to you ;]
[16:54:55] yeah
[16:54:56] oh, it arleady is, nm
[16:55:09] get 3 switches, 2 sfp+ modules, extra PSUs
[16:55:22] hmm
[16:55:30] or maybe we can add row C to the row D stack
[16:55:38] otherwise that MX80 fills up so quickly
[16:55:48] so without the sfp+ modules
[16:55:53] and extra long stacking cables
[16:56:13] cmjohnson1: can you measure whether a thick stacking cable from C2 would reach into D2?
[16:56:20] I believe the longest length is 5m
[16:56:34] yes, i will measure
[16:56:40] i doubt that it will reach.
[16:56:52] considering the location of the access switch in d1
[16:56:54] we can stack with fiber too I guess
[16:57:18] so still need fiber modules for the end cabinets yes?
[16:57:23] dunno yet
[16:57:26] wait :P
[16:57:37] we could ...but 5m should be enough to make the trip...it is not that far but I will measure
[16:58:21] brb in 5
[16:59:01] I don't really want to spend another 2 10G MX80 ports on a 3-rack row
[16:59:21] mark: https://rt.wikimedia.org/Ticket/Display.html?id=2685 is for two mgmt switches, assigned to you.
[17:00:13] approved
[17:00:15] you can order those
[17:01:41] doesnt ct have to approve?
[17:01:48] no, it's within my approval limit
[17:01:56] (which is the same as CT's anyway...)
[17:02:04] the ex4200 ticket https://rt.wikimedia.org/Ticket/Display.html?id=2684 is assigned to you so you can tell me if you want fiber modules, if we answer here in irc i can steal it back.
[17:02:19] depends on what chris will tell us
[17:02:21] so the ct escalation is so he can get erik to sign off then
[17:02:23] if 5m can make it I would prefer
[17:02:26] yeah
[17:02:30] they cannot get longer than 5m eh?
[17:02:35] and of course it's good for him to be involved in big orders too
[17:02:42] but these small ones don't make a dent in the budget anyway ;)
[17:02:44] no
[17:03:32] you can stack with fiber as well, but then you need the uplink modules, and the speed is lower
[17:03:37] (10G vs 32 G)
[17:04:01] still better than going to the MX80
[17:04:44] will two 10g connections suffice if we put somethign like caching in there?
[17:04:49] or will we just need to not do that
[17:05:17] preferably not no ;)
[17:05:33] i want to put ciscos there but i have reservations about the power limitations in pmtpa.
[17:05:49] you'd need two racks then
[17:05:55] i figure labs use would be perfectly fine there network wise
[17:06:03] mark and robh: it will be really tight...with some measuring cushion I am just over 5m
[17:06:11] ok
[17:06:16] then its not gonna work, you need a good 2' slack
[17:06:24] if you want to plug and unplug them without unracking the switches
[17:06:33] and d1 access switch is halfway down the rack
[17:06:41] unless you measured from mid rack?
[17:06:44] and it needs to be a braid
[17:06:47] so it's not just to d1
[17:06:49] also one to d2
[17:06:51] etc
[17:06:58] indeed, i think the run is too long for a 5m
[17:07:03] yeah
[17:07:05] I would expect to put 10m fiber in for it.
[17:07:06] i'll check if 10m is available
[17:07:08] but I don't think so
[17:08:49] mgmt switches ordered.
[17:10:12] 5m is max
[17:10:13] :(
[17:10:17] that sucks
[17:10:22] well get 2 fiber modules then
[17:10:27] indeed, stealing ticket back
[17:11:39] RT times out now when i try to attach pdfs
[17:11:44] been happening for the past few days
[17:12:00] I was hoping it was my cable modem being slow, but as im on the wifi in the datacenter, on our network
[17:12:03] something is up with it.
[17:12:15] i dont have time to check it out now, going to drop an unassigned ticekt in core-ops
[17:14:44] mark: can we using stacking cables from c1 to d1? 5m would work
[17:15:06] cmjohnson1: well, for proper redundancy, we need to make a braid
[17:15:08] so from c2 to d1
[17:15:11] and from c1 to d2
[17:15:28] because otherwise if c1 or d1 switch dies, everything dies
[17:15:47] it's always setup as a ring
[17:15:58] (although that is difficult to see physically :)
[17:16:05] ahh i see
[17:17:06] cmjohnson1: https://rt.wikimedia.org/Ticket/Attachment/12933/9358/EX4200%20stack%20braid.png
[17:17:15] that's how the 8-rack long rows in eqiad are connected
[17:18:41] hrm
[17:18:46] we could actually buy one stacking cable
[17:18:51] and finish the ring with one fiber
[17:18:55] then that fiber would normally be unused
[17:18:58] only if the ring breaks
[17:19:07] how about that
[17:19:48] yeah I like that
[17:19:54] RobH: order one 5m stacking cable
[17:21:28] they come with the shorter ones right?
[17:21:36] sorry, nm, they come iwth the tttttiny ones
[17:21:37] they come with 50cm
[17:21:38] yeah
[17:21:43] so how will these join in the row?
[17:21:49] dont i need to order more than 1 stacking?
[17:21:57] no
[17:22:03] because we do one end with stacking cable
[17:22:05] and one end with fiber
[17:22:09] best of both worlds
[17:22:18] but how do the siwtches within the row join?
[17:22:26] oh, yeah, you should order those
[17:22:27] 3m cables
[17:22:31] =P
[17:22:34] i hate the 3m
[17:22:36] but for interconnecting the rows you need to order one 5m
[17:22:45] rack to rack its really tight.
[17:22:45] (or you get all 5m, whatever works)
[17:22:51] ok, i prefer that =]
[17:23:01] so 4 5m cables, 3 ex4200, two fiber modules
[17:23:18] yes
[17:23:19] as the 5m will only be able to join c3 to d3
[17:23:22] and 3 extra PSUs
[17:23:23] ok
[17:23:30] yep
[17:23:48] I should send some modules from esams
[17:23:55] all 6 switches in esams have the 2x XFP module
[17:23:58] and we're only using 2 of them
[17:23:59] c3 to d3 will not work for stacking cables
[17:24:05] cmjohnson1: that's fine
[17:24:08] we'll run a fiber for that
[17:25:17] those ex4200s are so damn heavy for switches
[17:25:22] so just need the 3 5m cables then.
[17:25:23] they're like servers
[17:25:25] ?
[17:25:30] RobH: yeah but get 4
[17:25:34] who knows, maybe we can make it
[17:25:39] otherwise we'll use it next time
[17:25:43] good to have
[17:25:43] doesnt hurt ot have a spare on site either.
[17:25:52] it can just sit with the spare ex4200
[17:26:02] yes
[17:26:12] (3) EX4200 with redundant power supplies
[17:26:13] (2) 4xGigE/2x10G SFP+ modules
[17:26:13] (4) 5m stacking cables
[17:26:18] EX4200-48T
[17:26:22] ahh, yes
[17:26:30] (3) EX4200-48T with redundant power supplies
[17:26:33] yep
[17:26:45] http://www.juniper.net/us/en/products-services/switching/ex-series/ex4200/#ordering
[17:28:59] oh, what warranty should i ask for
[17:29:05] i just sent the request without specifying that =P
[17:29:26] whatever we got previously
[17:29:34] we have a spare, so warranty return and only software support is fine
[17:29:39] ok, he will do that i assume, so we will check when it comes back
[17:29:50] i just sent you the latest price list
[17:30:45] food, bbl
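The braid discussion above is standard EX4200 Virtual Chassis practice: dedicated stacking cables form a ring, and where a cable cannot reach (here, closing the ring between rows C and D), a port on the SFP+/uplink module can be converted into a virtual-chassis port and run over fiber instead. A rough sketch of the Junos CLI involved; the pic-slot and port numbers are illustrative assumptions, not values from this log:

    # On each switch whose uplink-module port should close the ring,
    # convert that port into a virtual-chassis port (VCP):
    request virtual-chassis vc-port set pic-slot 1 port 0
    # Then confirm every member joined and the ring is complete:
    show virtual-chassis
    show virtual-chassis vc-port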
[17:37:26] Any update on the state of the ciscos?
[17:37:58] New patchset: Pyoungmeister; "2 more lvs pools for search" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3380
[17:38:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3380
[17:39:37] oh yes, that reminds me, thanks dschoon
[17:39:50] mark: would you like to allocate a subnet for the virt vlan in eqiad
[17:40:08] i was going to work on those for ryan last night, but realized we didnt ask you guys (you or leslie) about that yet
[17:40:46] also ryan is havin gissues getting it to dhcp, so i was going to try these here with a physical console. i think he also wants to have you or leslie check the network to see if you see dhcp requests from that port (virt5) when its attemptee
[17:40:49] attempted.
[17:41:20] <3
[17:41:42] yeah, my understanding is that ryan_lane can't get them to pixiboot
[17:41:47] brb
[17:42:02] correct. having issues getting them to pxe
[17:42:29] * [1]schoolcraftT finds the closest large object and gives [1]schoolcraftT a slap with it
[17:43:43] mark: https://gerrit.wikimedia.org/r/#change,3380 does that look better? if yes, I will go ahead and make dns changes as needed
[17:52:23] why does no one sell an m5 screw shorter than 0.8
[17:52:26] i need a .5
[17:53:24] oh well, the .8 will work, just may need to dremel off part of it
[18:04:56] RobH -- did i miss anything about the ciscos when i relogged?
[18:05:11] nope, asked mark about the subnet but I expect he is busy so i will have to drop a networking ticket
[18:05:25] that being siad, i will be on site here again tomorrow so if its not today its then
[18:05:26] coolio. just making sure
[18:05:30] its getting late in his time zone =]
[18:05:30] sweet.
[18:05:30] np
[18:05:32] thank you.
[18:05:46] we're all totally psyched about getting started on those machines
[18:05:49] quite welcome
[18:12:56] cmjohnson1: So i had to reorder a different screw for grounding the racks
[18:13:09] but now that i have the small qty from lowes (all they had) to size, i am order a ton more
[18:13:10] ok..the others didn't fit
[18:13:20] so i am going to hold off shipping you the SFPs until these come in
[18:13:27] also may wait for the labels, which i am ordering today
[18:13:30] the asset tags that is
[18:13:44] since honestly the asset tags are the ones you need soonest i imagine =]
[18:13:51] ok...not pressing
[18:14:00] also going to send you a bunch of hdd screws
[18:14:05] no...i am good for about another 50 items
[18:14:09] just getting low
[18:14:09] cuz im tired of scrambling for them, so ordered two bags of 100
[18:14:13] ok, cool
[18:14:33] so i expect to send all this stuff to you midweek next week
[18:14:40] basicalyl screws will be in tomorrrow, labels are the hold up.
[18:14:49] trying to stay ahead...also while you were out ...had to replace a hdd on brewster it was the only 1tb drive in the DC...not sure if you want to order a spare
[18:15:18] nah, cuz sometimes we need sas
[18:15:26] ok
[18:15:27] brewster was misc so thats sata
[18:15:40] so easier to just order as needed, plus hdd costs are still kinda high right now
[18:15:40] th
[18:15:52] the floods in asia messed up hdd production
[18:16:03] seems there are only half a dozen factories on earth making the drive motors
[18:16:06] and they all got flooded
[18:19:51] hrmm, ps1-a1-eqiad has failed to respond to observium once since being grounded.
[18:20:04] i think it may have been a combination of the mgmt switch and the ps
[18:20:17] gonna have to ground mgmt switches, which use same screw as servertechs atleast.
[18:20:30] (well, the juniper non mgmt do, gotta check the mgmt)
[18:27:38] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 212 MB (2% inode=61%): /var/lib/ureadahead/debugfs 212 MB (2% inode=61%):
[18:31:34] Ryan_Lane: so you may wanna have chris attempt netboot on the virt5
[18:31:45] since he can attach a physical console to see if it perhpas has output
[18:31:52] i cannot do here since the network isnt setup yet
[18:37:50] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 27 MB (0% inode=61%): /var/lib/ureadahead/debugfs 27 MB (0% inode=61%):
[18:38:08] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 203 MB (2% inode=61%): /var/lib/ureadahead/debugfs 203 MB (2% inode=61%):
[18:38:08] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 263 MB (3% inode=61%): /var/lib/ureadahead/debugfs 263 MB (3% inode=61%):
[18:39:21] New patchset: Asher; "fix digi malaysia ip range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3411
[18:39:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3411
[18:39:52] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3411
[18:39:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3411
[18:40:14] RECOVERY - Disk space on srv222 is OK: DISK OK
[18:40:14] RECOVERY - Disk space on srv220 is OK: DISK OK
[18:43:50] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours
[18:46:14] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 227 MB (3% inode=61%): /var/lib/ureadahead/debugfs 227 MB (3% inode=61%):
[18:48:38] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 163 MB (2% inode=61%): /var/lib/ureadahead/debugfs 163 MB (2% inode=61%):
[18:50:26] RECOVERY - Disk space on srv224 is OK: DISK OK
[18:50:44] RECOVERY - Disk space on srv219 is OK: DISK OK
[19:00:33] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3380
[19:11:53] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[19:37:36] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:48:30] New patchset: Hashar; "remove +x bits from files of /srv/org/mediawiki/integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3433
[19:48:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3433
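On the virt5 PXE trouble mentioned at 17:40 and 18:31 above: one quick check is whether the machine's DHCP request ever reaches the install/DHCP server while netboot is attempted. A minimal sketch; the interface name and the assumption that the install server (e.g. brewster) runs dhcpd are illustrative, not confirmed by this log:

    # On the DHCP/install server, watch for the DHCPDISCOVER while the box tries to PXE
    tcpdump -n -e -i eth0 port 67 or port 68
    # Or see whether dhcpd logged a discover/offer for it
    grep -i 'dhcpdiscover\|dhcpoffer' /var/log/syslog | tail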
[19:49:18] PROBLEM - Host ms1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:50:21] RECOVERY - Disk space on ms1002 is OK: DISK OK
[19:50:30] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[19:50:30] RECOVERY - Host ms1002 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[19:50:48] RECOVERY - RAID on ms1002 is OK: OK: State is Optimal, checked 2 logical device(s)
[19:51:42] RECOVERY - DPKG on ms1002 is OK: All packages OK
[19:58:54] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Thu Mar 22 19:58:20 UTC 2012
[19:59:12] RECOVERY - Host magnesium is UP: PING WARNING - Packet loss = 37%, RTA = 65.15 ms
[20:01:49] !log magnesium goign down and up again, troubleshooting the disks
[20:01:53] Logged the message, RobH
[20:23:24] !rebuilding search1015 and 1016 for disk shuffles
[20:25:34] !log rebuilding search1015 and 1016 for disk shuffles
[20:25:38] Logged the message, and now dispaching a T1000 to your position to terminate you.
[20:26:12] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[20:29:21] RECOVERY - Puppet freshness on magnesium is OK: puppet ran at Thu Mar 22 20:29:10 UTC 2012
[20:32:08] maplebed: what exactly am i supposed to ask dell
[20:32:16] they will ask us to run a hardware test suite which we did
[20:32:30] so its either grounding, which I am trying to solve presently, or its our software, which isnt their problem
[20:32:42] (c2100 issue)
[20:33:03] so they are gonna say 'if the hardware tests dont fail, what would you like us to do?'
[20:33:28] well, what I would say is "we got this new hardware, we know it's new, the hardware test doesn't show anything, but they're crashing. Exactly the same software running on X does'nt crash, so ... umm... find us a solution."
[20:33:34] now, we have had some odd stuff that we think is grounding, and I just ordered a bunch of stuff ot fix it, which when it comes in tomorrow, i will be dropping half in a box to go to tampa
[20:33:42] they wont find us one
[20:33:52] but i am happy to get them to tell us to go pound sand for you ;]
[20:34:05] the whole reason to buy from a company like dell instead of something like silicon mechanics is that we get to say shit like "it's just not working and it's your job to fix it."
[20:34:07] ;)
[20:34:16] whatever.
[20:34:23] they are going to tell you what i said, but i will ask.
[20:34:25] that's what I would say.
[20:34:52] you're probably right, that they'll tell us to stfu,
[20:35:15] but ... yeah, that's all I've got.
[20:35:24] I really have no clue why they are failing
[20:35:37] i am reallllly hoping grounding things properly will eliminate the issue
[20:35:38] =P
[20:36:55] so the logs show no crazy errors post crash?
[20:37:24] i will also pull up the firmware on them and compare to recent releases
[20:37:38] once they are up to date and crash again, then i call dell, cuz they will ask that first ;]
[20:37:47] i will try to get to that tomorrow
[20:38:01] (anyone else who wants to can also update firmware, as it can be done remotely ;)
[20:38:48] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[20:41:07] I haven't been able to find anything in the logs.
[20:41:30] I figure that since it's a new hardware platform, they might be more receptive to 'weird shit's going on' kind s of reports.
[20:42:44] yea, i will ask
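For the C2100 crash hunting above: when the OS logs are clean, the BMC's System Event Log and the firmware revisions are the usual next stop, and both can be pulled remotely over IPMI before calling Dell. A minimal sketch; the management hostname and password file are placeholders, not real values from this log:

    # Read the System Event Log - hardware faults often land here even when syslog is empty
    ipmitool -I lanplus -H mgmt-host.example -U root -f /path/to/passfile sel elist
    # Note the BMC/firmware revision to compare against Dell's current release
    ipmitool -I lanplus -H mgmt-host.example -U root -f /path/to/passfile mc info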
[20:43:09] PROBLEM - SSH on search1015 is CRITICAL: Connection refused
[20:43:18] PROBLEM - DPKG on search1015 is CRITICAL: Connection refused by host
[20:44:57] PROBLEM - RAID on search1015 is CRITICAL: Connection refused by host
[20:50:22] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 326 seconds
[20:50:31] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused
[20:50:40] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 345 seconds
[20:52:20] !log stopping puppet on brewster temporarily
[20:52:24] Logged the message, and now dispaching a T1000 to your position to terminate you.
[20:56:49] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[20:57:07] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[20:59:04] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100%
[20:59:58] RECOVERY - SSH on search1015 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[21:05:13] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[21:10:10] PROBLEM - Disk space on search1016 is CRITICAL: Connection refused by host
[21:10:37] PROBLEM - RAID on search1016 is CRITICAL: Connection refused by host
[21:10:55] PROBLEM - SSH on search1016 is CRITICAL: Connection refused
[21:11:22] PROBLEM - DPKG on search1016 is CRITICAL: Connection refused by host
[21:17:58] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused
[21:28:37] PROBLEM - NTP on search1015 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:29:58] RECOVERY - SSH on search1016 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[21:34:50] RobH: are you around?
[21:34:55] PROBLEM - NTP on search1016 is CRITICAL: NTP CRITICAL: Offset unknown
[21:36:04] hey ops. what's our process for requesting techblog access ? I need phil to get an account
[21:36:16] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[21:37:37] RECOVERY - Disk space on search1016 is OK: DISK OK
[21:37:55] RECOVERY - RAID on search1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[21:38:04] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[21:38:31] RECOVERY - DPKG on search1016 is OK: All packages OK
[21:38:33] bug filed http://rt.wikimedia.org/Ticket/Display.html?id=2688
[21:39:57] tfinc: did they make accounts?
[21:40:08] anyone can make their own account, then an admin just promotes them
[21:40:28] there is no techblog
[21:40:30] there is just the blog.
[21:41:13] RECOVERY - NTP on search1016 is OK: NTP OK: Offset 0.08279848099 secs
[21:41:41] tfinc: also, are these employees? What level of access do you want them to have, the abilty to create posts but an editor needs to publish, or editor where they can edit any blog posting?
[21:42:43] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[21:45:50] tfinc: http://codex.wordpress.org/Roles_and_Capabilities So I need to know if they are going to be contributor (can create drafts but not publish), author (can publish and edit their own posts to the site), editor (can publish,write, and edit ALL posts)
[21:47:22] Editor
[21:47:42] i updated the bug
[21:48:29] ahh, so did it
[21:48:29] i
[21:48:37] so pchange is an author
[21:48:41] he can already post and edit his own
[21:48:49] You want him promoted to be able to do that with other folks posts?
[21:49:05] (i dont like us handing out editor like candy, so thats your call, by default i only promote staff to author
[21:49:34] jdrobson needs to make his own account, we dont create them, we just promote them
[21:49:37] anyone can make their own account
[21:49:39] i'm fine with it
[21:49:49] ok, well, that means they can edit ANYONEs post
[21:50:00] so if you are on record thats fine, but if the other folks get pissed im sending them to you ;]
[21:50:14] ZOMG anyone can edit
[21:50:18] hehe
[21:51:01] yea, but edit actions on wordpress arent all that evident
[21:51:09] i mean, there is no diff of what that person did versus the original poster
[21:51:12] but promoted pchange
[21:51:15] pchang even
[21:51:18] thanks!
[21:51:24] the other dude needs to make an account before its promoted
[21:51:28] sure thing
[21:51:40] we have an insane numbe rof admins
[21:51:45] so you dont need to come to me for this ;]
[21:52:00] Ryan_Lane told me to do it
[21:52:05] so blame him
[21:52:11] Guillaume, Jay, Lianna, matt roth, philippe, ryan lane
[21:52:13] Ryan_Lane: you ass
[21:52:16] you could do this
[21:52:18] ;p
[21:52:23] stacey merrick
[21:52:26] ;0
[21:52:28] tilman bayer
[21:52:28] heh
[21:52:46] tfinc: i dont mind, you may notice that you end up waiting on me is all
[21:52:56] cuz when i am in the datacente,r like now, i tend to ignore non datacenter things.
[21:53:04] but for you i promoted him ;]
[21:53:51] tfinc: i have no idea who promoted all these folks to admin
[21:53:55] its kind of insane and i dont like it
[21:54:03] but meh.
[21:54:17] when one of them breaks it, and i have to fix it, then i shall rain down hell upon their head.
[21:54:32] tfinc: may wanna have him ping me when he makes the account
[21:54:43] cuz i may not notice the RT update, since im in the datacenter all this week and prolly all next.
[21:54:57] and i am not keeping up to the entire queue anymore this week or next =]
[21:55:27] there may also be a list yer supposed to email, i have no idea.
[21:55:44] Guillaume would be the blog interaction expert, i just update the software =]
[21:59:26] RobH: ;)
[22:00:22] i figure if i shoudnt be promoting these folks someone else will yell at me
[22:00:29] ya know, like happened on officewiki a long time ago
[22:00:31] and foundationwiki
[22:00:36] and pretty much everywhere =P
[22:02:11] * RobH stopped posting to the blog after he got yelled at for pushing someone from top posting spot
[22:02:40] there's a calendar for that now.
[22:03:55] tfinc: make sure you pass on that scheduling when to push a blog post live is important and should be scheduled with the other folks that post to the blog.
[22:04:06] to avoid getting yelled at like RobH.
[22:09:26] meh, there wasnt one when i was yelled at
[22:09:31] it was just after the blog merge
[22:09:49] maplebed: any idea where that stuff is documented? links?
[22:10:06] the office wiki.
[22:10:09] pipeline = rewrite healthcheck cache swauth proxy-server
[22:10:15] blah.
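The "pipeline = ..." line at 22:10:09 looks like a stray paste from a Swift proxy configuration: it is the middleware pipeline in proxy-server.conf, naming the WSGI filters a request passes through (URL rewriting, health check, memcache, swauth authentication) before reaching the proxy server itself. Roughly, the surrounding fragment would look like the sketch below; this is a generic Swift layout of that era, assumed rather than copied from the real config:

    [pipeline:main]
    pipeline = rewrite healthcheck cache swauth proxy-server
    # each name above has a matching [filter:<name>] section, plus [app:proxy-server]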
[22:10:38] !log pushing a new zone file to add 2 more search-related vips for eqiad
[22:10:41] Logged the message, and now dispaching a T1000 to your position to terminate you.
[22:12:54] notpeter: everyone should be that paranoid about dns
[22:12:58] you get +1 sir
[22:13:17] still haven't taken down the site!
[22:13:26] wait, ever?
[22:13:40] ever
[22:13:44] we thought I had once
[22:13:49] but my shit was a red herring
[22:14:20] taking down the site now is a tiny bit harder than it used to be
[22:14:23] not a lot harder mind you
[22:14:31] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123
[22:14:41] but the entire push to puppet, review in gerrit does give an abstraction layer
[22:15:07] i suppose we need to get our dns stuff in there eventually.
[22:15:23] I think that having it in svn is fine
[22:16:48] we are ditching svn eventually i thought
[22:16:58] the idea of review in gerrit doesnt suck.
[22:17:13] i mean, svn diff works just fine as well, but less of a public facing audit trail
[22:17:34] i dont know of anything in our zone files that couldnt be on gerrit public either
[22:17:47] yeah
[22:17:58] I think it would be fine to put it into our git repo
[22:18:15] but I don't think that it's a big enough deal to warrant the effort
[22:18:23] hhhmmmm
[22:18:34] although, having dns be easier to revert is good in my mind
[22:19:49] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[22:19:54] !log all 3 dns servers are responding to digs after reload
[22:19:57] Logged the message, and now dispaching a T1000 to your position to terminate you.
[22:20:07] yea its a low priority task.
[22:20:14] +1 to move it to git
[22:20:24] im gonna drop a ticket in coreops for it
[22:20:28] with no assignee
[22:20:33] cool
[22:21:29] it can be another 'puppetize this' task that i will never be left alone long enough to do ;]
[22:21:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3380
[22:21:50] did that have the tone of dripping sarcasm, cuz it was meant to
[22:22:31] put a puppet on it!
[22:24:07] recall when hearing puppet made you think of the muppets and happiness rather than ruby and misery?
[22:24:12] heh
[22:25:04] PROBLEM - SSH on search1015 is CRITICAL: Connection refused
[22:25:37] I'm actually ok with puppet. it's nice to be able to add a couple of lines of code when you need a new service
[22:26:16] oh god, what did I just say? have the ruby brainworms gotten to me?
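On the zone push logged at 22:10 and confirmed at 22:19 above: after reloading, the quick sanity check is to query each authoritative server directly for one of the new records and confirm they all answer. A minimal sketch; the nameserver hostnames are placeholders, and only the record name is taken from the alerts later in this log:

    # Ask each authoritative server directly for one of the new search VIP records
    for ns in ns0.example ns1.example ns2.example; do
        dig @"$ns" search-pool4.svc.eqiad.wmnet A +short
    done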
[22:34:49] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[22:35:25] notpeter: time to old yeller ya in the corn crib, there is not coming back now
[22:35:33] once you start liking puppet, you are lost \
[23:01:13] RECOVERY - Host db1020 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[23:05:34] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: Connection refused by host
[23:05:43] PROBLEM - SSH on db1020 is CRITICAL: Connection refused
[23:06:02] PROBLEM - DPKG on db1020 is CRITICAL: Connection refused by host
[23:06:10] PROBLEM - mysqld processes on db1020 is CRITICAL: Connection refused by host
[23:06:12] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: Connection refused by host
[23:06:28] PROBLEM - Disk space on db1020 is CRITICAL: Connection refused by host
[23:06:28] PROBLEM - MySQL Slave Running on db1020 is CRITICAL: Connection refused by host
[23:06:55] PROBLEM - Full LVS Snapshot on db1020 is CRITICAL: Connection refused by host
[23:07:13] PROBLEM - MySQL disk space on db1020 is CRITICAL: Connection refused by host
[23:07:13] PROBLEM - MySQL Idle Transactions on db1020 is CRITICAL: Connection refused by host
[23:07:31] PROBLEM - RAID on db1020 is CRITICAL: Connection refused by host
[23:07:31] PROBLEM - MySQL Recent Restart on db1020 is CRITICAL: Connection refused by host
[23:13:13] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[23:17:22] Ok, cleaning up eqiad, headed home.
[23:18:21] !log db1020 firmware still updating, will check on it later tonight. offline until then
[23:18:24] Logged the message, RobH
[23:32:25] PROBLEM - NTP on db1020 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:38:52] RECOVERY - SSH on search1015 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[23:49:52] RECOVERY - Disk space on search1015 is OK: DISK OK
[23:50:46] PROBLEM - Host search-pool4.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.14)
[23:50:55] RECOVERY - RAID on search1015 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[23:51:04] PROBLEM - Host search-pool4.svc.pmtpa.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.1.14)
[23:51:13] PROBLEM - Host search-pool1.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[23:51:22] PROBLEM - Host search-prefix.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.15)
[23:51:40] RECOVERY - DPKG on search1015 is OK: All packages OK
[23:51:49] PROBLEM - Host search-pool2.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[23:51:58] PROBLEM - Host search-prefix.svc.pmtpa.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.1.15)
[23:52:07] PROBLEM - Host search-pool3.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[23:54:06] damnit
[23:54:11] sorry people
[23:54:15] those aren't live yet
[23:54:55] notpeter: would you page everyone saying that?
[23:55:19] (for those who might be near sleep and yet concerned)
[23:56:09] maplebed: what's the script on spence?
[23:56:18] err... page-something?
[23:56:27] page-all, probably.
[23:56:30] page-all?> page_all?
[23:56:37] either.
[23:56:41] (they're symlinks)
[23:58:18] done
[23:59:08] recieved. thanks!
[23:59:24] yep! thank you for reminding me