[00:02:31] try doubling the threads … for 10 minutes ;-) [00:02:50] test ur assumption [00:02:57] k. [00:05:26] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2903 [00:05:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2903 [00:28:46] New patchset: Ryan Lane; "Seems wikidev *is* required" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2904 [00:36:28] woosters: you were right; doubling the threads didn't double throughput. but it also only bumped up usage on ms5 by about half as much as before, so I think it's fine to run it at 3x speed. [00:36:40] it'll help too that I'll be running against 3 different containers instead of all against the same container. [00:37:41] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2904 [00:37:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2904 [00:39:59] are u going to keep running more tests?
seems okay ;-) [00:44:51] New patchset: Ryan Lane; "Add a way to avoid installing apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2905 [00:45:19] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2905 [00:45:22] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2905 [00:48:48] New patchset: Ryan Lane; "Removing non-ldap accounts from formey and manganese" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2906 [00:50:53] New patchset: Ryan Lane; "Checking for no-apache gerrit config properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2907 [00:52:03] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2906 [00:52:06] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2906 [00:52:24] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2907 [00:52:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2907 [00:56:22] New patchset: Ryan Lane; "I hate our apache manifests, so, so much." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2908 [00:58:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2908 [00:59:04] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2908 [00:59:06] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2908 [02:15:32] !log vlan tagged virt5's eth0 and eth1 ports on csw1-sdtpa [02:15:35] Logged the message, Master [05:46:03] !log started swift deletion run on owa1, 2, and 3. 
[05:46:07] Logged the message, Master [06:05:15] New patchset: Bhartshorne; "changed thresholds, caught io exception on missing file." [operations/software] (master) - https://gerrit.wikimedia.org/r/2911 [06:06:35] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2911 [06:06:38] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/2911 [09:04:42] New review: Hashar; "Patchset 3 was a test to try to get Gerrit to mark the change verified." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2682 [13:54:18] !log Pooled new eqiad bits servers strontium and palladium [13:54:21] Logged the message, Master [14:07:41] !log pmtpa/sdtpa management network went down [14:07:43] Logged the message, Master [14:14:49] !log mr1-pmtpa rebooted/lost power for some reason [14:14:51] Logged the message, Master [14:16:28] !log csw5-pmtpa: Mar 1 14:01:42:A:Power Supply 2 , 2nd from left, bad [14:16:31] Logged the message, Master [14:26:34] !log Moving bits traffic back from pmtpa to eqiad [14:26:37] Logged the message, Master [14:54:21] !log strontium server rebooting to set HT to enabled [14:54:24] Logged the message, RobH [15:25:33] does anyone have time to look at pl planet? It isn't updating. bz with rt link: https://bugzilla.wikimedia.org/34268 [17:13:35] !log Removed 4.8GB /tmp/gmond.log on db1008. Tried to resist urge to make snarky comment about ganglia but failed. 
[17:13:38] Logged the message, Master [17:14:35] hahaha [17:24:55] !log Removed 5.3GB /tmp/gmond.log on db1017 [17:24:57] Logged the message, Master [17:25:32] !log Removed 5.3GB /tmp/gmond.log on db1018 [17:25:35] Logged the message, Master [17:35:30] !log Removed >5GB /tmp/gmond.log on db11 [17:35:32] Logged the message, Master [17:36:06] !log Removed >5GB /tmp/gmond.log on db13 [17:36:08] Logged the message, Master [17:39:52] !log Removed >5GB /tmp/gmond.log on db25, db32, db33, db37 [17:39:54] Logged the message, Master [17:43:38] New patchset: Bhartshorne; "telling the mysql gmond module to stop spewing logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2912 [17:44:02] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2912 [17:44:05] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2912 [17:44:54] can someone kick pl planet for me? [17:44:55] https://rt.wikimedia.org/Ticket/Display.html?id=2416 [17:47:13] I'd like to get it fixed before the ticket gets more than a month old [17:47:33] woosters: https://rt.wikimedia.org/Ticket/Display.html?id=2416 ?? please? [17:48:51] hexmode - will get back to u soon.. [17:50:58] hexmode - i can get to it [17:51:03] what's the problem? [17:51:45] http://pl.planet.wikimedia.org/ is stuck on last november [17:52:42] woosters: the "polska wikipedia" site is a facebook feed that updated ~8 hours ago, though [17:53:08] will have mutante look into it when he gets in later this afternoon [19:39:35] robh: the new disk for labstore 1 came today...okay to bring it down and replace? [19:39:49] lemme see [19:40:08] cmjohnson1: this is in the chassis or the disk shelf portion? 
[19:40:14] it shouldnt have to come down at all these are hot swap [19:40:25] okay..it is chassis [19:40:26] Ryan_Lane: ping [19:40:35] 100% packet loss [19:41:15] i think i should bring it down...that way i can bring up raid bios [19:41:27] cmjohnson1: if its doing things thats not a good option. [19:41:39] we can query the raid directly via the software easily enough [19:41:47] i don't think it is in service but check...okay [19:41:53] if the host is online, scheduling downtime like that is not good practice [19:43:49] yes, fine to bring down [19:44:55] hrmm [19:44:59] i see one disk is in rebuild [19:45:04] does it just never clear rebuild? [19:45:52] cmjohnson1: so that works, but lets do this the 'uptime' way [19:45:54] correct [19:45:57] its good practice for when it cannot come down [19:45:58] it never rebuilds [19:46:02] it also goes missing [19:46:17] if we bork it up doing it the low impact way, we can always take it offline [19:46:33] okay...works for me [19:46:51] so, the raid software on the dells is megacli, this is just fyi [19:46:58] since without root you cannot run it yet, but eventually you will [19:47:11] in that, it starts all groups with 0 as the first member [19:47:13] !log restarted memcached on virt0 [19:47:15] Logged the message, Master [19:47:20] so the software sees disks 0-11 in the chassis [19:47:31] outside the software they are numbered 1-12 [19:47:44] so you just have to know about it, cuz any software output from the OS will be 0-11 [19:47:55] but any dell mgmt software may call it 1-12 [19:48:06] confusing [19:48:10] but ok [19:48:27] so the raid controller software shows me a few things, I am going to pull up an etherpad and paste output for you [19:48:29] lesseeeee [19:51:53] cmjohnson1: hahahha, so it had the bad disk showing as rebuilding [19:51:57] but now it doesnt show it at all [19:52:08] so it's missing [19:52:12] that's what I was saying ;) [19:52:24] yea and for some reason this controller wont feed me back the event
log [19:52:26] right...ryan has been saying it bounces between the 2 [19:52:28] which is odd. [19:53:13] RobH: we can't get properly operating storage hardware [19:53:15] oh well, this is the fun part where we make sure the adaptec matches up numbering wise [19:53:22] Ryan_Lane: how so? [19:53:25] dead disks happen. [19:53:32] heh. I'm just kidding [19:53:35] cmjohnson1: its hot swap, so we know the software doesnt see disk 2 [19:53:36] there's something with the eqiad netapp too [19:53:39] it's like dataset1 all over again! [19:53:40] so thats disk 3 in the 1-12 model [19:53:40] it went offline completely I believe [19:53:45] I haven't looked at it yet [19:53:50] even netapp ;-) [19:53:55] cmjohnson1: so pull out disk 3 and lets see if it crashes, admin log it before you pull it [19:54:13] i admit to being slightly less careful now than i would be if this was an in-production server. [19:54:24] the fact its not letting me grep the controller log is alarming. [19:54:38] !log removing disk 3 from labstore1 chassis [19:54:41] Logged the message, Master [19:54:42] (we will reboot and such in a bit to see if we can fix it, but one thing at a time) [19:54:58] well, it's not a bad test to see if pulling a disk live will crash things ;) [19:55:05] especially on a box with 24 disks [19:55:15] it shouldnt since its raid60 [19:55:23] or 6, i dont recall [19:55:34] two raid 6 in an lvm [19:56:32] cmjohnson1: also log when you put in the replacement disk [19:56:37] that way i know to see if its borked [19:56:39] =] [19:57:50] mark: mind seeing if I did the vlan tagging properly for virt5?
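The numbering mismatch described around 19:47 (software output counts disks 0-11, chassis labels read 1-12) is an easy place to pull the wrong drive. A minimal sketch of the conversion, assuming the 12-bay chassis mentioned in the chat; the function names are hypothetical and are not part of MegaCli or any Dell tool.

```python
BAYS = 12  # per the chat: software sees disks 0-11, chassis labels read 1-12


def software_slot_to_label(slot: int, bays: int = BAYS) -> int:
    """Convert a zero-based slot number from OS/controller output to the one-based bay label."""
    if not 0 <= slot < bays:
        raise ValueError(f"slot {slot} is outside 0..{bays - 1}")
    return slot + 1


def label_to_software_slot(label: int, bays: int = BAYS) -> int:
    """Convert a one-based bay label on the chassis to the zero-based software slot."""
    if not 1 <= label <= bays:
        raise ValueError(f"label {label} is outside 1..{bays}")
    return label - 1
```

With this mapping, the missing software disk 2 corresponds to bay label 3, which matches the disk pulled at 19:54:38.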
[19:57:54] hrmm [19:57:58] !log replaced disk 3 labstore1 chassis [19:58:02] Logged the message, Master [19:58:14] robh: it may go to being in a foreign state first [19:58:20] I never took any foundry classes, after all :D [19:58:22] you need to clear that [19:58:29] i dont like how this is responding to commands, i dont see the h800 controller like i expect [19:58:37] i need to compare on another system real quick [20:00:43] ok, its just new way it works i guess [20:00:50] the older models didnt, but i see dataset1001 is the same [20:01:32] pasted in etherpad the output for you cmjohnson1 [20:01:36] its now rebuilding the new disk [20:01:51] the dell h700s default to force online and force rebuild on disks [20:02:03] we dont change that behavior, but its always good to confirm its doing it [20:02:30] Ryan_Lane: looks ok, does it work? ;) [20:02:35] ok..see the rebuild....but we need to monitor this to see if it works [20:02:39] I haven't tried yet [20:02:45] glad to see it looks right, though :) [20:02:53] because it was previously going through rebuild and missing [20:03:10] I just wanted a check of my work, since I don't know foundry well [20:03:30] Ryan_Lane: actually, it's wrong [20:03:36] damn [20:03:37] vlan 105 name virt-hosts [20:03:37] untagged ethe 11/25 to 11/28 [20:03:37] tagged ethe 14/3 [20:03:37] + tagged ethe 15/11 [20:03:38] tagged ethe 16/3 to 16/4 [20:03:41] that should be untagged [20:03:48] cmjohnson1: normally a server is in nagios and we would monitor to see it clear there [20:03:55] why untagged? [20:03:56] however, this one isnt yet in nagios, cuz its not in service [20:04:07] RobH: what's the username for the cisco's management? admin? [20:04:09] Ryan_Lane: virt5 eth0 is not an 802.1Q link [20:04:14] admin [20:04:15] Ryan_Lane: yep, admin [20:04:16] it's just on that one vlan [20:04:37] if you look at virt1-4, they're untagged too [20:04:43] oh. right.
we do tagging on the host [20:04:46] only links to other routers/switches are tagged, for that vlan [20:05:40] well, that makes sense [20:07:45] back in a bit. metrics meeting is over [20:13:25] cmjohnson1: root@labstore1:~# MegaCli -pdrbld -showprog -physdrv[32:2] -aALL [20:13:25] [20:13:26] Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 2% in 13 Minutes. [20:13:41] its going to be a long time [20:13:53] i would expect a couple hours at this rate [20:13:58] 3% in 15 minutes. [20:14:10] yep..maybe more than that [20:14:16] indeed. [20:14:32] but at least we are getting progress...if i recall i was not getting any progress on that drive before [20:14:36] apergos: that command i pasted isnt in the lsi cheatsheet [20:14:41] you may wanna copy it down, its useful [20:15:01] well, the other one failed out after trying, so the fact this is 15% in is reassuring [20:15:12] sorry, its not near that [20:15:15] thats 15 minutes, blehhhh [20:15:18] who knows. [20:17:15] thanks rob [20:17:29] apergos and robh: what is the story with dataset1 [20:17:35] is the drama over? [20:17:38] just a sec [20:17:39] yep [20:17:45] the chassis is officially allowed to die. [20:17:51] we are going to use the disk shelf on another system [20:18:00] but not the raid controller in the chassis [20:18:12] cool [20:18:16] more than likely we will slap a dell with an h800 to talk to the disk shelf. [20:18:24] http://wikitech.wikimedia.org/view/Dataset1#Feb_2012 [20:18:28] make it dataset3 [20:18:29] that is the story. [20:18:34] like a r510 [20:18:47] so is it cool to power down? [20:18:52] please. [20:18:56] you can pull the main chassis out of rack [20:18:59] with prejudice and stuff [20:19:03] but may as well leave the disk shelf alone [20:19:06] i would but i have zero places to put it [20:19:14] pull the disks for the cabinet [20:19:19] then the rest can go in the goddamned trash.
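The rebuild-progress line pasted at 20:13:26 gives a percentage and an elapsed time, which is enough for a rough completion estimate. A minimal sketch, assuming the output format shown in the chat and assuming the rebuild proceeds linearly (a simplification, since early percentages are coarse); the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the MegaCli line quoted in the chat, e.g.
# "Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 2% in 13 Minutes."
PROGRESS_RE = re.compile(
    r"Enclosure (\d+), Slot (\d+) Completed (\d+)% in (\d+) Minutes"
)


def parse_rebuild_progress(line: str) -> Optional[Tuple[int, int, int, int]]:
    """Return (enclosure, slot, percent, elapsed_minutes), or None if the line does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    enclosure, slot, percent, minutes = (int(g) for g in match.groups())
    return enclosure, slot, percent, minutes


def estimate_remaining_minutes(percent: int, elapsed_minutes: int) -> Optional[float]:
    """Linear extrapolation of the time left; None when no progress has been reported yet."""
    if percent <= 0:
        return None
    return elapsed_minutes * (100 - percent) / percent
```

Because the controller reports whole percentages, the estimate swings widely early in a rebuild; the two readings in the chat (2% in 13 minutes, 3% in 15 minutes) already give quite different extrapolations.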
[20:19:30] i dont want the parts from that horrible machine in other hosts [20:19:34] since none of them were reliable [20:19:34] heh :] [20:19:52] give the cpus to your kids for keychains ;] [20:20:09] we cannot even donate this to anyone, its non functional. [20:20:25] nope..i have several non-functional machines to unload [20:20:32] hahaha [20:21:01] cmjohnson1: So on that you have a couple of options. [20:21:08] and it is entirely up to you however you want to handle it [20:21:17] you can stack them neatly and drop a ticket for EQ to dispose of them [20:21:29] you can take them to disposal on the clock and wmf reimburses you [20:21:44] or you can just take them off the clock and get paid for the trash metal, which rich used to do [20:21:54] but he lived out in the country [20:21:59] had to pass county dump daily to go home [20:22:06] i think that last one is not cost effective. [20:22:08] yeah...not my thing! [20:22:19] i personally would pick option 1 [20:22:45] it would be the simplest and most effective [20:22:47] also ask miguel where best place to stack so they charge us minimal hours [20:23:24] the alternative is you somehow get them to the basement and into the trash bins which are sometimes not accessible to customers like us [20:23:34] so its not really feasible to expect you to get rid of them easily [20:24:43] i will ask miquel [21:00:35] RobH: do we have documentation on how to use these ciscso? [21:00:38] ciscos? [21:00:48] oh did they show up? [21:00:54] well, I have one for labs [21:00:56] I mean the one for the new instances [21:00:58] awesome! [21:01:00] yeah [21:01:07] I'm going to install it and add to the cluster [21:01:11] coooool [21:02:11] well, it's no worse than drac, from what I can tell. heh [21:02:40] it has autocomplete and help via ? 
[21:04:39] cisco standard [21:07:59] ipmi would be nice [21:08:26] all I want to do is pxe boot :( [21:08:36] really, I'd like a browsable bios [21:09:14] I need to set hyperthreading on too [21:09:21] christ the timeout is low [21:10:48] oh, of course [21:11:02] when you go into the mode you can actually change things, the autocompletion goes away [21:11:06] so does the ? help [21:11:24] wow. tab just made the window freeze [21:11:28] I'm over it [21:11:40] wow. backspace doesn't work [21:12:52] I kind of want to stab this system [21:13:19] ouch [21:13:23] um. let's see [21:13:30] control-h [21:13:34] maybe? [21:13:45] aaaaaaaaaarrrrrrrrgggghhhh [21:13:51] backspace works, but it's delete [21:13:55] arrow keys work [21:13:56] heh [21:15:25] oohhhh ok [21:15:40] you need to change scope before you can set things in it [21:15:51] \o/ [21:15:58] scope bios [21:16:02] set ? [21:16:07] set boot-order pxe [21:16:11] top [21:16:20] scope ? [21:18:29] scope advances [21:18:34] scope advanced* [21:19:05] ooooooooooooooooo [21:19:11] these have VT and VYD! [21:19:12] err [21:19:13] VTD [21:20:05] these systems are pretty awesome [21:20:33] Ryan_Lane: some yes [21:20:40] sorry, missed ping, had music too loud [21:20:48] and I now also hate the cli less [21:20:53] now that I figured it out [21:20:57] it's better than drac [21:21:14] yea but i have not found a way to turn on ipmi [21:21:27] i thought we had a wikitech page on cisco =P [21:21:29] yeah, that would be nice [21:21:36] cmjohnson1: yay! [21:21:42] yea, cuz then we can just use the ipmi script to do all mgmt tasks [21:21:58] ?? [21:22:11] row c? [21:22:12] Ryan_Lane: So can you fill out a wikitech page with the info you have? [21:22:17] RobH: it mentions it in the book [21:22:31] IPMI Access Profile [21:22:44] email me book? [21:22:51] or link/ [21:22:53] ? 
[21:22:58] http://www.cisco.com/en/US/docs/unified_computing/ucs/sw/cli/config/guide/1.4/b_UCSM_CLI_Configuration_Guide_1_4.html [21:23:16] "You must include this policy in a service profile and that service profile must be associated with a server for it to take effect." [21:23:38] "Configuring IPMI Access Profiles" [21:23:51] it has all the instructions right there [21:24:01] RobH: I don't think my notes will be very helpful [21:25:06] well if you already have the examples of how to powercycle and such i mean. [21:25:51] im on virt1001 [21:26:17] I don't [21:26:36] ok, well, i will be rebootin virt1001 and such [21:26:42] to document this stuff [21:26:57] I'm on virt5 [21:27:05] I'm not doing anything with the eqiad ones right now [21:30:27] bah [21:30:38] you have to kill the network connection to leave the serial over lan [21:32:40] do you have to enter some server group to reboot? [21:32:49] i have no clue [21:32:54] the instructions seem to assume using some central policy [21:32:57] how did you reboot? [21:33:08] (i used webgui before) [21:33:10] found it [21:33:17] scope chassis [21:33:17] ok, what was the command? 
[21:33:32] power hard-reset [21:33:33] or [21:33:35] i just want you to put the reboot, use pxe, etc commands on wikitech =p [21:33:36] power cycle [21:33:47] but since you wont im gonna make you put them in irc ;p [21:33:54] :-D [21:34:04] make a page for me :) [21:35:37] having one unified page for all server management stuff would be awesome [21:35:52] christ these things take ages to boto [21:35:53] *boot [21:36:01] of course, it has 256GB of memory [21:36:06] we have the platform documentation on wikitech [21:36:08] so I guess I can understand [21:36:11] it's all there [21:36:15] I know [21:36:23] Ryan_Lane: http://wikitech.wikimedia.org/view/Cisco_UCS_C250_M1 [21:36:24] I'd like to have a page that says "here's how you reboot" [21:36:28] it has that [21:36:31] and it'll show all different platforms together [21:36:41] how would that be any different [21:36:49] if this can take ipmi commands then we can standardize them to a root user for them [21:36:59] and then the single script will handle all servers [21:37:04] so folks wont have to know this stuff [21:37:07] indeed [21:37:19] it's a pain to remember "oh, right, this is drac4, not drac 5. What systems is that for?" then find the page [21:37:20] but for now just toss all those common commands on there, even if you dont explain each one [21:37:28] cuz the syntax alone gives it away at least [21:37:34] oh, right, this is sun, where's the page for that? [21:37:38] how would it be different if it were on one page?
you'd still need to know [21:37:46] you browse to platform docs, then server type [21:37:47] then it's right there [21:37:51] would be no different [21:37:51] http://wikitech.wikimedia.org/view/Platform-specific_documentation [21:38:09] and racktables tells you what model server it is [21:38:14] right now I search for drac [21:38:15] or facter [21:38:18] or I search for sun [21:38:20] then you're doing it wrong :P [21:38:28] yea i dont use mediawiki search man [21:38:29] our docs are laid out like shit [21:38:32] just go to platform docs [21:38:34] that is laid out fine [21:38:42] do these systems have hardware raid? [21:38:59] I do not recall [21:39:12] would have to watch it boot and see if you see raid options [21:39:14] these only have 196608MB of memory [21:39:26] so little. [21:39:26] I feel robbed of some ram [21:39:29] there is an overview page, there is a category [21:39:44] what do you want, clippy on the page to guide you to it? :P [21:39:48] yes [21:39:50] "hello, which server do you want to reboot today?" [21:40:04] I want it to tap on my monitor too [21:40:16] well, it surely isn't PXE booting [21:40:19] huh, ok the cisco is slick [21:40:19] wait [21:40:26] can it even pxe from this vlan? [21:40:28] not only will it allow command line and webgui change of all mgmt [21:40:30] but also all bios [21:40:33] it should be able to, right? [21:40:36] on the fly while the system is running [21:41:51] hey all [21:42:01] hm. how can I find the MAC? [21:42:08] i'm going to be putting in a hardware request for the sms/ussd project [21:42:18] whats our standard system hardware now? 
[21:42:22] our most generic hardware build [21:42:42] tfinc: a misc server [21:42:47] Ryan_Lane: i can tell on webgui, still finding the info for command line [21:42:52] though i have ipmi working now on virt1001 [21:42:55] just put your needs in a ticket, we'll find a match [21:43:20] tfinc: indeed, we have a number of varying builds, if you can put a procurement ticket in i can review what you want versus what we have [21:44:44] show command depends on scope too [21:48:28]
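The Cisco management CLI commands scattered through the chat between 21:15 and 21:33 can be gathered into one lookup, in the spirit of the "here's how you reboot" page that was asked for. A minimal sketch: the command strings are copied exactly as typed in the chat and have not been checked against Cisco's documentation; the task names and the helper function are hypothetical, and whether any further confirmation step is needed is not stated in the log.

```python
from typing import Dict, List

# Command sequences exactly as typed in the chat (scope first, then the action).
# Task names on the left are made up for this sketch.
CISCO_CLI_TASKS: Dict[str, List[str]] = {
    "pxe_boot": ["scope bios", "set boot-order pxe"],
    "power_cycle": ["scope chassis", "power cycle"],
    "hard_reset": ["scope chassis", "power hard-reset"],
}


def commands_for(task: str) -> List[str]:
    """Return a copy of the command sequence recorded for a task name."""
    if task not in CISCO_CLI_TASKS:
        known = ", ".join(sorted(CISCO_CLI_TASKS))
        raise ValueError(f"unknown task {task!r}; known tasks: {known}")
    return list(CISCO_CLI_TASKS[task])
```

Each sequence starts with a scope change because, as noted at 21:15:40, settings can only be changed after entering the scope that owns them.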