[00:04:00] maplebed: i think so, but not 100% sure, i have no reason to think he wouldnt be
[00:04:12] that's good enough for me.
[00:04:20] I have ms-be2,3,4 currently waiting at the bios screen
[00:04:25] doing ms-be5 now.
[00:04:31] I'll send him email.
[00:56:59] maplebed: commonswiki: bug 34440 InvalidResponseException trying to list_objects. Message `Invalid response (0): Unexpected HTTP response code: 503`; Site: `wikipedia` Lang: `commons` ThumbRel: `5/5a/Igor_Utkin_17.jpg/`
[00:57:08] there is a modest number of such things in the last few days
[00:59:23] AaronSchulz: where is that line from? a log somewhere? who (mediawiki?) was making the network call, and do you know what it was trying to accomplish?
[01:00:02] maplebed: from wmf-config/swift.php in wikipedia/logs
[01:00:08] this is for thumbnail purging
[01:00:49] the screenshot in that bug says 'Container has no objects' as the error message - I thought you patched that.
[01:01:47] yeah I don't know why that's still in the log entries
[01:02:14] unless it's hashar's comment "I have added a wfDebugLog() call on production (local revision number is 2973)."
[01:02:19] and that's what's showing up...
[01:04:02] anyway, I wonder why Swift would respond with 503
[01:04:56] maplebed: are there any internal logs for swift that show failures?
[01:05:01] I think there are a bunch of conditions under which swift responds with a 503
[01:05:18] IIRC I could get log lines to show up in syslog, yeah.
[01:05:27] or /var/log/messages.
[01:05:34] but only when I specifically said to log stuff
[01:05:40] I think by default not much is logged.
[01:06:30] hm.
[01:06:43] it's clearly not the case that the commons buckets are empty though,
[01:06:56] so this is likely not the same as the 'Container has no objects' thing unless that was a misleading error message.
[01:07:39] AaronSchulz: is it possible to log more stuff when that error is triggered in cloudfiles?
[01:07:56] maybe the content of the 503 page returned?
[01:09:45] not sure how easy that would be
[01:09:49] * AaronSchulz was looking
[01:10:49] I found a matching 503 line in ms-fe2's log for one of them - it's interesting that the query took 30.0041s to execute. Sounds like a timeout to me.
[01:14:41] * AaronSchulz looks at https://graphite.wikimedia.org/dashboard/temporary-4
[01:24:50] maplebed: crazy :)
[01:24:56] ?
[01:25:27] what's crazy?
[01:25:38] the tp50 graph
[01:25:47] * AaronSchulz wonders if that link works
[01:25:59] shit. I can't log in to graphite.
[01:27:44] AaronSchulz: would you screenshot it for me?
[01:28:06] binasher: tp999 in the default list is not very useful, I'd prefer a tp90 or something, heh
[01:30:05] maplebed: i can get into graphite... tried labs creds?
[01:32:30] yup.
[01:32:33] oh wait...
[01:32:56] wrong username.
[01:33:00] ::sigh::
[01:36:02] it uses the same username as labsconsole
[01:36:07] username/password, that is
[01:37:20] AaronSchulz: just before one of the 503 errors: proxy-server ERROR with Container server 10.0.0.249:6001/sdt1 re: Trying to GET /v1/AUTH_abcd/wikipedia-commons-local-thumb.8c: Timeout (10s) (client_ip: 10.0.2.212)
[01:37:24] AaronSchulz: yeah, i can replace tp999 with a tp90
[01:38:53] there were some cases where seeing 99.9% was useful but probably not many..
[01:39:05] it can still be done manually right?
[01:39:31] the graph commands are just freetext
[01:39:53] AaronSchulz: my suggestion - when that error occurs, retry once.
[01:39:54] :P
[01:39:56] gah, doesn't work
[01:40:11] binasher: hmm, maybe you can keep it but add a 90
[01:40:24] "No Data" fail :/
[01:48:10] you can calculate nth % in graphite but that doesn't provide the real thing on pre-aggregated data. i'll see if i really care about 999, i could also replace tp50 with a tp90. its already updating 39k graphs/minute so i'm hesitant to add another per metric, which would up it to 52k. maybe i should steal another server! it supports sharding.
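The point about percentiles on pre-aggregated data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the latencies are made up, the one-minute buckets and per-bucket means stand in for whatever aggregation the metrics pipeline actually does, and `percentile` is a plain nearest-rank helper, not a graphite function.

```python
from statistics import fmean


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (hypothetical helper)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100) in integer math
    return ordered[rank - 1]


# Made-up data: 10 one-minute buckets of 100 requests each.
# Every request takes 0.1s, except 20 requests in one bucket that hit a 30s timeout.
buckets = [[0.1] * 100 for _ in range(10)]
buckets[3] = [0.1] * 80 + [30.0] * 20

# What a collector with access to every sample would see.
raw_samples = [latency for bucket in buckets for latency in bucket]

# What is left once each minute has been reduced to a single mean.
per_minute_means = [fmean(bucket) for bucket in buckets]

true_tp99 = percentile(raw_samples, 99)
approx_tp99 = percentile(per_minute_means, 99)

print(f"tp99 from raw samples:      {true_tp99:.2f}s")
print(f"tp99 from per-minute means: {approx_tp99:.2f}s")
```

With this data the raw samples give a tp99 of 30.00s, while the same calculation over the per-minute means gives 6.08s: the timeouts are averaged away before the percentile is taken, which is why a percentile computed after aggregation is not "the real thing".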
[12:44:32] New review: Demon; "I disagree that we should allow everything but just disallow viewvc. I know that was the original pu..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2888
[12:46:49] New review: Mark Bergsma; "Please stop using these global, unqualified variables. It's dirty and will break in the next version..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2902
[12:54:32] New review: Mark Bergsma; "Please replace this by the generic rsync classes we already have. No point in redoing this every tim..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2879
[13:05:04] New review: Mark Bergsma; "Please fix indentation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2815
[13:07:40] New review: Mark Bergsma; "Please fix indentation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2819
[13:10:28] New review: Mark Bergsma; "Please fix modes, normal files should be 0400, 0440 or 0444 (read-only)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2838
[13:12:47] New patchset: Hashar; "Bug 28469 - Make SVN Documentation be indexed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2888
[13:13:57] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2888
[13:17:13] New review: Mark Bergsma; "Thinking ahead, can you replace that by a recursive definition for the docroot directory instead? So..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2888
[13:20:40] New review: Mark Bergsma; "rt-mailgate doesn't support https..." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/2446
[13:22:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2607
[13:22:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2608
[13:22:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2607
[13:28:14] New review: Mark Bergsma; "Please get your indentation right" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2839
[15:18:22] !log backing up wikitech in hopes of upgrading some of its software
[15:18:41] if wikitech starts to be odd, its cuz im maxing out the vm its on.
[15:19:45] Logged the message, RobH
[15:20:02] !log shutdown frdev offsite vm per email to engineering last week
[15:20:06] Logged the message, RobH
[15:22:29] bah, it broke
[15:22:38] seems i need to lock it to pull it
[15:38:23] bleh, wikitech isnt in lvm, but the linode is only 60% space used
[15:38:33] we must have gotten an upgrade, now i kinda wanna redo that entire system.
[15:39:32] i had to wipe out a ton of old stuff on it to get room to dump the database as it is.
[15:41:00] !log Converted Toolserver switch interfaces on csw2-esams to pure routed-only mode
[15:41:06] Logged the message, Master
[15:51:26] I broke wikitech, fixing it now
[15:51:34] stupid update script failed out, rolling it back
[15:54:01] restoring db backup of wikitech
[16:05:58] !log wikitech outage resolved
[16:06:04] Logged the message, RobH
[16:06:04] whew.
[16:08:51] hi chris
[16:09:23] cmjohnson1: I think that ben/maplebed was asking about you.
[16:09:47] He plans to do some work on the c2100 series servers, having you on site makes him feel a bit better since these are new servers
[16:10:00] (but its early yet on west coast)
[16:10:08] i have some other work possibly ;)
[16:10:14] I'd like to clear out row B in pmtpa
[16:10:57] huzzaaaaah
[16:11:01] death to old racks!
[16:15:17] new pmtpa row C racks are 42U right?
[16:16:35] * mark cleans out the old racks in racktables
[16:16:37] one by one...
[16:18:18] robh: yes...i am working on what he needed
[16:18:30] coolness
[16:19:53] how come sq81 is listed in rack C2 pmtpa?
[16:21:53] no clue
[16:21:59] mark: i must've not moved it...there is a hole in d3 where it is located....i will update
[16:23:23] thanks
[16:23:32] ok I've updated the racktables layout to reflect the row C changes
[16:24:07] * mark adds a new eqiad row C as well
[16:24:13] RobH: 47RU right?
[16:24:48] i think only 42
[16:24:53] since it was 20amp not 30
[16:25:01] cmjohnson1: ^?
[16:25:03] eqiad.
[16:25:09] ohhh, sorry, yes
[16:25:26] good
[16:25:37] eqiad looks impressive in racktables rackspace view now
[16:25:40] so if we want to have a rull row d
[16:25:45] row d even
[16:25:45] as opposed to tampa, which looks very shitty with all the 3-rack rows
[16:25:51] we may need to expand the cage sooner than later
[16:26:01] as the facility is filling up, and when i asked omid about it
[16:26:08] he said they have no notes about us reserving further space
[16:26:10] we have right of first refusal
[16:26:15] thats what i thought
[16:26:15] in our contract I think
[16:26:22] mail CT about it
[16:26:24] but he didnt see it in the system, so we should follow up
[16:26:24] he should look it up
[16:26:26] will do.
[16:27:25] do we still need those decommissioned racks?
[16:27:28] racktables can delete now
[16:27:41] if we delete it then we have no record of it existing
[16:27:58] less than ideal in a real inventory mgmt system
[16:28:00] ok
[16:28:06] I deleted the powermedium power strips
[16:28:08] but none of the rest
[16:28:12] yea those we dont care
[16:28:14] indeed
[16:28:15] indeed
[16:28:25] damn it we have worked together too long.
[16:28:42] we are saying the same things at the same time ;P
[16:29:50] robh: yes 42 ...but i think you covered it
[16:30:06] yea i thought mark was asking about something else, no worries
[16:31:05] yeah
[16:31:12] cmjohnson1: will you have time to move stuff from pmtpa B3 to D1 today?
[16:31:26] yes...whatever you need
[16:31:30] I see the management core switch is still there
[16:31:35] yes
[16:31:36] and that will involve running a bunch of patches
[16:31:40] i'm making a ticket now
[16:32:08] cmjohnson1: after you finish what you are working on for mark
[16:32:19] the issue of those not saving in bios is more than likely going to fall to you to fix.
[16:32:26] i am going to take a quick look at one now
[16:32:59] robh: okay
[16:33:52] !log poking at bios on ms-be3
[16:33:56] Logged the message, RobH
[16:36:47] the esc-0 command worked for f10 for me.
[16:36:55] its rebooting, lets see if it really saved.
[16:37:14] i hope so...i did it with cart and it did not.
[16:38:05] it said save and exit though?
[16:38:15] hrmm, yea, did it to me just now
[16:41:24] hmm
[16:41:30] I think hostway removed our out of band management link
[16:42:09] i found why the c2100 hosts do pxe first
[16:42:11] * Force PXE First [Enabled] * *
[16:42:17] under boot settings configuration
[16:43:14] changed it, i bet it works now.
[16:45:08] mark: is that a multimode fiber?
[16:45:10] ok, its installing, but grub error, but it goes to disk
[16:45:39] robh: that would make sense since the C series are the "cloud" servers
[16:45:45] cmjohnson1: it was copper
[16:45:54] the only copper link we got from hostway
[16:47:09] do you want me to find out?
[16:47:16] yes please
[16:47:23] and hopefully move, to
[16:47:24] too
[16:47:30] I want to get rid of row B
[16:47:34] k
[16:49:28] !log fixed boot order on ms-be3, fixing ms-be4
[16:49:31] Logged the message, RobH
[16:52:03] i bet eqiad looks nice now with the 3 rows
[16:52:10] cmjohnson1: kinda, we had some servers donated to us by yahoo a long time ago
[16:52:15] and they always pxe booted
[16:52:24] to get console redirection and then told to boot on disk
[16:52:29] but always to pxe first
[16:52:53] mark: matt didn't know off hand...will call me when he gets over here
[16:53:08] are you in pmtpa now?
[16:53:17] i am on 10 right now...will be moving shortly
[16:53:33] I thought you were gonna check out what's there, not ask matt ;)
[16:55:23] going to relocate now
[16:55:25] !log ms-be4 boot order fixed, fixing ms-be5 & ms-be2
[16:55:28] Logged the message, RobH
[17:03:42] RobH: thanks for finding the force to pxe thing.
[17:03:57] how is it that you can save the bios but I can't?
[17:04:01] maplebed: I believe you're gonna need the management network today?
[17:04:02] esc+0
[17:04:11] hrmph.
[17:04:13] dell f10 emulation
[17:04:24] escape always popped up a 'exit without saving' thing for me.
[17:04:27] seems it will take f10 in drac console
[17:04:29] do you have to just do it faster?
[17:04:30] but not ipmi console
[17:04:36] yea, esc then 0
[17:04:38] fast.
[17:04:42] but not at same time.
[17:05:16] mark: I'm going to be upstairs for the analytics day, so ... actually no.
[17:05:21] ok
[17:05:23] good
[17:05:30] though I'm extremely annoyed that the ms-be hosts are now in a grub error.
[17:05:34] we're gonna move the management network gear in pmtpa
[17:05:42] what is the grub error?
[17:06:06] mark/robh - found that contract
[17:06:16] we do have the ROFR
[17:06:19] - Customer shall have a right of first refusal ("ROFR") during the first twenty-four (24) months of the Initial Service Term for seven (7) additional
[17:06:19] cabinets that are specifically located in Cage 61150 in DC6 ("ROFR Space").
[17:06:24] woosters: we should let equinix know
[17:06:39] will send to Omid and ErikSilver
[17:06:44] ok
[17:08:01] RobH: do you have the grub error up?
[17:08:34] cmjohnson1: what's up? :)
[17:09:23] mark: i see a red utp going to mr1pmtpa from hostway
[17:09:33] port 7?
[17:09:54] correct
[17:09:56] hmm
[17:09:58] it's down
[17:10:02] mark: I can't get to any of the consoles at the moment. RobH reported that there was another setting in the bios I was missing to get them to boot from disk but now they're at a grub error.
[17:10:16] did you already take down the mgmt network?
[17:10:23] I didn't
[17:10:23] maplebed: console payload error?
[17:10:27] maybe chris did?
[17:10:28] mark: i have Matt coming by later...he can check his end
[17:10:32] cmjohnson1: ok
[17:10:38] maplebed: or network error?
[17:10:42] RobH: ms-be2 says another connection is active, and the rest just don't respond.
[17:10:44] cmjohnson1: can you run management uplinks to prepare for the management kit move?
[17:10:49] yea, thats just console error, not network
[17:10:53] maplebed: send consoleclose
[17:10:54] maplebed: make sure 'serial redirection after bios' is DISABLED
[17:10:58] its in the help part of the script ;]
[17:11:01] otherwise that can give a grub error
[17:11:10] this is a grub disk load error
[17:11:15] it goes to grub rescue shell
[17:11:20] not a post redirection error
[17:11:35] ok
[17:11:44] mark: i am not certain which ones are mgmt uplinks
[17:11:44] maplebed: the ~~. kills the sockpuppet connection rather than the console connection half the time
[17:11:54] so it leaves the serial open
[17:11:57] RobH: you gotta count the ~s.
[17:11:59] you have to just close it with consoleclose
[17:12:02] for me it's three.
[17:12:09] but that's only the problem on ms-be2.
[17:12:10] thats what i did wrong,
[17:12:15] the others connect but just sit there.
[17:12:17] cmjohnson1: so... every rack in row C and row D should get a separate uplink to D!
[17:12:20] d1
[17:12:27] it's easy right now
[17:12:31] because you only need to row D today
[17:12:31] maplebed: i literally just did them
[17:12:35] they take a while to reboot
[17:12:40] but they eventually posted
[17:12:41] ah.
[17:12:48] they are all working though, i watched them work
[17:12:51] ok, ms-be5 is now at the grub rescue prompt
[17:12:55] (post, reboot, save disk settings)
[17:13:01] so yea i think your install on them has issues now
[17:13:06] but the boot order issue is fixed
[17:13:07] cmjohnson1: as far as I can tell, only D2 right now has an uplink going back to B3
[17:13:10] that needs to be removed
[17:13:21] I -think- that racks D1 and D3 have an uplink to the management switch in D2
[17:13:30] those uplinks should be removed as well
[17:13:32] maplebed: it may be an issue with the partman config and the bios partition, but at least now you can reliably boot to see it happen
[17:13:49] and this is why I don't fuck around with hardware.
[17:13:50] then msw1-d1-pmtpa, msw1-d2-pmtpa and msw1-d3-pmtpa should get uplinks to msw1-pmtpa (in rack D1)
[17:14:02] d1 and d2 have uplinks
[17:14:03] this shit just pisses me off, and not in a way that inspires me to figure out how to fix it.
[17:14:07] (beware, there will be TWO management switches in D1, the normal rack management switch and the core management switch we're moving)
[17:14:19] cmjohnson1: uplinks going where?
[17:15:12] d1 is going to port 21 on msw1
[17:15:29] d2 port 11
[17:15:44] port 21 is supposed to be the SCS
[17:15:46] so how did that end up there!?
[17:15:58] anyway
[17:16:01] nevermind what is there now
[17:16:03] make new patches
[17:16:07] it is scs
[17:16:14] looked at the wrong one
[17:16:27] i don't have anything on d1
[17:16:29] sorry
[17:16:33] it doesn't matter
[17:16:35] make new patches
[17:16:45] one from msw-d2-pmtpa to top of rack D1
[17:16:52] one from msw-d3-pmtpa to top of rack D1
[17:16:59] and one from msw-d1-pmtpa to top of its own rack
[17:17:07] that's where msw1-pmtpa will be, top of D1
[17:22:49] mark patches are set
[17:22:59] good
[17:23:11] cmjohnson1: so to move mr1-pmtpa and msw1-pmtpa, we need to move two patches from hostway:
[17:23:18] 1) the fiber to the 10th floor, which is on msw1-pmtpa
[17:23:23] and 2) the copper patch just mentioned
[17:23:27] I assume you're gonna need Matt for that?
[17:23:46] yes...good assumption
[17:23:57] and he's coming over later?
[17:24:00] then we can do it then
[17:24:22] matt has already run the fiber...just not sure which goes where
[17:24:32] yeah he'll need to plug it in on his end
[17:24:47] there are two fibers to the 10th floor
[17:24:52] make sure he's doing the right one
[17:24:55] one is management, one is production
[17:25:04] we're gonna move both at some point, but right now i'm talking about management
[17:25:29] okay
[17:27:04] matt is onsite
[17:27:10] ok
[17:27:16] do you know what needs to be done?
[17:27:31] move the mgmt fiber from msw1
[17:27:36] and the copper
[17:27:37] move mr1-pmtpa, msw1-pmtpa to top of D1
[17:27:46] and the two connections from hostway
[17:35:40] mark: do you want me to use u410/41 for msw? is okay to disconnect...also...we have copper link from port 0/1 mr1 to rx80
[17:35:48] a bit lower I think
[17:35:53] keep some free space for cable mgmt
[17:36:05] you can remove the link to the rx8
[17:36:11] that's not working anyway
[17:37:22] cmjohnson1: it's probably easier to put the switch on top, the router below it
[17:37:29] since there will be cables coming from all racks from the top
[17:37:34] and the router only needs one uplink
[17:38:06] okay
[17:40:57] !log disconnecting management fiber from msw1-pmtpa
[17:41:00] Logged the message, Master
[17:46:54] !log powering down msw1-pmtpa for relocation to d1-pmtpa
[17:46:57] Logged the message, Master
[17:53:17] where are we gonna put csw5?
[17:56:20] mark: .....
[17:56:24] in the bin?
[17:56:26] ;]
[17:56:34] hehe
[17:56:36] it should go on a row end
[17:56:48] i figure to exhaust to the end cap
[17:57:04] sound right?
[17:57:29] it helps the side to side slightly, and there is so much cold air in pmtpa now i worry more about it getting rid of hot
[17:57:56] though the cold intake would be best facing towards the wall where the cooling comes in.
[17:58:19] the new row C has same hot cold orientation right?
[17:58:24] I don't know
[17:58:33] lookin at photos now, it does
[17:58:49] so if i recall csw5, when racked facing the hot row in normal network fashion
[17:58:58] has the cold intake on its left
[17:59:03] and hot output on its right
[17:59:26] so putting it in c3 would be with the cold intake facing the row of servers, oriented towards the cold air blower on the side of room
[17:59:37] with the hot output into the side of the rack, which has a side panel
[17:59:53] if we keep it turned off I'm fine with it too ;)
[17:59:54] any hot air escaping through the rack front are blown away from cold intake of the row
[18:00:04] in that case lets not use it.
[18:00:05] it's just for spares for csw1-sdtpa now
[18:00:06] ;]
[18:00:13] it can be used as a lab switch
[18:00:16] well, racking that would be the best place i think
[18:00:16] but not sure how much we need that
[18:00:20] yes
[18:00:38] it can soft off and be powered on via mgmt right?
[18:00:57] uh
[18:00:58] no
[18:00:59] then its best of both worlds, lab when needed but off and not fubaring airflow otherwise
[18:01:02] bnah
[18:01:05] damn them!
[18:01:14] i vote dont rack it ;]
[18:01:24] make a BBQ out of it?
[18:01:37] send to office to incorporate into a coffee table fixture
[18:01:42] 'this ran wikipedia for years'
[18:01:50] now it holds up my drink.
[18:02:00] hehe we COULD send it to the office
[18:02:13] for use there
[18:02:36] send them some mrj21 too ;]
[18:12:57] mark: fiber link has been re-established
[18:13:03] yes stuff is back up
[18:13:04] checking now
[18:13:32] so what's the story on the copper link now?
[18:13:36] where do you want mgmt connections for d1 and d3? (ports?)
[18:13:46] it's in the ticket...
[18:13:48] checking
[18:14:24] okay...i will check it
[18:14:45] 0/1/10 Down None None None None No l 0024.380d.7849 << msw-d1-pmt
[18:14:46] 0/1/11 Down None None None None No l 0024.380d.784a << msw-d2-pmt
[18:14:46] 0/1/12 Down None None None None No l 0024.380d.784b << msw-d3-pmt
[18:14:49] ports 10 to 12
[18:16:14] mark: spoke with Matt about out of band...he states that it was associated with "wikia" and that it was disconnected at our request a year ago. if we want to set it back up please let them know
[18:16:34] it was originally wikia that's right
[18:16:53] but it was working recently ;)
[18:17:00] I don't think we requested it to be disconnected
[18:17:02] but I'll ask them
[18:25:45] chris... I think you've created a loop
[18:26:26] make sure that none of the management switches in row D have any old/temporary uplinks left
[18:28:22] i'm so happy now that management network is separate from production network :)
[18:28:38] mark:on msw-d2 i have a mgmt link going fro 46m port 48 to msw d1 pt
[18:28:42] 46
[18:28:56] remove that
[18:31:21] mark: still looping?
[18:31:29] it's still blocked
[18:31:31] let me bounce the port
[18:31:55] still looping
[18:32:01] can you check all other racks?
[18:32:16] there should be one, and only ONE connection on each rack's mgmt switch leaving the rack
[18:32:20] and that should be to msw1-pmtpa
[18:32:34] if row B still has anything connected, disconnect that
[18:40:23] now I can't reach it anymore
[18:40:25] what are you doing?
[18:41:28] mark: nothing ...verifying i have things in the right place ....the only question i have is
[18:41:30] 0/1/20 Down None None None None No l 0024.380d.7853 cr2-pmtpa
[18:41:40]