[06:50:56] !log I've corrupted the _base directory on the instance's glusterfs share. I'm recovering the files from file descriptors using lsof. Not totally sure how I'm going to get the _base directory back, yet.
[06:51:00] Logged the message, Master
[06:52:49] !log all of the instances are accessing the file descriptors of files inside of the _base directory, and fuse has an issue with this. gluster can't recreate the base directory because of the processes holding open the old one.
[06:52:52] Logged the message, Master
[12:53:52] !log Upgraded observium to latest version
[12:53:55] Logged the message, Master
[13:50:53] !log Set increased OSPF/OSPFv3 metric 30 on both directions of the link cr1-eqiad:xe-5/2/1 <--> cr1-sdtpa:xe-0/0/1, to combat higher than normal jitter and packet loss on the link
[13:50:56] Logged the message, Master
[15:18:18] !log Fixed DNS resolving on the core routers by allowing DNS replies in the loopback filter
[15:18:22] Logged the message, Master
[17:16:43] !log adjusted LVS partitions on hume, moved /usr/local/apache to a new 5GB mount
[17:16:46] Logged the message, Master
[17:35:31] !log removed 3GB db30:/tmp/gmond.log and force-restarted gmond b/c the init script failed to restart it
[17:35:34] Logged the message, Master
[18:01:21] !log Raised MTU between cr1-sdtpa - (csw1-sdtpa) - cr2-pmtpa to 9192
[18:01:24] Logged the message, Master
[18:03:14] cmjohnson1: did power flip out in tampa ?
[18:03:28] no
[18:04:47] mark: csw1-sdtpa not responsive via ip
[18:04:55] working on that
[18:05:02] don't worry
[18:05:03] just mgmt
[18:05:23] csw1-sdtpa is connected via a temp vlan on the ae0 link
[18:05:36] which just went to 9192 mtu, and csw1-sdtpa is (and remains) 1500
[18:05:38] so ospf fail
[18:05:58] but now chris is here... he can connect csw1's mgmt interface to the mgmt network
[18:06:16] and we can move it out of production/in-band
[18:07:26] whew
[18:07:56] mtu changes are such a fucking pain ;)
[18:13:16] mark: i have 3 mgmt links coming out of csw1 and only one is connected to msw1-sdtpa...i have 2 serial console cables as well going to the scs
[18:13:35] 3 mgmt links?
[18:13:48] so 2 serial, one msw1-sdtpa?
[18:13:51] 2
[18:13:55] management
[18:14:00] correct
[18:14:02] i'm sorry, I don't understand
[18:14:12] 2 serial I know about, that always needs to be connected
[18:14:14] 2 mgmt links ...only 1 going to msw1
[18:14:14] but ethernet?
[18:14:21] yes
[18:14:22] yes
[18:14:27] is it to the right port already then?
[18:15:27] yes...20
[18:15:41] oh ok
[18:15:44] then you're done ;)
[18:18:02] cmjohnson1: is it to the primary or secondary management module?
[18:18:39] oh damn
[18:18:43] the secondary is active
[18:18:46] that explains
[18:19:13] meh
[18:19:31] do you want to swap
[18:19:44] no, we should increase the size of the mgmt core switch first
[18:19:47] from a 24 to a 48 port switch
[18:19:54] what do we have available in terms of manageable switches?
[18:19:58] a few foundry GSs I think
[18:20:00] also an LS?
[18:20:18] i have a juniper 48port
[18:20:32] that's a spare
[18:20:36] we cannot use that
[18:20:43] i have netgear
[18:20:49] nogood ;)
[18:21:16] that's all i have on site
[18:21:23] no foundry GS?
[18:21:27] where did they go?
[18:21:41] i have a foundry...the old asw-c3
[18:21:47] yeah
[18:22:42] it's a bit cramped in rack A1
[18:22:50] it is
[18:23:21] i can put it above the scs
[18:23:25] hmm just leave as is for now I think
[18:23:26] i have 2u there
[18:23:28] I'll think about options
[18:23:34] yeah a 1.5U switch is not ideal there
[18:23:54] k...let me know
[19:30:10] LeslieCarr: should i close the OTRS ticket on the thai ISP?
[19:32:13] sure
[19:34:48] k
[19:35:37] 2012030510004441 is closed now
[20:33:15] robla: AaronSchulz (or anyone else): any objections to putting swift back in the loop?
[20:33:50] so I guess the cleanup script is done
[20:33:55] yes.
[20:34:25] maplebed: double check the deployments cal to make sure you're not doing it during someone else's window, but otherwise, no prob
[20:34:47] robla: you mean http://wikitech.wikimedia.org/view/Software_deployments right?
[20:34:53] yup
[20:35:13] "Monday, March 5 - putting Swift thumbnail storage back into production "
[20:35:16] someone was kind.
[20:35:34] alright, looks clear to me.
[20:36:25] !log enabled swift for 100% of thumbnails in production
[20:36:29] Logged the message, Master
[20:40:51] after i upload a package with reprepro how long should it take before i can do an apt-get install with it ?
[20:41:39] it should be immediate
[20:41:44] *(but you have to run apt-get update first)
[20:41:49] (on the client host)
[20:45:35] Is mutante still away on holiday/whatever? Not seen him speak in quite a while
[20:45:44] Reedy: he just recently got back.
[20:45:48] he'll be online in a few hours.
[20:45:53] (likely)
[20:46:05] Right, so he was still away ;)
[20:46:07] Cheers
[21:37:39] is anyone around who can walk me through an apache config change? O
[21:37:41] err
[21:37:55] I've only used scap once, kinda afraid of it still
[21:38:40] Jeff_Green: i think you can sync just the one file? (assuming it is just one)
[21:38:52] maybe RoanKattouw?
[21:39:04] just because he /nick'd recently ;)
[21:39:07] Oh hey
[21:39:09] Yeah, sure
[21:39:22] Jeff_Green: What is the path of the specific file?
[21:39:23] it's redirects.conf
[21:39:29] /home/w/conf/httpd/redirects.conf
[21:39:29] Oh OK
[21:39:35] So I believe sync-apache syncs that
[21:39:38] the new version is at /root/redirects.conf
[21:39:39] Let me read the code
[21:39:55] Yup, looks like it
[21:39:56] so I just mod the file and it'll go out on its own, or I manually run sync-apache
[21:40:06] You mod the file on /home , then run sync-apache
[21:40:13] ok
[21:40:27] Also, /h/w/conf/httpd is versioned (locally hosted SVN)
[21:40:31] what about svn/git commit? is done for you?
[21:40:37] So use svn diff to sanity-check your changes, and commit manually
[21:40:54] last time I ran this was in maybe October, but I vaguely remember needing to call sync-apache etc with other arguments?
[21:40:54] wmf-config has an autocommit script, this one doesn't
[21:41:03] * RoanKattouw is sad to see the last commit was by root
[21:41:14] sync-apache doesn't take arguments
[21:41:15] i didn't realize this was different than wmf-config
[21:41:21] There is no dollar sign anywhere in that script
[21:41:29] Other sync scripts do
[21:41:51] sync-file , ditto for sync-common-file and sync-dir , and scap
[21:42:59] use single quotes to quote your reasons to avoid http://wikitech.wikimedia.org/index.php?title=Server_admin_log&diff=44184&oldid=44183 ;-)
[21:43:01] and to test on a particular server, do i use sync-apache there too? or does that only work for sync-common?
[21:43:28] i.e. pre-test my redirect syntax before redirecting all of wikipedia to the wikipedia shwag store :-)
[21:44:20] I'm not sure whether sync-common pulls Apache too, let's see
[21:44:23] reading this: http://wikitech.wikimedia.org/view/Scap but it doesn't explain which scripts are intended for use where
[21:45:09] It doesn't seem to do Apache config
[21:46:56] ok. so if i want to pre-test it sounds like I need to manually drag the file to my test server
[21:47:51] I think so yeah
[21:47:56] ok
[21:48:00] robh: r u around?
[21:49:33] cmjohnson1: he's 3.5 hrs idle
[21:49:58] thx jeremyb...i know he is sick today...was hoping he was monitoring.
[21:51:58] oh, that's too bad...
[21:53:44] i am.
[21:53:47] monitoring.
[21:54:00] whats it like to take a sick day and not have the sound on for pings?
[21:54:05] cmjohnson1: whatcha need?
[21:54:28] * RobH can stand to look at the computer monitor on the lowest brightness setting for a short while ;]
[21:54:29] received 2 servers...don't see anything regarding them.
[21:54:41] packing slip doesnt have rt #?
[21:55:49] sorry..go back to bed!
[21:57:43] just cuz it has rt doesnt mean anyone has dispatched where they will go, but if there is no linking ticket feel free to drop one and assign to me for further racking info
[21:58:36] ok the redirect seems to work as expected on srv193
[21:58:37] no dispatch on where they'll go but I could've bugged you about that tomorrow. the rt says 20 but i only received 2 ...should I expect more
[21:58:50] rt#?
[21:58:51] launching apache conf torpedoes
[21:58:57] 2407
[21:59:14] getting the equinix ticket together now
[21:59:23] cmjohnson1: not sure what reads 20 on that
[21:59:35] so those are part of the swift order, the frontends to be exact
[21:59:46] that same rt # will also include the c2100s
[22:00:02] so just scan and attach slips for the one order and compare to ticket (i think you misread if you got 20)
[22:00:30] mixing rt#'s
[22:00:43] heh, it happens
[22:00:50] you had the squid order one we hadnt placed i imagine ;]
[22:01:07] so we have ms-fe1 and 2, the two you just got in are ms-fe3 and ms-fe4
[22:01:14] you can get them into racktables, but not in a rack
[22:01:36] ok
[22:01:38] they need to be spaced out from the other front ends, so i need to balance where best to place them, and I dont feel like pulling up torrus, estimating power use, etc...
[22:01:39] ;]
[22:02:01] cold meds != clear mental photo of datacenter planning
[22:02:31] cmjohnson1: though, you can feel free to do just all that for me and make a recommendation in the new ticket, just dont rack them yet
[22:02:52] isn't that what google sketchup is for? ;-)
[22:02:54] ie: review torrus and racktables to find a rack that can accommodate one of each of these servers that doesn't also house the other two front ends
[22:02:54] ok...let me see where they can go
[22:03:09] and i will take a look at your recommendation tomorrow and we can go from there =]
[22:03:11] right....all on 10?
[22:03:22] can go to either floor, but you need to try to not mix rack uses
[22:03:28] ie: pmtpa only has d1 for misc use
[22:03:41] except we are getting the new row power strips in on the 16th
[22:03:51] ok...cool!
[22:03:58] but thats awhile off, still need to ensure we have networking, etc, so do your best
[22:04:10] will do
[22:04:28] thx
[22:04:29] mark_ and i have ongoing discussions on how we envision laying out the datacenters, so you wont have that insight unfortunately
[22:04:33] but you will have enough for an educated guess
[22:05:14] garg sync-file is apparently the wrong tool
[22:05:32] i thought it was decided to use sync-apache?
[22:06:05] ok-doky, im goin back to coma
[22:06:08] if it was i missed it but that does seem more consistent with the wikitech scap page. skimming backscroll
[22:06:17] jeff 05 21:45:09 < RoanKattouw> It doesn't seem to do Apache config
[22:06:44] RobH: get some tea and get better!
[22:07:18] well that's fugly.
[22:07:19] where jeff^I is suppose to be "Jeff_Green: " of course ;)
[22:07:25] ya
[22:07:30] the output looks like a fucking trainwreck
[22:07:59] srv269: cp: cannot stat `/home/wikipedia/conf/httpd/apache2.conf': No such file or directory
[22:08:14] I have no idea if I just took the site down based on this output.
[22:08:33] tons of rsync permissions errors, file transfer errors, missing file errors
[22:08:44] thats not entirely unheard of
[22:08:55] sigh. no wonder I have so much angst about this
[22:08:57] if its for .svn directories, you can kinda disregard it, its not good, but its not breaking the site
[22:09:21] srv277: cp: cannot stat `/home/wikipedia/conf/httpd/apache2.conf': No such file or directory <-- what about those?
[22:09:23] at least it used to be, but i have not pushed a site change via scripts in 30+ days, only via puppet
[22:09:27] hrmmm.
[22:09:34] no thats kinda fubar.
[22:09:38] https://en.wikipedia.org/wiki/Special:RecentChanges does seem to be getting updated...
[22:09:52] okay, turns out icinga kicks ass after you finally forever get it working
[22:10:12] Jeff_Green: so you changed apache files?
[22:10:13] should I redo as root? this looks like permissions meltdown to me
[22:10:17] i changed one file, redirects.conf
[22:10:25] apache-sync-all
[22:10:30] then apache-gracefull-all
[22:10:44] sorry, backwards
[22:10:54] sync-apache
[22:11:10] i did sync-apache and that's what barfed pages and pages of output
[22:11:14] now I am paused
[22:11:27] hrmm, then yea i would attempt as root
[22:11:35] see if it works, and its a temp fix.
[22:11:55] i thought the apache files ran in wikidev group and therefore the users in there had access
[22:12:02] but i am not 100%
[22:12:49] are these bash scripts? you could try throwing 'set -x' in the top of it to see what it's trying to do as it does it
[22:13:35] * RobH feels crappy and slouches back to bed
[22:13:47] sync-apache as root still looks pretty bad
[22:14:06] but better
[22:14:20] LeslieCarr: is there a demo up?
[22:14:42] https://neon.wikimedia.org/icinga/
[22:14:46] however you need to login
[22:14:51] lemme see if i can get the port 80 working :)
[22:15:00] oh port 80 does appear to work
[22:15:05] try http:// ?
[22:17:11] definitely looks enough like nagios that it could just be an upgrade
[22:17:17] at first glance at least
[22:17:47] although, now you turned it off and i can't see anything ;-)
[22:17:50] jeremyb: oh
[22:17:53] that was me breaking stuff
[22:17:54] sorry
[22:17:54] hehe
[22:17:55] jeremyb: it's supposed to be a fork + betterness.
[22:18:41] maplebed: yeah, i guessed it was something like that ;-)
[22:18:50] damn this internet (cable) sucks ;(
[22:24:33] !log modified redirects.conf per RT #2488
[22:24:37] Logged the message, Master
[22:31:07] Jeff_Green: did it actually get deployed? logmsgbot should have said something
[22:31:48] well
[22:32:05] sync-apache doesn't seem to want the comment part
[22:32:23] at least when I ran it with no arguments expecting it to barf usage info, it just started syncing shit
[22:33:23] Jeff_Green: oh, maybe that script's different. did you do the graceful part? i think usually when i see people gracefulling it's accompanied by a manual !log
[22:33:40] i did, and that doesn't ask either
[22:33:53] see "manual !log" ;)
[22:34:56] Ryan_Lane: looks like the gerrit IRC bot has died somehow :)
[22:37:43] hm
[22:37:50] likely because it's running on two boxes
[22:39:34] New patchset: Ryan Lane; "Giving glustermanager a shell" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2923
[22:42:07] New patchset: Lcarr; "making a routers group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2924
[22:42:45] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2921
[22:42:48] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2921
[22:43:11] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2924
[22:43:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2924
[22:43:42] Ryan_Lane: thanks :)
[22:44:26] yw
[22:50:23] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2923
[22:50:36] hm. why is the damn lint check not running?
[22:50:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2923
[22:50:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2923
[23:02:30] New patchset: Ryan Lane; "Adding python-paramiko to openstack::project-storage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2925
[23:03:24] RAWR, run my damn lint check!
[23:03:40] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2925
[23:03:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2925
[23:03:49] LeslieCarr: have the lint checks been running for you?
[23:04:03] Ryan_Lane: last one didn't, 2 ago did
[23:04:12] it's killing me
[23:05:37] grr, did my DNS changes just go out mid-edit!? dammit.
[23:05:47] how would that be possible?
[23:05:49] why did I get 26 nagios host down messages delivered to my inbox in the last 3 minutes?
[23:06:01] where's the nagios bot, thinking of that
[23:06:09] f if I know, I'm editing files on sockpuppet, haven't checked them in yet
[23:06:19] then it isn't possible
[23:06:20] but it's all the domains I'm working on that are paging
[23:06:29] got damn it
[23:06:35] I just restarted nagios rather than the bot
[23:06:42] oh, whew
[23:06:43] nagios@neon.wikimedia.org?
[23:06:59] oh. that's the new nagios server that LeslieCarr is testing
[23:07:09] sorry everyone
[23:07:17] just restarted icinga, shouldn't be happening any more
[23:08:00] LeslieCarr: how do you like icinga ? :)
[23:09:10] hashar: so far looks pretty cool took a bit to get going - http://neon.wikimedia.org/icinga/ (when i restart icinga after fixing neon's emailing everybody)
[23:10:11] hashar: definitely needs some love on the packaging but looks so much nicer than nagios
[23:10:19] and nagios needed a ton of packaging love :)
[23:12:54] New patchset: Lcarr; "removing nagios monitor from neon doing manual changes, prevent overwriting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2926
[23:16:46] looks nice indeed
[23:18:33] ultimately, you want nice dashboard for CT
[23:18:37] showing a big green button
[23:19:25] maplebed: are you super busy today?
[23:19:31] and when he pushes the button everything works ? :)
[23:19:48] he never has. That is the point :-)
[23:20:03] New patchset: Lcarr; "removing nagios monitor from neon doing manual changes, prevent overwriting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2926
[23:20:23] AaronSchulz: working on wikimania stuff atm, so I think no.
[23:21:00] just to confirm, nobody else is getting the neon nagios spam, yah ?
[23:21:10] in the last 5 minutes
[23:21:12] * AaronSchulz can't make sense of https://graphite.wikimedia.org/dashboard/temporary-4
[23:21:57] * AaronSchulz meows to binasher about p90 and graphite
[23:22:18] yes
[23:22:26] it will happen this week
[23:22:30] yay
[23:23:14] AaronSchulz: what part of that graph is confusing?
[23:24:36] maplebed: I don't get why is causing the plateaus in the p50 time graph
[23:24:38] *what
[23:25:11] AaronSchulz: am I supposed to be looking at the wmfPurgeBackendThumbCache.count?
[23:25:48] the one below that
[23:26:11] wmfPurgeBackendThumbCache.tavg?
[23:26:26] mostly tp50
[23:26:36] I don't know what that graph is talking about or what the Y axis means, or where you're seeing 'tp50'.
[23:26:43] sorry. man, I'm just confused.
[23:26:46] * AaronSchulz wonders if the link works
[23:27:00] there is no wmfPurgeBackendThumbCache.tp50?
[23:27:16] this is what I see: http://screencast.com/t/t2SHfJq3u
[23:27:31] yeah, the link is broken for some reason
[23:27:51] maplebed: load again
[23:28:26] ok, I see tp50 now, but I still don't know what it's measuring.
[23:28:29] or why it's interesting.
[23:28:29] * AaronSchulz forgot to "save dashboard"
[23:28:45] or what the units are.
[23:29:01] "1K" = 1000 ms
[23:30:27] * AaronSchulz wonders if it would help to have a higher sampling factor
[23:35:46] AaronSchulz: tp50 == 50% of calls finish in <= that time value in ms. not 1 in 50
[23:36:15] 50th percentile
[23:36:22] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[23:36:25] i.e. the median
[23:48:22] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[23:48:47] binasher: Are you planning any large schema changes soon? We found an index on the revision table that was added to MediaWiki in 2007 but somehow never made its way onto the production DBs
[23:49:04] Not a scientific result, but I just checked the enwiki master and it didn't have it
[23:49:17] hah
[23:49:27] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2926
[23:49:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2926
[23:49:35] i still have to add revision.rev_sha1 on enwiki
[23:49:41] Aha
[23:49:42] so it can be included with that
[23:49:45] Nice
[23:49:54] It looks like all wikis are missing it, even newly created ones don't have it
[23:49:56] Which is very strange
[23:50:00] i wasn't planning on doing a master swap though, since the master does have that column
[23:50:27] Lol, mystery solved
[23:50:39] The schema change was reverted later in '07 and never in an actual release
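A side note on the tp50 definition given at 23:35:46 ("50% of calls finish in <= that time value in ms"): a minimal sketch of a nearest-rank percentile over a plain list of per-call timings. The function name `percentile`, the sample values, and the choice of the nearest-rank method are all illustrative assumptions; the log does not say how the profiling data behind the graphite dashboard actually computes tp50 or p90.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Rank is 1-based: ceil(p% of n), so p=50 with n=10 gives rank 5.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical call durations in milliseconds (made up for illustration).
timings_ms = [120, 85, 1000, 95, 110, 300, 90, 105, 100, 2500]

print("tp50:", percentile(timings_ms, 50))  # 105
print("tp90:", percentile(timings_ms, 90))  # 1000
```

With ten samples, nearest-rank picks the 5th-smallest value for tp50, whereas a textbook median would average the 5th and 6th values, so small differences between tools are to be expected. The sketch also shows why a few slow calls move tp90 far more than tp50.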