[06:05:23] !log rebooting storage3: we have messages like May 6 05:45:12 storage3 kernel: [465081.410025] Filesystem "dm-0": xfs_log_force: error 5 returned. in the log, and the raid is inaccessible, megacli doesn't run either
[06:05:31] Logged the message, Master
[06:10:08] thanks apergos
[06:10:18] don't thank me yet
[06:10:22] we might get zip out of this
[06:10:33] you are at least trying
[06:10:52] it's hung on the mgmt console. might have to powercycle the box
[06:12:33] do either of you happen to know how the labs puppetmaster gets updated? manual like prod or auto tracks the test branch or?
[06:12:41] !log and powercycling the box instead. grrrr
[06:12:44] Logged the message, Master
[06:13:05] jeremyb: no idea
[06:13:46] apergos: I did manage to find at least 1TB (tridge) in pmtpa and 4TB in eqiad (oxygen)
[06:13:52] hrmm. well i guess i could submit something and one of you could merge (to test) and we could see what happens
[06:14:30] jeremyb: heh, I wish I could. I am not ops
[06:15:15] I think it's not going to come up. bad news
[06:15:26] ah just a sec
[06:15:31] pgehres: you don't have submit perms in gerrit?... oh, hrmmm... let me just look at gerrit directly. maybe test perms really are as restrictive as prod
[06:15:49] oh yay it's booting but no idea what we'll see when it comes up
[06:16:19] The disk drive for /a is not ready yet or not present
[06:16:26] I'm going to wait a bit
[06:16:38] but this is not making me hopeful
[06:16:44] okay, i am more concerned with /archive
[06:16:59] as the mysql on /a is a replica of db1008 and db1025
[06:18:15] I'm going to give it a couple minutes yet
[06:18:15] pgehres: looks like you do have perms for that. (i'm assuming you're in the wmf group). anyway, i need to get it pushed for review first
[06:18:33] jeremyb: heh, i love access i don't know about
[06:18:43] heh
[06:19:19] I need to put together the official greek elections drinking game here pretty soon
[06:19:44] err... it's only 9:20am?!
[06:20:16] jeremyb: despite being staff I am not in the wmf group
[06:20:40] odd…and none of my groups should give me that right
[06:20:51] yeah and some channels already started their all day coverage. but I'm not going to watch now... I just need it to circulate so friends have it by the evening
[06:22:02] ok so I think /a is not coming back, I'm going to skip the mount and see if we get anywhere
[06:22:13] The disk drive for /archive is not ready yet or not present
[06:22:20] :-(
[06:22:22] so, sorry folks. I'll leave it sit here at this prompt but
[06:22:47] I think that's where we're going to be, I expect chris will have to look at it monday (tomorrow)
[06:22:58] apergos: thanks for trying.
[06:23:02] sure
[06:23:06] it looks like locke is getting full though
[06:23:10] without storage3 around
[06:23:13] how full?
[06:23:22] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Miscellaneous+pmtpa&h=locke.wikimedia.org&v=153.881&m=disk_free&jr=&js=&vl=GB&ti=Disk+Space+Available
[06:23:28] filling fast
[06:23:46] oh joy
[06:24:13] when you get stuff from storage3 where do you put it?
[06:24:27] we leave it on storage3
[06:24:38] uugghhh
[06:24:42] we sometimes pull to process, but then delete locally
[06:24:55] the important logs are duplicated to tridge
[06:25:01] but that does not include udp logs
[06:25:16] (AFAIK)
[06:25:48] the plan was to put 10+ TB in eqiad in the new payments rack but that is still a twinkle in Jeff's eye
[06:25:54] right
[06:27:18] well we have 326gb on locke, I assume much of that is you
[06:27:30] yeah
[06:27:46] with the Terms of Use banners up, I expect it is
[06:31:42] I don't know which are your files since I can't see them on storage3 :-(
[06:32:11] yeah, and I have no idea what the naming conventions are on locke
[06:32:33] on storage3 they look like bannerImpressions-2011-11-14-12PM--45.log
[06:32:43] all right I'm going to bring up storage3 with neither /a nor /archive mounted
[06:32:44] and landingpages-2011-11-14-12PM--15.log
[06:33:39] !Log bringing up storage3 with neither /a nor /archive mounted, saw "The disk drive for /archive is not ready yet or not present" etc on boot, waited a long time, finally skipped them
[06:34:49] is morebots case sensitive?
[06:36:07] yes
[06:36:15] !log bringing up storage3 with neither /a nor /archive mounted, saw "The disk drive for /archive is not ready yet or not present" etc on boot, waited a long time, finally skipped them
[06:36:18] Logged the message, Master
[06:36:19] that's annoying
[06:36:21] I didn't type the L on purpose
[06:36:31] i figured not
[06:36:33] it's from hitting the shift for the !
[06:36:57] ok well looking at the jobs on storage3 there's a bunch of different directories and hosts involved so it's not so helpful
[06:38:43] ok I found the filemover job
[06:42:29] pgehres: not stored compressed? (guessing based on file extension)
[06:42:45] no they are compressed
[06:42:56] the samples that I had are not :-)
[06:43:40] they actually compress very nicely
[06:43:46] hmmmmm.... why can't this all go in swift?
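For reference, the two storage3 file names quoted above share a visible shape: a log type, a date, a 12-hour clock hour, and a trailing number after a double hyphen. A minimal Python sketch that splits names of that shape might look like the following. The regex, the `parse_log_name` helper, and the `suffix` field name are assumptions made for illustration; the log does not say what the trailing number means, so it is returned uninterpreted.

```python
import re
from datetime import datetime

# Pattern inferred only from the two example names in the log (an assumption).
LOG_NAME = re.compile(
    r"^(?P<kind>[A-Za-z]+)-"
    r"(?P<date>\d{4}-\d{2}-\d{2})-"
    r"(?P<hour>\d{1,2})(?P<ampm>AM|PM)--"
    r"(?P<suffix>\d+)\.log$"
)

def parse_log_name(name):
    """Return (kind, start_of_hour, suffix), or None if the name does not match."""
    m = LOG_NAME.match(name)
    if m is None:
        return None
    # Convert the 12-hour clock to 24-hour: 12AM -> 0, 12PM -> 12.
    hour = int(m.group("hour")) % 12 + (12 if m.group("ampm") == "PM" else 0)
    day = datetime.strptime(m.group("date"), "%Y-%m-%d")
    return m.group("kind"), day.replace(hour=hour), int(m.group("suffix"))
```

Something like this could help tell banner-impression files from landing-page files once the naming convention on locke is known; whether locke uses the same shape is not stated in the log.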
[06:43:50] mwuhahahaha
[06:44:18] probably because we try to keep a hard line btw public / private data
[06:44:28] so what?
[06:44:35] you could even have your own swift cluster!
[06:44:42] but not necessarily
[06:44:51] lol, I think we are getting the old netapp in eqiad
[06:45:01] oh.
[06:45:03] ;-(
[06:53:13] ok I am copying off the logs to /archive/emergencyfromlocke on hume
[06:53:16] there's only 4 files
[06:53:22] because the logs didn't rotate.
[06:53:29] yeesh
[06:53:30] I'll gzip them after they get there,
[06:53:44] thanks apergos
[06:54:00] and then they won't take up much space on hume. this will be ok for awhile, not days. someone should check again this evening my time (the morning for you folks)
[06:54:16] i will keep an eye on it
[06:54:26] well, from ganglia at least
[06:55:19] this is a bad day for availability for me, in the midday I am out and in the evening til quite late I will be unavailable
[06:56:02] I expect that after this file is moved you'll have another... at least 24 hours before stuff gets crowded again, let's see what ganglia looks like in a bit
[06:56:10] okay, i'm sure jeff green would be willing to run something during the normal day
[06:57:01] yep, which is why I am not monkeying with it
[06:57:15] he likely has a fallback plan and all that stuff, cause storage3 has been wonky before
[06:57:20] I just don't know the drill
[06:57:32] he usually has something up his sleeve
[06:57:42] well in about 1 more minute this stuff should be copied over and I can do the gzip
[06:59:08] oh. I lie
[06:59:14] it will be a lot longer than 1 minute :-D
[06:59:43] haha, still not quite as bad as xfering them to the office
[07:00:28] according to the eta, 40 minutes.
whatever
[07:01:08] i ballparked 3 days to get them from FL to SF, almost makes me want to fly to tampa with some HDDs and then fly back
[07:01:34] for the 3TB of stuff on storage3
[07:01:44] hahaha
[07:03:01] !log manually rotates udplogs on locke, copying destined_for_storage3 off to hume:/archive/emergencyfromlocke/ (jeff, this note's for you in particular)
[07:03:04] Logged the message, Master
[07:03:13] i don't even want to think about xfering them over TCP/AC (RFC 1149)
[07:03:20] heh
[07:03:41] i take it you are familiar with that one :-D
[07:04:02] Just ship the server over then ship it back again :D
[07:04:38] Damianz: once it boots and they mount i'm not letting anyone shut them down
[07:08:03] you say :-P
[07:08:36] i will use everything in my (very limited) power
[07:09:23] i think the most effective technique would be bribing cmjohnson to yank network
[07:09:39] * pgehres starts baking cookies
[07:13:09] * Damianz thinks you could stand guard with a nerf gun
[07:13:33] * pgehres would need a large nerf army with our distributed ops team
[07:13:42] half an hour to go
[07:18:48] 20 mins
[07:20:46] that was a fast 10 mins, are you moving close to the speed of light?
[07:21:20] no
[07:21:34] :-( that would have been cool
[07:21:38] if anything I'm moving close to the speed of frozen molasses
[07:21:45] which is pretty dang slow, let me tell you
[07:25:49] while we wait... if any commentator says "austerity" (λιτότητα): drink. if any commentator says "stability" (σταθερότητα): drink. if any commentator says "most important elections since the restoration of democracy" (οι πιο σημαντικές εκλογές της μεταπολίτευσης): drink
[07:26:06] at that point we ought to be well soused in... oh, I'd say the first 15 minutes
[07:26:46] heh, might want to stick to low-point beer
[07:27:00] you'll still get trashed
[07:27:33] 'fraid so
[07:27:54] oh, *that's* the game
[07:28:12] yeah, what did you think I meant?
[07:28:23] how much do you drink each time? is it constant?
[07:28:35] exponential
[07:28:39] ;-)
[07:28:48] idk, i just didn't think it had to do with commentators (didn't think of it)
[07:29:41] I expect given the way this evening is going to go, it's going to be a healthy gulp but no more than that
[07:30:04] at some point of course you won't have good control over the size of the drink...
[07:30:24] 10 mins
[07:31:28] 50MB/s isn't too bad
[07:31:56] awesome. the polls have been open for two hours and there's (at least) one polling station without the ballot box
[07:31:58] nice job guys
[07:32:08] somehow they "forgot" to deliver it
[07:32:31] do they have teams of lawyers from each side waiting to work on problems like that?
[07:32:36] is it even 2 party? i guess no
[07:32:43] it's 36 party
[07:32:48] oooh
[07:32:53] 8 to 80 parties that will get seats
[07:32:58] er 8 to 10
[07:33:00] sorry
[07:33:17] 8 parties is an amazing fracture
[07:33:27] is it sad that the first thing I think is that each party gets a single hex digit
[07:33:42] oops, not hex, alpha+10
[07:33:55] perfectly normal
[07:34:27] although i assume there's some lifecycle. parties die, parties are born
[07:34:29] i would be willing to try 8 to 10 parties with seats, 2 certainly doesn't work
[07:34:56] so, to have primary keys unique across all time then you need >1 digit
[07:35:02] are any the Pirate Party?
[07:35:21] i think there was a pirate party ad on tv the other day?
[07:35:48] or something.
don't remember exactly
[07:36:07] i think it was the first pirate party candidate appearing on a ballot in the US
[07:36:19] * pgehres is looking
[07:37:21] we have a local pirate party yes
[07:37:35] they're not expected to pick up any seats
[07:38:10] wow, you guys had 70% turnout in 2009
[07:38:41] voting is mandatory here
[07:39:56] huh, turnout was higher than I thought in our 2008 election at ~62% of eligible voters
[07:40:14] wonder how it will be this year
[07:41:08] i have no idea
[07:41:15] but i imagine it will be lower
[07:41:26] compression started on hume now
[07:41:44] I would expect it to be a fair amount lower, since the race isn't expected to be close (is it?)
[07:41:54] ho hum, gzipping 120gb...
[07:41:59] i think it could still be close
[07:42:21] i don't think it will end up close, but i think it's still a toss-up
[07:57:07] I cleared off those logs from locke and the gzip will just run til it's done
[07:57:22] so at this point I'm not really here ;-)
[07:57:35] thanks for all the help apergos
[07:57:40] sure
[07:57:40] enjoy election day
[07:57:50] I hope so but I have a sinking feeling
[07:57:56] anyways we'll see :-) have a nice evening
[09:41:49] so, idk if this will work, but could someone please merge 6584 and 6727 ? (both to puppet's test branch)
[09:42:18] * Damianz feeds jeremyb cookies
[18:54:28] robla: your message on the VP about the deployment, are you not using the central notice this time? I ask because it was last set to be on 25/04/2012 so was last used when 1.20wmf1 was deployed - are the changes in 1.20wmf2 small so don't need the CN?
[22:06:10] i think nagios-wm is dead? or at least broken?
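The transfer figures quoted earlier in the log (a 40-minute ETA, 50MB/s, 120gb to gzip, and a 3-day ballpark for 3TB from FL to SF) can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check: it assumes decimal units, a constant rate, and that the 120gb being gzipped is roughly what was copied. The helper names are made up for illustration.

```python
# Back-of-the-envelope transfer times.
# Decimal units (1 GB = 10**9 bytes) and a constant rate are assumptions.
MB = 10**6
GB = 10**9
TB = 10**12

def transfer_seconds(size_bytes, rate_bytes_per_s):
    """Time to move size_bytes at a constant rate, ignoring protocol overhead."""
    return size_bytes / rate_bytes_per_s

def required_rate(size_bytes, seconds):
    """Constant rate needed to move size_bytes in the given number of seconds."""
    return size_bytes / seconds

# Copy from locke to hume: about 120 GB at about 50 MB/s.
copy_minutes = transfer_seconds(120 * GB, 50 * MB) / 60

# Ballpark from the chat: 3 TB in 3 days.
office_rate_mb = required_rate(3 * TB, 3 * 24 * 3600) / MB

print(f"120 GB at 50 MB/s: {copy_minutes:.0f} minutes")
print(f"3 TB in 3 days: {office_rate_mb:.1f} MB/s sustained")
```

Under those assumptions the copy works out to 40 minutes, which matches the ETA mentioned in the channel, and the 3-day ballpark corresponds to a sustained rate of roughly 11.6 MB/s.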
[22:09:54] also, locke's getting low on space again
[22:09:58] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Miscellaneous+pmtpa&h=locke.wikimedia.org&v=153.881&m=disk_free&jr=&js=&vl=GB&ti=Disk+Space+Available06
[22:10:16] jeremyb: yeah, been watching it
[22:10:27] k
[22:10:33] hard to find an opsen on the wkend
[22:10:49] and it's not to the wake people up stage…yet
[22:11:00] s/wake/page
[22:11:05] just turn off wikipedia for a few hours; they'll show up ;)
[22:11:30] lol, if we turned off the ToS banners it would slow the death of locke
[22:11:38] heheh
[22:11:57] it's logging every CN banner imp atm
[22:12:55] is there a plan for the data? or do you just log all impressions for all banners ever?
[22:13:10] the data gets used for the fundraising tests
[22:13:26] theoretically we don't need it all the time, but it's not trivial to enable/disable
[22:27:10] every single IP pgehres? that's a lot of IPs in one place...
[22:27:48] the squids output those, but we don't use them
[22:28:03] jeremyb: on that graph, what happened on sunday morning? when the space jumped back up?
[22:28:09] some cleaning script?
[22:28:13] apergos did a logrotate
[22:28:17] by hand
[22:28:30] storage3 is broken
[22:28:38] that's where they usually go after locke
[22:29:53] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Miscellaneous+pmtpa&h=locke.wikimedia.org&v=150.963&m=disk_free&jr=&js=&vl=GB&ti=Disk+Space+Available
[22:30:14] pgehres: ^ it looks like something other than log rotation changed, though, doesn't it?
[22:30:43] robla: well, i guess i meant logrotate and push to tridge
[22:30:54] see SAL
[22:31:11] 07:03 apergos: manually rotates udplogs on locke, copying destined_for_storage3 off to hume:/archive/emergencyfromlocke/ (jeff, this note's for you in particular)
[22:31:18] sure, I know about that
[22:31:23] oh it's a robla :)
[22:31:46] Thehelpfulone: re ^^^^ I'm not planning on using CentralNotice for 1.20wmf2
[22:32:02] ok, so things shouldn't really break?
[22:32:23] no more likely to break than many other things we do that we don't use CN for :)
[22:32:29] heh
[22:32:35] also robla any chance that you could handle https://rt.wikimedia.org/Ticket/Display.html?id=2904 - I think the wiki has already been created, it just needs that so that the HK team can get to work.
[22:32:48] it should be a quick and easy job as the code has already been written it just needs to be placed in the correct place.
[22:33:36] Thehelpfulone: that one seems to be underway....any reason for that to be a weekend job?
[22:34:18] robla: btw, see above about nagios-wm
[22:34:44] not sure what's up with nagios-wm
[22:35:23] robla: also, would you mind doing a couple merges on puppet's test branch for me?
[22:35:42] I presume that's different to the one in labs that's also broken?
[22:35:44] 06 09:41:48 < jeremyb> so, idk if this will work, but could someone please merge 6584 and 6727 ? (both to puppet's test branch)
[22:36:52] https://gerrit.wikimedia.org/r/#/c/6584/ and https://gerrit.wikimedia.org/r/#/c/6727 if you're lazy like Ryan Lane ;)
[22:37:08] Thehelpfulone: 36 hrs!
[22:38:11] Thehelpfulone: what are you saying is broke in labs?
[22:38:23] sorry...just trying to handle things that won't wait until tomorrow (like the disk problem on locke)
[22:38:31] whatever the bot is reporting in #wikimedia-labs
[22:38:44] (got people waiting on me)
[22:38:54] robla: bye ;)
[22:39:09] Thehelpfulone: i think that's working as designed?
the problem here is that the bots saying *nothing* when it should speak
[22:39:12] hmm jeremyb on that 6584, line 21 "^/(index\.html?)?$" => "https://lists.wikimedia.org/mailman/listinfo", can that be duplicated and inserted above line 41 ( "^/mailman/?$" => "/mailman/listinfo") to the mailman.wmflabs.org URL?
[22:39:17] bot's*
[22:40:08] oh it looks like there's a problem with it at the moment http://mailman.wmflabs.org/mailman/listinfo is down
[22:40:22] as it should be ;)
[22:40:47] oh? maybe I'm behind :P
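The question about change 6584 concerns where a pattern-to-target redirect rule sits in an ordered list. The change itself is not shown in the log, so the following is only a small Python model of how such an ordered rule list could be evaluated. The two patterns and targets are copied from the chat; the `RULES` list, the `resolve_redirect` helper, and first-match-wins evaluation are assumptions for illustration, not the actual server configuration.

```python
import re

# Ordered (pattern, target) pairs copied from the two rules quoted in the chat.
# First-match-wins evaluation is an assumption about the web server's behaviour.
RULES = [
    (r"^/(index\.html?)?$", "https://lists.wikimedia.org/mailman/listinfo"),
    (r"^/mailman/?$", "/mailman/listinfo"),
]

def resolve_redirect(path, rules=RULES):
    """Return the target of the first rule whose pattern matches path, else None."""
    for pattern, target in rules:
        if re.search(pattern, path):
            return target
    return None
```

In this model the two quoted patterns never match the same path, so their relative order would not change the result; position only matters when a path can match more than one pattern.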