[09:43:15] Change on mediawiki a page Wikimedia Labs/Tool Labs/List of Toolserver Tools was modified, changed by Tim.landscheidt link https://www.mediawiki.org/w/index.php?diff=847917 edit summary: /* Active Tools on the Toolserver */ Duplication Detector was migrated to Tools.
[11:10:44] NFS broken?
[11:10:52] I think so
[11:10:58] just noticed my irc bots going down
[11:11:04] local-steinsplitter@tools-login:~/public_html$ ls
[11:11:04] ls: cannot open directory .: Stale NFS file handle
[11:11:13] same
[11:14:53] *opened a bug* :)
[11:15:48] Steinsplitter: confusingly, the "Tools" product isn't for Labs. That's for general MediaWiki tools
[11:15:59] I moved it into Wikimedia Labs
[11:16:26] oh. thx
[11:16:55] hi, is something wrong with labs right now? I get reading/writing errors for almost anything!
[11:16:57] "cannot open `commonscatnl.err' for reading: Input/output error"
[11:17:21] https://tools.wmflabs.org/ 403-forbidden
[11:17:57] local-dexbot@tools-login:~$ qstat
[11:17:57] sh: 0: getcwd() failed: No such file or directory
[11:18:21] Coren: some issues to fix
[11:19:21] so what can we do?
[11:21:35] nfs is broken
[11:21:50] Amir1: waiting for ringmaster Coren to take action ;)
[11:22:31] hedonil: Okay, thank you
[11:32:56] this reminds me of an old manowar song
[11:34:15] "When the smoke of the broken f*** NFS did clear, many thousand bots and tools were dead ... "
[11:34:46] "Their bodies lay broken and scattered across the battle field... "
[11:35:06] "Like brown leaves blown by the wind."
[11:47:48] wmflabs down?
[11:48:54] yes
[11:55:25] can't seem to be able to create categories in wikimedia commons too... related?
[11:56:12] warpath: which error do you get?
[11:56:22] related no i think o_O
[11:57:09] lol weird ..my bad then..
[11:57:24] for wmflabs? permission error
[11:59:23] no.... for "create categories in wikimedia commons too."
[12:00:08] oh, after clicking 'save', nothing got created, fixed now..must have been a 'human error'..ME :P
[12:05:46] much of labs may be inaccessible due to a filesystem issue on one of the lab storage hosts
[12:12:31] ls: cannot open directory .: Stale NFS file handle
[12:12:43] https://bugzilla.wikimedia.org/show_bug.cgi?id=58888
[12:13:09] yes, that would be it
[12:13:51] I see someone has added it to the topic
[12:14:05] it's not nfs issues so much as actual filesystem (xfs) issues
[12:15:16] zhuyifei1999: yes, it is the same because the system does not find the file... but the bug was opened 11:12:09 UTC :P
[12:15:31] *lol* 5_8888_
[12:15:56] Steinsplitter: cool number
[12:16:03] indeed :P
[12:22:31] Steinsplitter: you could offhand set this error ticket to importance: Immediate/blocker - probably doesn't help much, but is more honest
[12:22:58] {{s}}
[12:35:46] good thing is that it is still possible to query the db
[12:36:40] at this point we are waiting for coren to come online
[12:37:09] so tweaking the bug report won't make a difference (we know its priority etc)
[12:40:45] hmm.
[12:41:30] also if anyone needs some sql querying in this period, feel free to ask me
[12:48:34] zhuyifei1999: how about: SELECT eta FROM tools.labs WHERE operation='back_to_normal' ? :P
[12:49:32] hedonil: but there's no database named "tools" :P
[12:49:54] zhuyifei1999: there *must* be one ;)
[12:50:11] unknown column 'eta' in ... :-P
[12:50:22] | tnwiktionary_p |
[12:50:24] | towiki_p |
[12:50:25] | towiktionary_p |
[12:50:26] | tpiwiki_p |
[12:50:30] :D
[12:50:47] hedonil: ^^ :P
[12:51:10] zhuyifei1999: rofl'd
[12:53:12] zhuyifei1999: of course asking for eta was the wrong question..
http://www.flickr.com/photos/110698835@N04/11363420046/
[13:14:17] @seen petan
[13:14:17] zhuyifei1999: Last time I saw petan they were quitting the network with reason: Ping timeout: 246 seconds N/A at 12/22/2013 10:34:06 AM (1d2h40m11s ago)
[13:15:46] Does anyone know what replaced "Labs NFS cluster pmtpa"?
[13:16:34] https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&m=cpu_report&s=by+name&c=Labs+NFS+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[13:30:00] What's up with tools.wmflabs.org?
[13:30:21] I'm getting 403 errors
[13:30:50] huskyr: nfs issue
[13:30:51] huskyr: outage
[13:31:02] https://bugzilla.wikimedia.org/show_bug.cgi?id=58888
[13:31:32] Ah, thanks zhuyifei1999 and Betacommand
[13:33:15] huskyr: email just sent to labs mailing list
[13:34:23] Just curious: When people told me to migrate from the lousy old toolserver to the modern well-organised stable tools labs... they were having me on, weren't they?
[13:35:37] krd: labs is in most cases better
[13:36:08] krd: did you forget the year+ replag the toolserver used to have?
[13:37:08] I currently see things down each and every week. It's most often something different, but down all the same.
[13:37:47] krd: toolserver has a lot of issues too
[13:38:25] plus the toolserver is being shut down in ~6 months
[13:38:26] I don't want to blame anyone or anything. The fact is: we can't work.
[13:38:50] This should be improved.
[13:38:58] nfs...
[13:43:27] * Cyberpower678 comes in to whine.
[13:44:06] It's xfs issues, not nfs issues
[13:44:20] Heh.
[13:45:05] Betacommand: sure?
[13:45:30] zhuyifei1999: from #wikimedia-operations
[13:46:12] local-yifeibot@tools-dev:~$ touch test
[13:46:13] touch: cannot touch `test': Input/output error
[13:46:44] the box serving the filesystems to you (labs) says there are xfs errors
[13:47:06] local-yifeibot@tools-dev:~$ ls
[13:47:07] ls: cannot open directory .: Stale NFS file handle
[13:47:26] that's because the filesystems are not currently accessible to you
[13:47:42] since the server can't get to them it can't make them available to the labs instances
[13:48:28] maybe
[13:49:09] apergos: also what replaced "Labs NFS cluster pmtpa"?
[13:49:18] well, "Internal error xfs_dir2_data_reada_verify" is the first of the errors we see, and it's repeated a number of times
[13:49:26] that's on our box so
[13:49:33] zhuyifei1999: I would go with what apergos is saying, they are a root
[13:49:36] zhuyifei1999: I have no idea
[13:50:16] the last round of such errors followed up with
[13:50:18] "Corruption detected. Unmount and run xfs_repair"
[13:50:46] (no, we aren't going to do that, or at least I will not blindly do it; we should switch to the secondary, but there are questions about doing that properly, that is why we need coren.)
[13:51:01] "...metadata I/O error: block 0x3c0082bb0 ("xfs_trans_read_buf_map") error 117 numblks 8"
[13:51:15] "...I/O Error Detected. Shutting down filesystem"
[13:51:39] and after that you should expect that all attempts by a lab instance to do anything to data on any of those filesystems will fail
[13:52:12] I'm really sorry for the trouble, just be patient a little while longer
[13:52:36] apergos: dont let them get under your skin
[13:52:50] oh I'm not
[13:53:00] I just feel bad that we can't do anything about it right now
[13:53:55] * Vito repeats himself
[13:53:59] https://tools.wmflabs.org/guc/?user=79.22.0.97 <-- WTF?
[13:54:03] @wakeup Coren
[13:54:21] Vito: labs is temporarily out of order, filesystem issues
[13:54:40] oh I see
[13:54:48] Coren broke the Internet?
:D
[13:54:57] we're waiting for coren to come and save the day ;-)
[13:55:16] well if by 'broke it' you mean 'had the nerve to be asleep when xfs bought the farm again', then yes :-P
[13:55:33] Coren always fixes everything
[13:57:23] well, I'll see if someone is still alive when I'm back!
[13:57:38] hope so!
[13:57:58] hope everyone is alive
[13:58:50] I've got a cat in a box over here, should I check? :-P
[14:00:13] collapsing its wave function will break the Internet even more!
[14:00:21] you'd better use some glue
[14:00:44] but also a broom in order to catch all the fallen pieces
[14:01:54] I can't log in via ssh anymore. Where do I have to ask for help? Am I right here?
[14:02:03] lustiger_seth: here
[14:02:17] apergos: another ^^^
[14:03:24] I use "ssh -A -i {mykey} seth@tools-login.wmflabs.org", which worked a few times when I tried it a few weeks ago. Is that command still correct?
[14:03:41] xfs issues
[14:04:03] apergos: ^
[14:05:17] zhuyifei1999: was "xfs issues" an answer to me?
[14:05:36] lustiger_seth: apergos knows better
[14:06:55] * Cyberpower678 delivers a foghorn to Core
[14:07:02] *Coren
[14:07:57] * zhuyifei1999 tried to wake Coren up, but failed
[14:08:09] lustiger_seth: labs is out of order for right now, filesystem problems
[14:08:46] ok, then I'll try it again in a few hours. thanks!
[14:08:52] sorry for the trouble
[14:15:04] @wakeup Coren
[14:15:57] sigh… NFS total failure
[14:16:33] well, nfs + xfs total failure
[14:16:40] but either way, broken for right now
[14:17:15] pietrodn: when xfs fails NFS has nothing
[14:17:23] well, Tool Labs is even more unstable than Toolserver. :-(
[14:18:40] tbh once we have the fallback procedures documented, anyone will be able to handle a filesystem outage like this in 15 minutes
[14:18:58] {{agree}}
[14:19:13] +1
[14:27:51] Oh, FFS
[14:28:08] I'm on it.
[14:29:06] Coren, thank you and sorry
[14:29:23] Nah, it's not your time.
[14:29:29] Coren: Thanks :))))
[14:29:56] But if anyone questions my decision to use ext4 for the new server I'm going to yell a lot. :-)
[14:30:04] nope. not me.
[14:30:45] Coren: also what replaced "Labs NFS cluster pmtpa"?
[14:30:54] (in ganglia)
[14:31:02] I admit I didn't even look at that
[14:31:13] Nothing yet; it's going to be Labs NFS cluster eqiad.
[14:31:21] ah well there we go
[14:31:45] I'll ask questions once we're back on the air
[14:31:52] Let's hope that thing can have a proper xfs_repair done.
[14:32:03] ah you're not just going to switch
[14:32:04] ok
[14:32:36] apergos: Little point in switching; it's the same files from both servers. The filesystem itself is ill.
[14:32:51] but
[14:33:02] don't see that from labstore3
[14:33:11] labstor4.
[14:33:13] and I just said I wouldn't ask til we're back on the air, scuse me, I'll ask then
[14:33:17] labstore4
[14:33:28] Nah, you can ask. Talking won't slow down the reboot and repair.
[14:33:44] labstore4 shows the errors, labstore3 which should... have the same disks attached?
[14:34:08] apergos: Shared shelves.
[14:34:11] doesn't say anything about xfs having an issue
[14:34:41] apergos: That's normal: only one of them can (try to) have the filesystem mounted.
[14:34:42] and while I thought I remembered that we didn't mount things on the store that's not active,
[14:35:02] then when you look at mount on both of them there are several thousand mounts
[14:35:15] so at that point...
[14:35:33] apergos: Oh yeah, the bind mounts are made even if the real filesystem isn't underneath.
[14:36:04] and labstore3 had *more* mounts than labstore4 (which is probably because labstore4 unmounted all the failed crap)...
[14:36:10] now it is making sense in retrospect
[14:36:26] apergos: (Part of) the problem is that this is still the same filesystem that suffered the hardware failures. I think the xfs never completely recovered.
[14:36:44] no xfs repair ever done on it? but I thought...
[14:36:49] * Coren fumes at how long those servers take the report.
[14:37:08] takes to reboot*
[14:37:15] autocorrect? :-D
[14:37:35] Coren: http://ganglia.wmflabs.org/latest/stacked.php?m=load_one&c=tools&r=hour&st=1387809354&host_regex=
[14:37:59] zhuyifei1999: Normal consequence of the filesystem going away.
[14:38:05] well hmm so we are spofed no matter what with the current setup
[14:38:12] Things pile up while waiting for it to return.
[14:38:31] apergos: No; the move to eqiad will be a (long) file-to-file copy to a new filesystem.
[14:39:31] * Coren considers doing this now.
[14:39:39] I mean the copy; not the move.
[14:40:36] yes, let's move labs during the break... :-D
[14:41:17] rsync is it?
[14:41:47] I can't change the NFS server; the latency between eqiad and pmtpa would kill us.
[14:42:04] But I can create an ext4 on those disks and copy everything from one to the other.
[14:42:19] why not
[14:42:33] apergos: It will likely take a couple of hours to do so.
[14:42:34] as long as you don't use every bit of bandwidth between dcs :-P
[14:43:17] I would have preferred to not keep labs down that long if I could avoid it. I'll see how the xfs_repair fares.
[14:43:23] right
[14:45:33] Coren: (14:44:26) RECOVERY - Disk space on labstore4 is OK: DISK OK
[14:45:53] xfs_repair in progress
[14:46:57] The good news is that it managed to replay its journal.
[14:47:12] that's a big win
[14:47:21] * apergos waits for the other shoe to drop
[14:47:36] Coren: no bad news please?
[14:48:42] http://www.worldwidewords.org/qa/qa-wai1.htm
[14:49:55] Doesn't look like it's likely much worse than a couple things ending in lost+found will be the result.
[14:50:08] Too many negatives.
[14:51:22] xfs_repair on phase 7
[14:51:52] ok
[14:52:10] Huh. Perhaps lady luck is on our side today.
[14:52:17] :)
[14:52:51] tools is down: http://ganglia.wmflabs.org/latest/?c=tools
[14:54:05] danilo: yes, being worked on right this second
[14:54:40] * Coren restarts nfs
[14:54:42] ah, ok
[14:55:38] Coren: thanks :)
[14:56:14] No longer throwing 403. Now it just hangs.
[14:56:15] NFSD: starting 90-second grace period
[14:57:00] Cyberpower678: 1460s ago
[14:57:11] Heh
[14:57:30] hanging
[14:57:33] local-yifeibot@tools-dev:~$ ls
[14:57:35] ^C^Z^C
[14:58:08] Guy, you're not going to make the exponential backoff any faster by poking it. It'll take several minutes before the clients recover.
[14:58:11] Bugs*
[14:58:14] Guys*
[14:58:33] 'bugs' would also be to-the-point in this context ;-)
[14:58:50] BUGS
[14:58:57] * Coren sees NFS traffic picking up gradually.
[14:59:05] Coren, what's up doc?
[14:59:11] so nothing in lost+found?
[14:59:25] * Cyberpower678 is munching on a carrot
[14:59:43] * apergos steals the carrot and makes soup
[14:59:49] apergos: Haven't inspected it yet, but xfs_repair was happy enough.
[14:59:53] Load is decreasing! :D http://ganglia.wmflabs.org/latest/stacked.php?m=load_one&c=tools&r=hour&st=1387809354&host_regex=
[15:00:33] I'll probably have to reboot all of labs.
[15:00:34] :)))))
[15:00:36] "(yeay!)"
[15:00:50] cannot access /home: Stale NFS file handle
[15:01:05] local-yifeibot@tools-dev:~$ ls
[15:01:06] ^C^Z^Cls: cannot open directory .: Stale NFS file handle
[15:01:07] local-yifeibot@tools-dev:~$
[15:01:23] Coren, I find it funny that virtually every exec node on tools is overloaded but cyberbot's
[15:01:45] Cyberpower678: Depends on whether and how much your stuff is trying to write to disk.
[15:01:56] and webgrid
[15:03:00] Yeah; I'm pretty sure we'll not be able to recover without a reboot of tools; the problem wasn't that NFS went away but that the actual filesystem did for a while.
[15:03:11] ouch yeah
[15:03:36] * Coren makes a test first.
[15:04:03] webserver still 403
[15:04:52] Coren, and we're back to a 403
[15:05:48] no 403, now hanging
[15:06:01] pietrodn: Do you think you're helping right now?
[15:06:14] pietrodn, going back and forth
[15:06:30] sorry Coren
[15:06:38] * Cyberpower678 has his bot bombard tools. >:D
[15:06:56] I assume you have a script to reboot all instances? if not, point me at some, happy to pitch in
[15:08:24] Coren, is it normal for tools-webproxy to constantly go up and down? It's down again
[15:08:25] apergos: I do. I'm about to launch it but I'm just making sure that the /other/ reason I wanted a reboot eventually is also fixed (the pagecount mounts)
[15:09:18] ok
[15:09:52] Coren, I mean while the system is recovering.
[15:10:14] Coren: nfs working :)
[15:10:17] local-yifeibot@tools-dev:~$ ls
[15:10:19] access.log
[15:10:29] [bla bla bla]
[15:10:47] Yay it's back online
[15:10:51] login ok and webserver up
[15:10:53] I *might* not need a reboot after all.
[15:10:58] rilly!
[15:11:06] (Which is a shame for /pagecounts but hey)
[15:11:10] :-D
[15:11:12] Coren: Reboot anyway!
[15:11:14] no it's great
[15:11:23] take a scheduled downtime some other time
[15:11:32] copy all yer data
[15:11:36] do the pagecounts stuff
[15:11:37] etc
[15:11:48] Which is why I said "A shame for pagecounts" and not "about to reboot anyways". :-)
[15:11:52] yup
[15:12:06] so in the end I could have just tried the repair myself
[15:12:07] meh
[15:12:20] but if I had, it would have barfed all over the place and you would have been left to pick up the pieces
[15:12:23] cause that's how it works...
[15:12:43] apergos: Yeah, the only thing I did was reboot, xfs_repair, mount, see that things were okay, restart nfs.
[15:12:56] ah so
[15:13:10] For once, randomness was on our side.
[15:13:12] a page on wikitech, I will nag towards the end of the week
[15:13:40] that says 'if nfs breaks, try these: 1) if it's actually an xfs error... 2) if not... switch to the other labstore by...'
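[Editor's note: the recovery sequence Coren describes above ("reboot, xfs_repair, mount, see that things were okay, restart nfs") can be sketched as a shell procedure. This is a hedged sketch only: the device path, mount point, and NFS service name are illustrative assumptions, not the actual labstore configuration, and these commands are destructive.]

```shell
# Sketch of the XFS/NFS recovery described in this log; run on the
# NFS server (labstore4 in this outage), never on the clients.
# /dev/mapper/store and /srv/project are hypothetical names.

reboot                                # clear the wedged kernel state

# After the box is back: repair the filesystem. xfs_repair first tries
# to replay the journal (which succeeded here: "a big win"), then
# walks its repair phases; inspect lost+found afterwards.
umount /srv/project 2>/dev/null
xfs_repair /dev/mapper/store

mount /dev/mapper/store /srv/project  # remount the repaired filesystem
service nfs-kernel-server restart     # re-export; NFSD announces a
                                      # 90-second grace period and the
                                      # clients back off exponentially,
                                      # so full recovery takes minutes
```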
[15:13:56] would be nice so we don't page you
[15:13:58] I'll aim to have that for you as an xmas present. :-)
[15:14:04] sweet! :-)
[15:14:40] Coren, you should install an AI management system.
[15:14:41] it might also include things like 'don't be fooled by the mounts because...'
[15:14:43] :-D
[15:15:09] Coren: is it safe to start restarting my bots
[15:15:17] Coren, and upgrade labs to a neural net infrastructure.
[15:15:25] apergos: The simpler notice will just read "Ignore the thousands of mounts in /exp/; they are just bind mounts"
[15:15:42] worksforme
[15:15:46] Betacommand: No reason not to. Nothing is going to be rebooted after all.
[15:16:07] mhoover: I'm counting on you to flee pmtpa as soon as reasonably possible. :-)
[15:16:32] Coren: very soonly :)
[15:18:41] flee! run for your lives!
[15:18:57] Coren: and if you need help with whatever happened here, let me know what
[15:19:18] Coren: stale nfs handles are always a ginormous pita
[15:19:34] oh, and if that script 'reboot all instances' isn't documented somewhere obvious, that would be nice too :-D
[15:19:36] mhoover: Nah, that's not actually even (directly) labs related; it's just that the NFS setup in eqiad is about 3 orders of magnitude better. :-)
[15:20:03] Coren: heheh ok :)
[15:20:58] apergos: It's a script in my own home, not a generally available one. eqiad labs will feature salt instead.
[15:21:31] Well, also in root's home on -login. "onall "
[15:23:06] I use it mostly to force puppet runs.
[15:23:44] well until we're in eqiad...
[15:23:48] * apergos makes puppy eyes
[15:24:10] it's totally ok to be on wikitech in case someone else has to do this
[15:24:29] oohhh salt in eqiad? nice
[15:25:25] Coren: when you have a sec I need a hand finding out why my script wont restart
[15:33:06] Coren: Something odd: When I submit a job that just does "/usr/bin/perl -we 'sleep 60;'" on tools-exec-08 (job 1954795), it reports peak vmem 17.1M. On tools-exec-09 (job 1954796), it reports 40.5M.
I earlier had job 1954712 apparently die due to ulimit ("libgcc_s.so.1 must be installed for pthread_cancel to work") on -09, but it runs fine on the other nodes. Any idea what's up with that?
[15:33:57] looks like things are getting back to normal here, back to my usual channels, happy trails
[15:36:48] Coren: stuff is still acting borked
[15:37:07] Betacommand: define "acting borked"?
[15:37:36] anomie: Hm. Not offhand, but that /is/ odd.
[15:37:40] Internal error on http connect, taking forever on https
[15:37:57] * Coren checks.
[15:38:02] Getting 502's
[15:38:36] Betacommand: Can you give me a broken url to test? Those I'm looking at now all seem okay.
[15:39:07] http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag
[15:40:31] Betacommand: Looks like one of the webservers didn't recover.
[15:40:36] * Coren goes to poke it.
[15:40:57] Ah, it's OOM
[15:43:45] Betacommand: Still very loaded, but it looks like there's a bazillion things trying to catch up to the outage all at once.
[15:44:01] Betacommand: BTW, you should look into switching to the lighttpd setup -- much more resilient to load spikes.
[15:44:20] Also much faster, and gives you better (complete) error logs.
[15:44:24] Coren: ...
[15:45:01] !newweb
[15:45:01] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help/NewWeb
[15:45:23] ^^ Unless you have .htaccess files with stuff in them, it's as simple as typing "webservice start" :-)
[15:46:35] (If you do, you'll need to configure .lighttpd.conf a bit to replicate the functionality first)
[15:52:50] Coren: im getting a 404 after starting the webservice http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag
[15:54:58] Betacommand: Ah. Forgot that -- no magic /cgi-bin/ by default. Lemme add the config stanza for it to the doc.
[15:55:22] Coren: http://tools.wmflabs.org/betacommand-dev/
[15:55:26] also 404
[15:55:58] Betacommand: You don't have an index.something in there.
[15:56:15] So it's not as simple as you said :P
[15:57:02] Apache also doesn't have directory listing turned on by default. So same difference. But I did forget cgi-bin. Gimme a sec.
[15:59:46] Betacommand: Just added the cgi-bin stanza to the doc.
[16:00:00] Betacommand: But you will have to put your cgi-bin under your public_html
[16:01:22] * Betacommand grumbles about a borked system
[16:01:33] apache should work fine
[16:04:46] Coren: can you take a look at why job 1955029 died?
[16:06:06] Coren: I think I figured out the difference in memory usage on -09. On -08, the job has no LANG environment variable. On -09, the job has LANG=en_US.UTF-8. If I pass "-v LANG=en_US.UTF-8" to qsub, it uses 40M on -08 as well.
[16:10:11] anomie: Ooo. Nice catch.
[16:10:43] -09 is the newest of them all, and uses a newer image. I'm guessing there were some minor changes introduced in the update, and the existence of a default locale might well be it.
[16:11:05] Part of the reason why I'm going to switch to ensure => latest in eqiad.
[16:11:34] Coren: can you take a look at why job 1955029 died?
[16:13:01] Betacommand: sigkill. Certainly OOM. How much mem do you start it with normally?
[16:13:40] It got to maxvmem 596.527M
[16:14:50] (You can get that info yourself with 'qacct -j ' BTW)
[16:17:26] Coren: Speaking of qacct, what's the delay between the job ending and when qacct will actually work?
[16:18:22] anomie: There are two; gridengine keeps it in qstat for ~1m first, then there is an avg 2min delay before it ends up visible from -login
[16:19:19] Average 2min, so max about 4min?
[16:19:32] Should be about right, yeah.
[16:20:00] did beta labs just die?
[42960ed3] /wiki/Main_Page Exception from line 819 of /data/project/apache/common-local/php-master/includes/db/Database.php: DatabaseBase::factory no viable database extension found for type 'mysql' from http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page after logging in
[16:22:30] Coren: ive set it to 600m for the last 3 months without issue
[16:22:45] 600m or 600M?
[16:22:58] 600M
[16:23:37] Betacommand: Apparently it just managed to tickle that limit by bytes.
[16:23:57] Since the last sample was a 596.527 it was very very close.
[16:24:16] I'd add a couple M for safety.
[16:26:03] 600MMMM
[16:26:31] * Coren groans.
[16:26:37] :D
[16:26:47] Coren: you walked right into that
[16:27:16] "I just tried 600MMMM but it didn't work!?"
[16:27:56] Coren: any eta on fixing the apache servers?
[16:28:34] Betacommand: They're fine now, just overloaded by the stuff that piled up while the filesystem was broken.
[16:29:40] Coren: do you know if anyone has futzed with beta labs just now? within the last few minutes a login attempt yields a stack trace from Special:CentralLogin [5d2acc98] /wiki/Special:CentralLogin/complete?token=ef27cdb0cb4d14b3562a6dc92b289524 Exception from line 819 of /data/project/apache/common-local/php-master/includes/db/Database.php: DatabaseBase::factory no viable database extension found for type 'mysql'
[16:30:25] chrismcmahon: I heard nothing about anyone playing on it.
[16:30:33] chrismcmahon: Which doesn't mean nobody is.
[16:30:54] it's fine until you try to log in :-)
[16:32:31] Coren: svwiki replag is high
[16:32:41] 9:24:02
[16:35:57] Betacommand: Long query running. 1708183s. Boom it goes.
[16:38:36] Betacommand: Should start catching up shortly.
[16:38:57] Coren: thanks. Im about to give my replag tool a facelift
[16:39:15] http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag is ugly
[16:41:53] maybe also move it to a new project called 'replag'?
[16:45:55] valhallasw: new project for a 30-line script?
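[Editor's note: the gridengine commands that come up in the exchange above can be pulled together into a short cheat-sheet. This is a hedged sketch: the job IDs and memory values are the examples from the log, and Tool Labs' jsub wrapper is assumed alongside stock qsub/qacct; the tool path in the last line is hypothetical.]

```shell
# Post-mortem on a finished job: qacct shows maxvmem, exit_status and
# the "failed" reason (here job 1955029 hit its 600M limit at maxvmem
# 596.527M and was SIGKILLed). Note the delay Coren mentions: the job
# stays in qstat for ~1m, then takes ~2min (avg) more to reach qacct,
# so expect up to roughly 4 minutes after the job ends.
qacct -j 1955029

# anomie's locale finding: the newer -09 image exports a default LANG,
# which costs ~23M of vmem. Forcing it makes nodes comparable:
qsub -b y -v LANG=en_US.UTF-8 /usr/bin/perl -we 'sleep 60;'

# The fix for the OOM kill: raise the limit a couple of M above the
# observed peak (jsub syntax as used in the log; path is hypothetical).
jsub -mem 610M "$HOME/mytool/run.sh"
```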
[16:46:59] Yes, because it's then in a clear location, and it's easier to share development with other authors
[16:47:13] otherwise we will again end up with 20 different /.../.../.../replag's
[16:51:46] +1
[16:54:26] Can we have a tool to tell us how long the replag tool will take to respond?
[16:56:21] xD
[17:04:34] Answer: Never
[17:04:35] : global name 'databases' is not defined
[17:07:22] Reedy: im working on it
[17:14:30] Coren: why are the webservers taking 20x longer to execute a script?
[17:14:52] ... than?
[17:14:58] Normal
[17:17:05] Because they're still overloaded and you're still relying on the shared apaches. Anything using the lighttpd per-tool setup is just as fast as usual.
[17:18:15] Coren: why not deploy more apaches then?
[17:18:47] Because the setup with apache is deprecated and will go away anyways; I'm not going to add more of them.
[17:23:46] * Betacommand grumbles about yet another pain in the ass since moving from toolserver
[17:27:25] !log wikimania-support Updated scholarships-alpha to e7a6ce7
[17:27:27] Logged the message, Master
[17:37:27] Coren: are there tools for converting .htaccess to whatever the new format is?
[17:40:21] none that work really that well, especially for complex htaccess files
[17:41:32] Betacommand: why do you use xx-rewrite rules?
[17:44:56] Coren, sorry for late arrival this morning :( Can I do anything?
[17:45:01] Is everything fixed already?
[17:45:25] andrewbogott: It's back up, but a signal that we really want to move to eqiad.
[17:45:50] I haven't read all the backscroll… what was tampa-specific about this failure?
[17:45:54] Betacommand: No, but if you only use options and cgi-bin it's a simple matter.
[17:46:13] andrewbogott: The same XFS filesystem we've been patching with duct tape and baling wire.
[17:46:33] labstore100[12] -> ext4 ftw
[17:46:38] Are you not using XFS in...
[17:46:40] oh, oh :)
[17:46:47] Steinsplitter: ??
[17:47:19] I doubt it's just XFS at fault, fwiw
[17:47:37] paravoid: Probably not, it's obviously the interaction between xfs and something else.
[17:48:03] yeah
[17:48:51] paravoid: Also possibly a regression in more recent kernels w/ xfs (I have to use a recent kernel because of a bug in the raid controller driver in older ones)
[17:49:15] oops :)
[17:49:19] * andrewbogott goes back to wondering why we leak an LDAP host record every time an instance is deleted
[17:51:23] * Coren loves messing with storage *sooo* much. Because there are so very few subsystems that can go wrong in subtle ways.
[17:51:25] Betacommand: well... permission denied for the cgi-folder.....
[17:52:21] Steinsplitter: thats default
[17:52:36] Coren: Btw - in regards to failure. If a continuous job gets stopped because the world ends, then I submit another job because I detected it has stopped... will the grid start another job to replace the one that got stopped in the first place?
[17:52:47] This would explain my weird bot killing itself with duplicate processes issue
[17:52:51] Betacommand: oh O_O
[17:53:17] Steinsplitter: cgi-bin should never be world readable
[17:53:29] Damianz: That can happen, yes, depending on exactly how you restart it.
[17:53:49] I'm basically taking qstat, grepping if it's running and if not jsub'ing it as a cronjob
[17:54:48] I should just go fix the cases where it exits 0 and remove that cronjob... but that's not really so straightforward without making it more hacky. Since that's the only reason it exists
[17:54:54] Damianz: That's ripe for a race because of the possibly long interval between the qstat and the moment you start it anew. If you add a -once to the jsub, you'll get a better result.
[17:56:48] Hmm, I do actually have -once in there.
Specifically: /usr/bin/qstat 2>&1 | /bin/grep -q cbng_core || /usr/local/bin/jsub -once -continuous -e $HOME/logs/cbng_core.err -o $HOME/logs/cbng_core.out -N cbng_core -mem 2G $HOME/cluebotng/run_core.sh &> /dev/null
[17:57:08] Hm.
[17:57:25] Though I really should change that, because it also has the side effect of '524G logs'
[17:58:10] Coren: when you have a minute I need help converting to the web system
[17:58:18] The problem with this is that there is a delay between when a job is queued and when it shows up in qstat; if you add a sleep after the || then the -once should catch the race condition.
[17:58:20] Wikitext takes up rather a lot of space... and a lot of whitespace at that.
[17:58:28] Betacommand: I can help in about ~10m
[17:59:11] it's kind of annoying that it's 2014 and there's no distributed self-healing storage that is stable and works
[17:59:49] paravoid: Storage? Use the cloud, all that nosql stuff!
[18:20:49] Betacommand: I'm all yours.
[18:22:02] Coren: http://pastebin.com/GuqysvCZ is my current .htaccess
[18:24:00] and I need to convert that to whatever the new format is
[18:26:18] Betacommand: I just have; this should be functionally identical with the addition of a symlink in public_html (I've added it for you)
[18:26:48] and the cgi-bin also
[18:26:51] ?
[18:27:01] Well, directory indices will probably not look quite the same, but close enough.
[18:27:22] yes, I've added the directive to cgi and made a symlink to the same place your ~/cgi-bin pointed to
[18:28:02] and Im getting 404's
[18:28:33] Or not. It seems to not have accepted the cgi-bin bit.
[18:28:46] its 404 for http://tools.wmflabs.org/betacommand-dev/
[18:28:59] Ah, I think I need to say "everywhere" explicitly.
[18:29:04] * Coren tests.
[18:29:33] Coren: Is there some time to look into 2 cpan installation requests?
[18:30:10] !log tools restart grrrit-wm for subbu
[18:30:13] Logged the message, Master
[18:31:54] Coren: any updates?
[18:32:06] Betacommand: Tweaking it now.
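[Editor's note: Coren's suggested fix above (a sleep after the ||, then relying on jsub -once to refuse a duplicate) might look like the following rework of the cron one-liner quoted earlier. The 30-second sleep is an assumption; the paths and job name are the ones from the log.]

```shell
# Guarded restart for a continuous grid job. A freshly queued job
# takes a moment to appear in qstat, so sleep before resubmitting and
# let jsub -once reject a duplicate if one was queued in the meantime.
/usr/bin/qstat 2>&1 | /bin/grep -q cbng_core || {
    sleep 30    # hypothetical value; long enough for qstat to catch up
    /usr/local/bin/jsub -once -continuous \
        -e "$HOME/logs/cbng_core.err" -o "$HOME/logs/cbng_core.out" \
        -N cbng_core -mem 2G "$HOME/cluebotng/run_core.sh"
} > /dev/null 2>&1
```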
Give me a couple minutes.
[18:32:37] Betacommand: Indices work; give me a minute for the cgi
[18:33:04] apropos cgi …
[18:33:17] thanks, make sure that cgi-bin files cannot be read
[18:33:22] Watching .NET framework install vs getting wet on the way home... think I'll get wet walking home. back later, probably
[18:33:28] only executed
[18:33:40] i have to attach some fcgi test cases for you to the ticket that might be more minimal
[18:34:25] giftpflanze: Always useful.
[18:34:51] mh, where is my bug? :(
[18:36:41] phew, found it :)
[18:40:15] Betacommand: Ah, perhaps I shouldn't use a lighttpd 1.5 config directive since we have 1.4
[18:40:19] * Coren facepalms.
[18:43:46] Betacommand, Coren: maybe it helps: in http://tools.wmflabs.org/ptwikis/ we use this .htaccess: http://pastebin.com/mtpcFUj7
[18:44:11] danilo_: .htaccess isnt the issue
[18:44:39] danilo_: Im trying to move to the new web, since the apaches are screwed and not getting fixed
[18:44:54] new web doesnt use .htaccess
[18:49:37] * Coren is a moron.
[18:50:05] So, I spent all that time trying to figure out why the hell it can't match the "^/betacommand-bot/cgi-bin/" regex
[18:50:29] Coren: lol
[18:51:22] Betacommand: You're all lighttpd'ed up.
[18:52:01] Betacommand: If you take a look at your tool's .lighttpd.conf, you'll see you can adjust the regexes if you want to only do directory listings in some places, etc.
[18:53:01] Coren: is it possible to hide the cgi-bin from listings and prevent it from listing files there?
[18:53:38] Right now it's listing when matching "/$"; you can set whichever regex you want there instead.
[18:54:29] Alternately, you can add a match after it for /cgi-bin/ and set dir-listing.activate to "disable" for that one.
[18:55:32] The syntax in .lighttpd.conf should be transparent.
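[Editor's note: what Coren describes above (a /cgi-bin/ match with dir-listing.activate set to "disable", with the scripts executed rather than served) might look like the following .lighttpd.conf fragment in lighttpd 1.4 syntax. This is an illustrative sketch, not the actual Tool Labs stanza; the tool name is the one from the log.]

```
# Sketch of a per-tool .lighttpd.conf fragment (lighttpd 1.4 syntax).
# Run files under cgi-bin as CGI instead of serving their source,
# and disable directory listings there; listings stay on elsewhere.
$HTTP["url"] =~ "^/betacommand-dev/cgi-bin/" {
    cgi.assign = ( "" => "" )          # execute everything via shebang
    dir-listing.activate = "disable"
}
```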
[18:56:08] Coren: that would be a good option, Im trying to wade into those docs but it will take me a long time to figure out the new format
[18:56:47] * Coren restarts your webservice with that turned on.
[18:57:16] {{done}}
[19:10:10] gah, beta is still hosed. /me goes spelunking in the logs