[00:05:44] (CR) Jorm: "This is a standalone set of pages so I wanted to keep it self-contained. I'm fine with pointing to a cached version, but I think we get i" [labs/toollabs] - https://gerrit.wikimedia.org/r/103606 (owner: Jorm)
[00:51:08] Is Tools slow or is it just me?
[00:51:37] slower than usual
[00:54:37] Why would I get a Proxy Error?
[01:00:33] I'm just trying to load http://tools.wmflabs.org/pirsquared/ but it's very slow ... IDK why
[01:08:37] hrm. once i get that commit +2'd, i can go about making changes to a couple other pages.
[01:08:45] i don't want to get things all mucked up with the single commit.
[01:43:21] jorm: Was that a hint?
[01:43:41] (CR) coren: [C: 2] "Good 'nuf for me." [labs/toollabs] - https://gerrit.wikimedia.org/r/103606 (owner: Jorm)
[01:43:57] (CR) coren: [V: 2] "Good 'nuf for me, and tested as working." [labs/toollabs] - https://gerrit.wikimedia.org/r/103606 (owner: Jorm)
[02:11:32] * Coren tries to figure out what he broke.
[03:12:26] jorm: A point of wonder; you allow sorting by maintainer, but that is not 1:1 so you're effectively sorting by /first/ maintainer alphabetically. That doesn't seem all that useful.
[03:12:40] I.e.: it can't help finding "all tools maintained by X"
[04:28:26] hello where can I view a detailed list of the current hardware and software in use for wikimedia or wikipedia?
[04:54:51] ttyS1: You mean globally?
[04:55:23] ttyS1: There's a lot of it, spread in a large number of subsystems. I don't think it's put in any one consolidated place at this time.
[06:04:40] (CR) Tim Landscheidt: "So that's the difference between a programmer's and a designer's output :-). Very neat indeed." [labs/toollabs] - https://gerrit.wikimedia.org/r/103606 (owner: Jorm)
[06:24:04] Coren: no, no hints; I just saw a bunch of stuff I wanted to fix but couldn't until I had sync.
[06:24:22] Also, great comment on that sort. I'll have to figure out something on that.
[07:07:49] (PS4) Tim Landscheidt: Simple tool to simplify using the backup snapshots [labs/toollabs] - https://gerrit.wikimedia.org/r/76313 (owner: Platonides)
[07:09:33] (CR) Tim Landscheidt: [C: -1] "Man page warnings as described above." [labs/toollabs] - https://gerrit.wikimedia.org/r/76313 (owner: Platonides)
[07:25:57] (PS1) Tim Landscheidt: Use absolute paths especially in error pages [labs/toollabs] - https://gerrit.wikimedia.org/r/103641
[07:26:52] (CR) Tim Landscheidt: "As part of testing, I already updated /data/project/.system/public_html/index.php." [labs/toollabs] - https://gerrit.wikimedia.org/r/103641 (owner: Tim Landscheidt)
[07:49:21] I like the new main page :D
[08:01:14] legoktm: what main page?
[08:01:20] tools.wmflabs.org
[08:03:25] yes ... it can do with more descriptions but that will grow :)
[08:04:28] descriptions have to be created by the tool owner, by putting a certain file in their home directory...might be worth advertising that feature a bit more
[08:05:10] I am going to blog about the gift of a fix ... first
[08:06:00] maybe I will later ... PS ... there are tools by you that can do with a description :) low for instance
[08:10:01] oh, hmmm.
[08:10:05] Yeah, I should do that
[08:10:20] sorry
[08:10:29] :)
[08:24:40] http://ultimategerardm.blogspot.nl/2013/12/reasons-to-be-cheerful.html
[13:36:47] something in the grid engine seems to allocate too much memory after job execution, in my logs there is "libgcc_s.so.1 must be installed for pthread_cancel to work" which is a sure sign of memory exhaustion since dec 19th and in qacct it describes normal maxvmem values with "failed 100 : assumedly after job"
[13:37:01] it isn't really bad, just annoying
[13:37:25] it looks like an indication of error
[14:17:42] giftpflanze: You got it backwards. That error means that your job /specifically/ busted its memory limit.
[14:17:57] really?
[14:17:58] Not that there wasn't memory for it. The grid never overcommits.
:-)
[14:18:15] what? i don't get it
[14:18:38] the default memory limit is 256m
[14:18:42] Right.
[14:18:44] it used about half
[14:18:51] according to qacct
[14:19:01] qacct is sampled.
[14:19:11] (every 5s iirc)
[14:19:24] hm
[14:19:27] Was the return value 137?
[14:19:32] 100
[14:19:40] see above
[14:19:56] iirc that's the return value
[14:19:57] No, not that line. The one labeled exit_status
[14:20:06] um, let's see
[14:22:03] exit status is 134
[14:22:33] Ah! Then it wasn't memory. 134 is SIGBUS
[14:23:00] oh, what was that again?
[14:23:59] Segmentation fault.
[14:24:18] a-ha
[14:24:53] exit_status > 128 = killed. You can see what signal by doing a kill -l; the status is 128+signal number
[14:25:24] i sort of remember that
[14:26:13] isn't it sigabrt then?
[14:26:31] If you invoke your executable in a wrapper script, you can allow dumping core with 'ulimit -c unlimited' which will allow you to get a stack trace.
[14:26:43] Math fail!
[14:26:47] Yes, SIGABRT
[14:27:25] and what does that tell us?
[14:27:26] Most likely cause of sigabrt is a failed assert()
[14:27:36] Hm
[14:28:12] Libraries sometimes put asserts() for consistency; I know glibc uses it when you double free memory for instance.
[14:28:39] What is your code written in?
[14:28:42] tcl
[14:29:21] I don't know that tcl uses assert() itself; but likely a library you're using does.
[14:29:36] ok
[14:29:59] Allow core dumps, then you can trivially use gdb to figure out what caused it.
[14:30:00] but why does it happen all of a sudden?
[14:30:07] i will
[14:30:21] although that sounds a bit scary ;)
[14:31:05] That's hard to say unless we know where the assert comes from. It may be data-driven (i.e.: happens for certain inputs, like a malformed XML file).
[14:32:04] It's bad practice to error out at runtime with an assert() unless you're compiling debugging code in, but that doesn't mean that programmers never do it.
[14:34:56] coren: why ulimit -c unlimited? how many core files are possible?
more than 1?
[14:35:22] It's the limit on the /size/ of the corefile. That defaults to 0.
[14:35:39] oh, i'm sort of stupid ;) ty
[14:36:02] Not stupid; you just didn't know it. :-)
[14:36:47] i didn't read the man page properly
[14:38:12] Coren: do you have any idea what the /data/project/giftbot/.rnd file is?
[14:39:16] No, but if it's a smallish file I'd venture the guess that it's a random seed of some sort for some cryptographic library. It's not put there by the system at least.
[14:41:00] hm
[14:42:40] Merry Christmas
[14:48:10] Coren: apparently there are core files being created
[14:48:20] even without that ulimit wrapping
[14:49:01] Ah, it's entirely possible the gridengine turns it on. That's a good thing, since it means you can look at it now. :-)
[14:49:06] is it safe to use gdb on -login?
[14:49:12] giftpflanze: Yeah.
[14:49:45] "/data/project/giftbot/core": not in executable format: File format not recognized
[14:49:48] oh great
[14:51:17] oh, wrong invocation
[14:51:49] You have to 'gdb core'
[14:52:26] gdb tclsh script.tcl core should also work?
[14:52:45] or just gdb tclsh core?
[14:52:50] the latter.
[14:52:54] ok
[14:53:02] The program already crashed, it's too late to give it arguments. :-)
[14:53:11] mh
[14:58:53] Coren: gdb seems to miss some libraries that were loaded by the script, does that matter?
[14:59:28] Probably not, but that's also a little odd; there should be no libraries in the exec environment that aren't on -login.
[14:59:42] Well, unless tcl loads stuff by hand which might make it hard to find.
[14:59:42] hmm
[14:59:54] As long as you get a useful backtrace, you're golden.
[15:00:35] so i just type backtrace and i'm done?
[15:06:13] Well, you'll still need to interpret it, but yes.
[15:07:17] this is what i get and it does not seem very helpful to me: http://bpaste.net/show/161660/
[15:08:13] Oy. It comes directly from libtcl.
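[editor's note] The two rules of thumb from the exchange above (an exit_status above 128 means "killed by signal exit_status - 128", and the core file size limit has to be raised before a crash can leave a core) can be sketched in Python. This is only an illustration under assumptions: the function names are made up for this note, nothing here is part of the grid engine or of Tool Labs, and unlike 'ulimit -c unlimited' the sketch raises the soft limit only as far as the hard limit allows.

```python
import resource
import signal
import subprocess


def signal_from_exit_status(status):
    """Return the signal name for a shell-style exit status above 128, else None."""
    if status <= 128:
        return None  # ordinary exit code, not death by signal
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


def allow_core_dumps():
    """Raise the soft core-file size limit to the hard limit for this process."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_with_core_dumps(argv):
    """Run argv with core dumps allowed and return a shell-style exit status."""
    proc = subprocess.run(argv, preexec_fn=allow_core_dumps)
    # subprocess reports death by signal N as -N; a shell reports 128 + N.
    if proc.returncode < 0:
        return 128 - proc.returncode
    return proc.returncode
```

On Linux, signal_from_exit_status(134) gives SIGABRT and signal_from_exit_status(137) gives SIGKILL, which matches the correction made in the conversation. A hypothetical call such as run_with_core_dumps(["tclsh", "script.tcl"]) would play the role of the wrapper script mentioned above.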
[15:09:13] Well, strictly speaking, it's glibc that did the abort(), during a pthread_exit() -- that normally happens when data structures got corrupted so bad it doesn't know how to go on.
[15:09:28] But the pthread_exit() comes from within libtcl.
[15:11:00] libtcl doesn't have any debugging symbols in it though, so that doesn't help us much except knowing "tcl crashed by giving pthread_exit() bad parameters"
[15:11:43] so, should we add debugging symbols?
[15:11:54] You'd have to rebuild libtcl
[15:12:16] Lemme see if I have the debugging symbols for it in apt.
[15:12:45] No -dbg build. :-(
[15:13:13] Look how your trace starts with a clone(), this happened in a thread.
[15:13:53] Looks like a bug in tcl's threading implementation. Lemme try to google if that's a known issue
[15:14:11] -dev packages are something different?
[15:16:38] No, the -dev only includes static libraries and header files to build /against/ the library.
[15:16:45] Hm.
[15:16:47] ah, ok
[15:17:07] Do you try to cancel threads or do any sort of signal handling in your code?
[15:17:50] i don't have any threads that i know of and probably no signal handling, but let me see …
[15:17:59] tcl does much in the background
[15:18:38] and this is not only one script but as it seems all of them
[15:19:13] except those that have a -mem limit of 1g
[15:19:39] or so
[15:20:12] i can't identify the pattern
[15:20:15] Bleh, so this is happening in whatever threading tcl may do by itself in the background?
[15:20:23] maybe
[15:20:52] although i don't remember that tcl does any threading by itself
[15:21:22] well, the backtrace is clearly off a thread, so unless you're doing so yourself it has got to be the interpreter.
[15:22:18] or a tcl package?
[15:26:44] giftpflanze: That's also possible.
[15:27:00] giftpflanze: Though I'd expect they'd be explicit about using threads.
[15:28:29] i don't know if they are
[15:29:43] I'm afraid I've reached the limits of my tcl-fu there.
:-(
[15:30:47] All we know is that there is a pthread_exit called from a thread running the tcl interpreter that goes boom and abort()s.
[15:33:04] That kind of thing normally happens when you call thread-unsafe code from a thread; and that may even have happened in a /different/ thread than the one which actually crashed. If I had to venture a guess, I'd guess that one of the packages you use relies on a non-threadsafe library.
[15:33:20] yay
[15:33:37] If that's the case, then you'll get essentially random crashes with probability increasing as runtime increases.
[15:34:35] Threads are *hard*. I wish fewer programmers who aren't experts used them. In practice, threading is often slapdashed together in the expectation of "free multi-core".
[15:34:51] mh
[15:35:15] What packages are you using?
[15:39:30] Your fundamental issue might be using 8.6; that one (from the documentation I can find) has threading built in-core by default, but possibly not all packages have been written to be safe.
[15:40:20] *sigh*
[15:40:45] *grep for packages*
[15:42:38] You might want to try using 8.5
[15:42:48] And see if the problem goes away.
[15:43:00] (both are available)
[15:45:31] oh no, all the new cool features … :|
[15:45:53] the problem was absent until a few days ago, so …
[15:46:38] packages are tcllib, mysqltcl, TclCurl
[15:49:16] afaics they are all thread-safe
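[editor's note] The point about thread-unsafe code can be illustrated with a small Python sketch. It is an analogy only, not a reconstruction of what libtcl or any of the packages named above actually do: the Counter class and run_workers helper are invented for this note, and in Python the unsafe variant may lose updates rather than abort the way the C library did here.

```python
import threading


class Counter:
    """Shared state whose update is a read-modify-write, so it needs a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Not thread-safe: another thread can run between the read and the write,
        # in which case one of the two updates is silently lost.
        current = self.value
        self.value = current + 1

    def increment_safe(self):
        # Holding the lock makes the read-modify-write atomic for other threads.
        with self._lock:
            self.value += 1


def run_workers(method_name, threads=4, iterations=10_000):
    """Call the named Counter method from several threads; return the final value."""
    counter = Counter()
    method = getattr(counter, method_name)

    def work():
        for _ in range(iterations):
            method()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value
```

With the defaults, run_workers("increment_safe") always ends at threads * iterations, while run_workers("increment_unsafe") can end anywhere at or below that, and the chance of a discrepancy grows with the amount of work done. That is the same shape as "random crashes with probability increasing as runtime increases".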