[08:36:34] hmm, no scfe_de
[09:00:28] I wonder, if we generate multiple passwords on a VM, if it'll run out of entropy for the random generator
[09:18:22] YuviPanda: I remember having problems with that a couple of years ago on vmware
[09:50:19] multichill: yeah, I'm guessing that might be an issue
[09:50:21] multichill: but should be ok, though - I'm not generating *that* many password
[09:50:23] s
[09:50:25] doing final steps to make mongodb available to all tools :)
[09:50:31] puts a mongo.conf.json in everyone's tool dir
[09:51:45] Would be nice to get rid of all the local passwords with something like Kerberos
[09:53:24] YuviPanda: http://docs.mongodb.org/manual/tutorial/control-access-to-mongodb-with-kerberos-authentication/ :P
[09:53:38] multichill: :D not happening anytime soon, though
[09:54:30] Used to have that in my old university cluster. Log in once, access all services....
[09:55:00] multichill: yeah. mongodb already supports ldap, and we already have ldap...
[09:56:06] With Kerberos you don't need any passwords anymore after the initial login, with ldap you do
[09:57:00] multichill: true, but I wonder how tools will handle it.
[09:57:00] multichill: won't they have to do some form of auth as well?
[09:57:22] multichill: plus that is a 'mongodb enterprise' feature
[09:57:39] That's all part of the system. The server you're connecting from has to be part of the Kerberos setup
[09:58:13] Take a Windows Active Directory domain. It's all Kerberos.
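The per-tool mongo.conf.json mentioned above could be consumed like this. The file name comes from the log; the keys (host, port, user, password, db) are assumptions for illustration, not the real Tool Labs schema:

```python
import json
from pathlib import Path

def load_mongo_conf(tool_home):
    """Read a per-tool MongoDB credentials file.

    The log says a mongo.conf.json is dropped into every tool's
    directory; the exact keys used here are assumed, not taken
    from the actual Tool Labs setup.
    """
    path = Path(tool_home) / "mongo.conf.json"
    return json.loads(path.read_text())
```

A tool would then pass the returned dict straight to its MongoDB client constructor.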
If I'm authenticated on my laptop and it's part of the domain, I can open network shares and access services
[09:58:31] aaah, of course
[09:58:34] That's because my laptop tells the other side I'm Maarten@
[09:58:35] needs support on running host too
[09:58:40] Exactly
[09:58:55] And the running host needs to be trusted (user not root etc)
[09:59:04] right
[09:59:08] Or you get into the really ugly part
[09:59:17] we do something similar, but far more trivial for the nginx proxy
[09:59:29] when a tool makes a request, we use... identd to verify its authenticity
[09:59:46] ouhc
[09:59:48] *ouch
[09:59:52] multichill: yeah
[10:00:06] multichill: but it works for us, since we just need to verify username and users don't have root...
[10:02:53] So in a Windows AD environment it's common to create service accounts with unknown password and very limited rights.
[10:03:10] You just have an application running in that account and it can access other resources without a password
[10:03:28] multichill: right.
[10:03:38] multichill: I wonder how hard kerberos would be to set up in our environment.
[10:03:43] it's not used in prod, so won't be too easy
[10:04:43] I think it's got quite a few hurdles. I haven't seen it used widely on *nix systems after I left university
[10:05:19] But that might be caused too by the fact that I haven't used any big shared user clusters after that ;-)
[10:06:25] YuviPanda: https://en.wikipedia.org/wiki/Kerberos_%28protocol%29#Drawbacks_and_Limitations
[10:08:40] multichill: reading
[10:09:44] multichill: when I first read about Kerberos (years and years ago) it felt... way too complicated
[10:10:22] In M$ Windows it's really easy, you don't even notice it's there
[10:10:36] In *nix it's a bit more work I guess ;-)
[10:10:43] yeah
[10:10:52] identd is fairly simple, though... ;)
[10:11:00] Imagine if mongodb supported identd auth
[10:11:14] And it only makes sense if you have lots of users and lots of machines.
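The identd check mentioned for the nginx proxy boils down to querying the connecting host's ident server (RFC 1413) for the port pair of the connection and parsing its reply. This parser is a simplified sketch of that check, not the actual Tool Labs proxy code:

```python
def parse_ident_reply(line):
    """Parse an RFC 1413 (identd) response line.

    A successful reply looks like:
        "6193, 23 : USERID : UNIX : stjohns"
    Returns the user id on success, or None for ERROR replies.
    Simplified sketch; real identd replies may carry a charset
    suffix on the opsys field, which this ignores.
    """
    parts = [p.strip() for p in line.split(":", 3)]
    if len(parts) == 4 and parts[1] == "USERID":
        return parts[3]
    return None
```

The proxy would then compare the returned user id against the tool account the request claims to come from; this only works because, as noted above, tool users don't have root and so can't fake the identd answer.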
So in a typical office environment it makes perfect sense
[10:12:08] yeah
[10:12:16] I dunno if it's worth the trouble for tools
[10:13:08] I think some people might get a mental breakdown if you mention it, but you could try
[10:13:25] hehe
[10:15:44] But it's always good to keep looking around and see what the possibilities are. The more users/servers/services we get, the more interesting it gets to have some sort of central system
[10:16:09] multichill: indeed. one of the things I want to do is to publicize tools more. It's a wonderful environment that many people are unaware of
[10:16:20] I didn't even know the toolserver existed until I started doing stuff in labs
[10:16:38] Ha, yes, we have a lot of hidden gems
[10:16:58] indeed. lots of people would go 'waaat?!'
[10:17:07] at the replicas themselves, for example
[10:17:26] something as trivial as 'most edited wikipedia articles' does the rounds, and people use the API for that kinda stuff
[10:17:59] * YuviPanda considers making the mongodb announcement email as 'ToolLabs is finally webscale!!1'
[10:18:42] ok, gotta go. cya guys later!
[11:17:47] !log local-heritage Did some hacks with Krinkle to get i18n working(ish) again (api.php and html formatters). Still need to commit it
[11:17:49] Logged the message, Master
[12:21:28] What's the command to see why a job is no longer running?
[12:23:10] a930913: qacct -j
[12:23:21] will show you some basic info, at least (exit code, maxvmem)
[12:25:38] valhallasw: 137? :)
[12:26:11] Out of memory?
[12:26:41] a930913: I think so, yes. Check the maxvmem line to be sure
[12:27:02] (we seriously need better reporting on job kills)
[12:29:12] Hmm, I think 137 just means killed by grid?
[12:29:43] Because I restarted it with more memory to prevent killing again, and it had 137 exit status.
[12:31:53] a930913: I'm not sure if there are other reasons for the job to be killed other than OOM
[12:32:04] a930913: but the simplest way to check is, as mentioned, the maxvmem value
[12:32:15] if that's the same as your mem= value, you're OOM
[12:53:41] valhallasw: Does gerrit-patch-uploader work by email? That would be nice on labs
[12:54:17] multichill: as in: mail a patch? not at the moment, but it should be possible to implement that
[12:55:49] Would be nice. I have some changes in a shared project and I don't have my keys there (don't plan to have them there). If I could just invoke a command that submits the changes to gerrit for review, that would be nice
[13:02:17] oh, that's brilliant indeed
[13:02:32] just git mail-patch gerrit-patch-uploader@wmflabs.org or something like that
[13:03:46] valhallasw: Could any tool parse the kill logs, and email the tool in question?
[13:05:09] a930913: SGE can mail users itself
[13:05:24] you can turn it on with -m e or something like that
[13:05:51] -m a
[13:06:02] although that will also mail on reschedule, not just a full kill
[13:06:27] see man qsub
[13:20:39] i'm getting "Upload failed! Your IP address has been blocked automatically, because it was used by a blocked user " when trying to use one of the tools on wmflabs
[13:21:00] valhallasw: jsub passes those parameters through, doesn't it?
[13:21:11] comets: Dynamic IP?
[13:21:42] yep..
[13:21:45] but i doubt anyone is blocked from editing in my country..
[13:22:06] comets: Could be a proxy run from nearby.
[13:25:21] how to check globally on wiki?
[13:25:25] a930913: yes
[13:25:53] comets: which tool, and when do you get it?
[13:26:00] it could also be the wmflabs ip is blocked for some reason
[13:26:13] croptool, and now..
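The exit status 137 discussed above decodes as 128 + 9, i.e. the process died from SIGKILL, which is what the grid engine delivers on a hard memory-limit kill. A sketch of the interpretation suggested in the chat; the OOM heuristic (maxvmem having reached the requested limit) follows the advice given above and is a rule of thumb, not documented SGE behaviour:

```python
import signal

def explain_exit_status(status, maxvmem_bytes=None, h_vmem_bytes=None):
    """Rough interpretation of a qacct exit_status value.

    Codes above 128 conventionally mean "killed by signal
    (status - 128)"; 137 is therefore SIGKILL. Comparing maxvmem
    against the requested memory limit, as suggested in the chat,
    is how you confirm an OOM kill.
    """
    if status <= 128:
        return "exited normally (code %d)" % status
    sig = status - 128
    msg = "killed by signal %d (%s)" % (sig, signal.Signals(sig).name)
    if sig == signal.SIGKILL and maxvmem_bytes and h_vmem_bytes:
        if maxvmem_bytes >= h_vmem_bytes:
            msg += "; maxvmem reached the memory limit, so almost certainly OOM"
    return msg
```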
[13:26:22] 'when' as in 'where in the process'
[13:26:59] I'm not sure how IP blocks and OAuth work, actually
[13:27:04] when the tool tries to upload using my account name
[13:27:23] because it's the webserver that does the actual upload, so that would be the IP that's reported
[13:27:43] comets: does the error mention the actual IP?
[13:28:07] nope sadly :(
[13:28:40] i got an IP block exempt now for commons but would like to know what caused it ..
[13:28:41] ok..
[13:29:02] no one blocks FIJI :P
[13:29:08] ?
[13:29:46] my country, no need to block anyone here, no one barely edits wiki here :P
[13:29:51] you
[13:29:57] you're missing the point
[13:30:06] it's not *your* IP that has been blocked
[13:30:14] because you wouldn't be able to edit commons otherwise in the first place
[13:33:33] "22:23:14 my IP 10.142.150.249"@ #wikimedia-commons . I thought 10.*.*.* was for internal network
[13:33:59] vodafone mobile..
[13:34:23] yeah, so? that's your vodafone mobile internal IP
[13:34:45] Ah, so not public IP :P
[13:35:40] nonetheless, the 10.* ips *should* be autoblock-exempted
[13:36:08] and possibly blocked from editing without being logged in, but whatever
[13:37:12] anyway. croptool is running on tools-webgrid-02 = 10.68.17.9
[13:38:55] using 10 range ip's since march 2011..
[13:39:40] ?
[13:40:10] how are tool dbs named?
[13:40:11] * YuviPanda checks
[13:40:43] YuviPanda: s1234_...
[13:40:51] where 1234 is the group id
[13:40:58] ah, hmm
[13:41:11] I'm setting up the mongo thing, and giving everyone a db by default
[13:41:21] comets: ok, not sure. tools-webgrid-02 doesn't seem to be blocked on commons
[13:41:21] s/_/__/ no?
[13:41:38] good to know :)
[13:42:34] YuviPanda: What's the benefit of the mongo thing?
[13:42:40] valhallasw: I'm writing it in py3 as well :)
[13:42:46] apparently this will be the first time any py3 thing is running...
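The naming scheme discussed above (an s<gid> prefix, with the "s/_/__/" remark suggesting a double underscore before the tool-chosen suffix) could be sketched as below. This is an illustration of the convention as described in the log, not the actual account-creation code:

```python
def tool_db_name(gid, suffix=""):
    """Build a database name in the s<gid>__<suffix> style.

    The log gives the prefix as "s1234_..." with 1234 the tool's
    group id, and the "s/_/__/" exchange suggests a doubled
    underscore separating prefix and suffix (mirroring the MySQL
    convention, where a bare _ in grants is a wildcard).
    Assumed details, for illustration only.
    """
    prefix = "s%d" % gid
    return "%s__%s" % (prefix, suffix) if suffix else prefix + "__"
```

Using the numeric gid instead of the tool name sidesteps the illegal-character and normalization problems mentioned later in the log.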
[13:43:16] YuviPanda: hey, wikibugs runs py3 and you know it :-p
[13:43:17] a930913: it's a document data store, useful for different things than mysql
[13:43:17] heritage, for example, would've been a great use case.
[13:43:18] a930913: data that's not fully structured
[13:43:24] valhallasw: haha :P I meant from puppet
[13:43:28] ah, right
[13:43:46] YuviPanda: For text search things?
[13:43:59] a930913: depends on how big a search you're looking for :D
[13:44:02] ES is more suited for that
[13:44:11] ES?
[13:44:27] a930913: ElasticSearch
[13:45:02] a930913: it does support full text search as well
[13:45:04] goddamn my fucking slow internet
[13:45:05] * YuviPanda curses
[13:45:31] a930913: http://blog.mongodb.org/post/40513621310/mongodb-text-search-experimental-feature-in-mongodb
[13:45:51] a930913: don't put a wikipedia dump in that, though ;)
[13:46:15] http://docs.mongodb.org/manual/core/index-text/
[13:47:08] a930913: it's also useful as a general data store if you don't want to deal with schemas
[13:47:14] explicit schemas anyway
[13:48:46] a930913: why were you looking for full text?
[13:52:58] a930913: we should try to get replicas of production ElasticSearch nodes on to labs
[13:58:33] a930913: but primary reason, I think, is to just get it out there and see what tool authors do with it. you're an ingenious bunch :D
[13:58:33] we just put Redis out and people built things on it
[14:10:24] YuviPanda: I have a tool that currently greps a load of files, and it's getting slower and slower :p
[14:10:44] a930913: how many files? :)
[14:10:58] It's the BBC subtitles one, so we're only talking small.
[14:11:19] YuviPanda: One file to search per factual episode aired.
[14:11:34] Well, subtitled, factual episode aired.
[14:11:34] a930913: hmm, mongo should be able to deal with that, I think
[14:11:38] a930913: what, 1GB?
[14:11:41] a couple of gigs?
[14:13:03] YuviPanda: Nowhere near.
[14:13:14] a930913: you should be fine using it :)
[14:13:25] How do I see how much memory a directory is using?
[14:13:25] a930913: I haven't enabled text search in our current install, but it's trivial to do so
[14:13:42] a930913: du -d1 -h
[14:15:12] Wow, 100M, surely not?
[14:15:51] a930913: :)
[14:15:55] Hmm, that's ~100kB per episode.
[14:16:53] If we let the average episode be 50 minutes, that's 2kB per minute.
[14:17:16] a930913: we'll get postgres for most users soon too
[14:18:45] YuviPanda: How many kB do you speak per minute?
[14:19:02] a930913: probably not that much ;)
[14:19:07] a930913: although it depends on how you measure it
[14:19:33] And on the language :-).
[14:19:43] If each line spoken had an overhead of say 100 bytes.
[14:19:55] hehe
[14:21:38] BTW, re auth/MongoDB, MySQL and PostgreSQL offer auth via SSL certificates. In a land, far, far away we would just have one cert (auto-/re-generated) per tool that they can then use for auth against all services.
[14:22:01] scfc_de: heh
[14:22:03] far far away indeed
[14:26:26] scfc_de: I'm going to call user accounts / user dbs as 'db_'. thoughts?
[14:28:08] YuviPanda: I've never looked deeper at MongoDB, but given the usernames in MySQL, that sounds sane.
[14:28:24] (Or is it ?)
[14:28:33] scfc_de: I was going to use the toolname, but then realized things about illegal characters and how normalization can cause problems
[14:28:38] scfc_de: gid and uid are the same
[14:28:39] for all tools
[14:30:45] Ah, good to know!
[14:49:28] Tool Labs tools / [other]: enwp10 tool is down - https://bugzilla.wikimedia.org/66565#c3 (Tim Landscheidt) NEW>RESO/FIX a:Tim Landscheidt I've started enwp10's webservice by "webservice start". By error.log, it looks as if the webservice was (involuntarily?) stopped at 2014-06-03 17:03:19Z....
[14:52:36] scfc_de: why don't webservices autorestart again?
[14:56:28] jsub -continuous webservice? :D
[14:56:47] Fix the problem first, then restart?!
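The du -d1 -h check suggested above can also be done from Python, which is handy inside a tool. Note that du reports allocated disk blocks, so its numbers can differ slightly from a plain byte count like this one:

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files under path.

    Rough Python equivalent of summing `du` output; symlinks are
    skipped so a link back into the tree isn't counted twice.
    """
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total
```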
:-) I'm very much for mails on webserver stops, though.
[14:57:00] a930913: Wouldn't work, as the OOM kills the outer loop as well.
[14:57:12] (while ! $JOB; do sleep 5; done)
[14:58:33] BTW, after experimenting recently with Perl's Dancer, I'm totally hyped on FCGIs & Co. If we could ease the process for that in the same way we do for Tomcat, that would be cool.
[15:00:50] * YuviPanda intends to set up uwsgi at some point
[15:01:54] scfc_de: I'm building the mongo user creator in a nice extensible way. should be easily reusable for postgres as well
[15:02:15] scfc_de: I was told that'll be possible in a week or so (postgres for everyone)
[15:04:28] I haven't followed that closely; akosiaris handles that?
[15:04:43] scfc_de: yup, talked to him a while ago
[15:05:37] scfc_de: except there's no way to 'fix the problem'
[15:05:43] username = 'u_' . self.tool.uid
[15:05:44] I just wrote that
[15:05:45] 'OOM' by itself is not a 'problem' one can 'fix'
[15:05:46] garalskgh
[15:05:56] then it's just guessing what caused the OOM
[15:08:01] yeah
[15:08:02] valhallasw: It's certainly non-trivial to debug, but blindly restarting and restarting the webservice doesn't sound like a solution either. If the latter is necessary, IMHO it would make much more sense for those tools to increase the memory limit.
[15:08:17] plus it requires the tool author to come back and do something just to keep it running
[15:08:34] scfc_de: it should be restarted, say, no more than 2 times every X hours (24? 48?)
[15:08:42] just on OOM
[15:08:58] and then should email the maintainers
[15:09:05] so no restart loops
[15:09:38] YuviPanda: Once every day would be fine for me; I'm more afraid of "every ten minutes" :-).
[15:10:01] scfc_de: yeah, once a day sounds ok. second time it happens kill it
[15:11:57] ... but that requires that we have an external watchdog, and that needs to know how the webservice was started (not for now with only "webservice start", but for the future).
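The restart policy sketched above ("no more than 2 times every X hours, then email the maintainers") is a classic sliding-window throttle. A minimal sketch of that policy; the class name, limits, and the injectable clock are assumptions for illustration, not code from the actual webservice tooling:

```python
import time

class RestartThrottle:
    """Allow a limited number of restarts inside a sliding window.

    Models the policy discussed in the chat: restart an OOM-killed
    webservice at most `limit` times per `window` seconds; once the
    budget is exhausted the caller should stop restarting and alert
    the maintainers instead. `clock` is injectable for testing.
    """

    def __init__(self, limit=2, window=24 * 3600, clock=time.time):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.restarts = []

    def may_restart(self):
        now = self.clock()
        # Drop restart timestamps that have fallen out of the window.
        self.restarts = [t for t in self.restarts if now - t < self.window]
        if len(self.restarts) < self.limit:
            self.restarts.append(now)
            return True
        return False  # budget exhausted: email the maintainers instead
```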
So, on "webservice start" we would start two jobs, one for the webserver and one for the watchdog that gets passed the information on how to restart the webservice?
[15:12:21] * (because an internal watchdog would be killed with the webserver)
[15:13:02] (Or we could save the "how to" information in Redis with the port and use that.)
[15:13:18] scfc_de: hmm, proxylistener already has 'when was this started' information
[15:13:22] well, not 'when' but that's easy enough to add
[15:13:46] scfc_de: so only thing to do is to figure out a way to trap an OOM, and check when it was last started, and restart appropriately
[15:16:22] scfc_de: only question is how to trap the OOM
[15:16:25] But doesn't proxylistener free the entry when the webserver shuts down? If on the other hand we have one watchdog process per webserver process, we can pass all the information along we deem necessary, and the watchdog could keep its own tabs.
[15:17:36] scfc_de: hmm, actually yeah. we can use the proxylistener 'free' event as a trigger, even.
[15:17:43] YuviPanda: "webservice stop" = stop the watchdog, after that, stop the webserver. On the other hand, "webservice start" = "start webserver", pass job ID to "start watchdog", watchdog = if job $ID no longer with us, record that fact, look in the past, start a new one or complain.
[15:17:44] scfc_de: yeah, but that's one extra process for every ws :)
[15:18:21] scfc_de: also, SGE should support a way to do --continuous but with only X restarts
[15:19:02] I wouldn't hold my breath for the latter :-). And the code doesn't look very welcoming for patches :-).
[15:19:03] Why can't there be one process that scans all webservices?
[15:19:24] Or all jobs for that matter.
[15:19:50] I think security-wise it's easier if a tool can only restart itself rather than one root process restarting all others.
[15:20:20] *But* for the watchdog job to work, we'd need to allow jobs to submit jobs.
[15:20:38] scfc_de: Not even restart, just a web interface where you can tick, alert me if this process stops.
[15:20:49] scfc_de: what's the problem with having the same behavior as --continuous?
[15:20:55] if they can thrash, so can --continuous normal jobs
[15:21:17] a930913: that's something I'll be working on months from now. custom icinga for labs
[15:21:22] or some form of custom monitoring, at least.
[15:21:35] 'alert me when this URL returns non 200', 'alert me when this job is no longer running', etc
[15:21:36] a930913: I think having "webservice start" start the webserver with "-m a" should work for everyone.
[15:22:10] scfc_de: submit patch? ;)
[15:22:12] scfc_de: Yeah. Does webservice pass parameters through?
[15:22:18] please tell me webservice is puppetized
[15:22:19] I think it is
[15:22:34] YuviPanda: --continuous is a "while ! $COMMAND; do sleep 5; done" loop. If $COMMAND OOMs, the loop gets killed, so you never get a chance to restart.
[15:22:58] YuviPanda: Yes, it is, but the tool-lighttpd (or is it lighttpd-starter?) isn't.
[15:23:08] scfc_de: bah
[15:23:14] scfc_de: oh wait, no, I remember the starter was puppetized
[15:23:17] I remember seeing the source for it
[15:23:21] and going 'bah perl'
[15:23:31] or bah 'something'
[15:23:33] a930913: You mean "-m e"? No, they're hard-coded AFAIR.
[15:23:49] Perl is beautiful!
[15:24:10] I've no qualms against well written perl, just I don't know it
[15:24:10] YuviPanda: I stand corrected: They are.
[15:25:24] YuviPanda: And they're bash scripts! :-)
[15:25:32] scfc_de: ah, yeah.
[15:25:33] bash
[15:25:33] not perl
[15:26:00] scfc_de: I want to learn perl one of these days, but hard to find a use case that python doesn't solve for me
[15:26:58] which isn't the case with other languages I want to learn (C, Scheme, Scala, etc)
[15:28:28] YuviPanda: For me, it's the other way round: I always look at Python and think I should do something with it, but then I look at the clock and take Perl :-).
Thankfully, WMF forces a bit of variation on me :-).
[15:28:41] scfc_de: hehe :)
[15:28:57] scfc_de: I avoid bash scripts whenever possible, though. too risky
[15:29:15] "forgot to double quote and your prefix has a space? too bad, you have lost / now"
[15:31:04] scfc_de: shelling out in general also feels a little icky to me
[15:33:53] YuviPanda: Share the sentiment. There's nothing better than real data types. Not strings that you need to (de-)escape yourself when they get passed from one program to the next.
[15:35:43] scfc_de: :D
[15:44:43] scfc_de: maybe "blindly restarting and restarting the webservice" isn't 'the solution', but it most certainly is what people do
[15:45:04] scfc_de: because, given a lack of information, the only thing I *can* do as maintainer is restart the server, and hope it stays up
[16:11:14] valhallasw: Well, you get the information that the server OOMed, but *I* would certainly just restart the server in 99 % of all cases. However, I'm concerned that if we "teach" tool maintainers to "just press reset!", we're gonna miss "real" errors that no one reports because they expect a webserver to die every x minutes.
[16:23:58] scfc_de: well, you don't really get that information either, as you just see 'oh, the webserver is not running'
[16:24:21] you already have to know how to use qacct to get any information at all
[16:35:36] scfc_de: I'm putting mongo on hold for now, until there's a way around this.
[16:35:38] indefinitely
[16:35:38] sigh
[16:36:25] valhallasw: Yes; but if you get the error "libgcc.so.0 something ...", it doesn't tell you that the job OOMed, either. (And if possible, we should make both errors more explanatory, but ...)
[16:37:45] YuviPanda: Why not pass credentials out manually until then?
[16:38:14] (Or self-serving, as on #wikimedia-operations: And something self-serving? A daemon on tools-mongo or (properly reviewed) sudo rule that allows individual tools to create databases?)
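The "forgot to double quote and your prefix has a space" hazard above is exactly what shlex.quote guards against whenever a command line has to be assembled as a string. A small illustration (the function and the rm example are hypothetical, chosen to mirror the joke in the chat):

```python
import shlex

def safe_rm_command(prefix):
    """Build an rm command with the argument safely quoted.

    Without quoting, a prefix containing a space would split into
    multiple arguments, so `rm -rf $PREFIX/` with PREFIX="/home/me "
    would happily operate on /home/me AND /. shlex.quote only adds
    quotes when the string actually needs them.
    """
    return "rm -rf %s/" % shlex.quote(prefix)
```

Better still, pass an argument list to subprocess instead of building a shell string at all; then no quoting is needed.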
[16:38:37] scfc_de: yeah, that case is maybe even worse :-p
[16:38:49] (Disclaimer: I don't intend to use MongoDB, so don't waste your time because of me :-).)
[16:39:16] scfc_de: that's an option, yeah.
[16:39:25] scfc_de: then again, 'tool not starting' is significantly easier to debug than 'webserver crashes randomly'
[16:39:57] scfc_de: self-serving would be good (something like: make an empty mongo.conf.json file in your root directory, it'll fill with db info in 5 mins!)
[16:40:26] scfc_de: but I'm concerned about general usability of Mongo now. seems a bit half baked if it doesn't even support things like that
[16:40:29] scfc_de, the job scheduler seems to be stuck. And has been for 5 days now when submitting to cyberbot-exec
[16:40:40] apparently the most commonly used production config doesn't use auth
[16:41:49] "queue cyberbot marked QERROR as result of job 1468366's failure at host tools-exec-cyberbot.eqiad.wmflabs". Hmmm.
[16:42:07] "can't close file usage: No space left on device". Let's see.
[16:43:21] It looks as if DNS can't resolve tools-exec-cyberbot (again -- I'm pretty sure we had the same problem some weeks ago.)
[16:44:02] /facedesk
[16:45:46] "ssh 10.68.16.39" works, /var is full.
[16:46:47] Meaning?
[16:47:39] scfc_de, ^
[16:48:06] That maybe SGE on that host got a hiccup and reported as unavailable to the master. Let's wait a few minutes and see if it normalizes or if we need to reboot the host.
[16:48:30] !log tools tools-exec-cyberbot: No DNS entry (again)
[16:48:32] Logged the message, Master
[16:48:39] Did you do something to it yet? Because it's been stuck for 5 days.
[16:48:46] !log tools tools-exec-cyberbot: rm -f /var/log/diamond/diamond.log && restart diamond
[16:48:48] Logged the message, Master
[16:48:51] Cyberpower678: ^
[16:49:05] 512M free now on /var.
[16:49:57] No change
[16:50:14] !log tools qmod -cq cyberbot@tools-exec-cyberbot.eqiad.wmflabs
[16:50:16] Logged the message, Master
[16:50:53] And they seem to be starting again.
[16:51:03] All tasks re-initialized.
[16:51:20] scfc_de, thanks and here * <-- A barnstar for you.
[16:51:57] np
[16:53:29] andrewbogott_afk: tools-exec-cyberbot can't be resolved (again). IIRC last time you restarted pdns and waited for an hour or so until the negative cache entry in dnsmasq got purged. I think that'd be necessary now as well.
[16:55:57] scfc_de: we should figure out something for the /var problem soon, even if it requires downtime
[16:56:03] err and the /tmp
[16:56:07] scfc_de: and we should make lighty create pid files in /run
[16:56:08] not in /tmp
[16:59:38] YuviPanda: Hmmm. They're in /var/run, a tmpfs. But I'm pretty sure that recently a "webservice start" was stuck because /tmp was full. Or was it just because the socket for php couldn't be created?
[16:59:53] scfc_de: socket for PHP, IIRC
[17:00:01] either way, we should make a bigger /tmp available
[17:00:22] users shouldn't have to get tmp to be on NFS (sloooow)
[17:00:34] andrewbogott_afk: You're magic! Just calling out your name made "tools-exec-cyberbot.eqiad.wmflabs" work again.
[17:00:40] hehe
[17:08:34] akosiaris: any idea at what point we can start handing out postgres user accounts to everyone? I think I've a script almost ready
[17:09:31] scfc_de: I'm going to give up on mongo for now, and just get postgres to everyone
[17:11:11] ah, and it has a python3 package too
[17:11:12] sweet
[17:12:52] !log tools deleted tools-mongo. MongoDB pre-allocates db files, and so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non-preallocating version is 'not meant for production', so putting on hold for now
[17:12:55] Logged the message, Master
[17:45:35] hi, do we have a list with all tool labs IPs?
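The preallocation problem in the !log entry maps to two mongod settings from the MongoDB 2.x / MMAPv1 era. Whether Tool Labs tried exactly these flags isn't stated in the log, so treat this as background:

```ini
# mongod.conf sketch (MongoDB 2.x, MMAPv1) -- illustration only
smallfiles = true    # smaller data files, less preallocation per db
noprealloc = true    # disable data-file preallocation entirely;
                     # upstream documented this as intended for
                     # testing, not production use
```

With MMAPv1, even an empty database cost a preallocated data file plus journal files, which is why handing one database to every tool filled the disk "with 0 data".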
[17:45:38] or range
[17:50:38] Wikimedia Labs / Infrastructure: Labs: Enable "Puppet freshness" checks in icinga for cvn project - https://bugzilla.wikimedia.org/66573#c1 (Tim Landscheidt) (Not strictly relevant for "Icinga" (as in: alerting without user action), but you can see the Puppet status of your instances under [[wikitech:S...
[17:53:59] scfc_de: does anyone use toolsbeta?
[17:54:15] scfc_de: I checked out elasticsearch as well. Nothing for shared/multitenant environments
[17:54:20] scfc_de: they don't even have auth :|
[18:27:06] someone knows the tool labs IP range?
[18:37:02] Steinsplitter: External or internal? Internal, it uses 10.something.
[18:38:42] scfc_de: externals, https://wikitech.wikimedia.org/wiki/IP_addresses seems outdated
[18:40:21] Steinsplitter: andrewbogott_afk updated the Labs addresses in February, and they should still be up to date.
[18:40:31] k, thx
[19:48:26] Wikimedia Labs / Infrastructure: Enable ipv6 on labs - https://bugzilla.wikimedia.org/35947#c21 (Tim Landscheidt) a:Ryan Lane>None http://permalink.gmane.org/gmane.org.wikimedia.labs/2651: | > Of particular interest would be to hear if there are plans to IPv6 | > enable the labs web proxy server...