[00:53:51] petan: Coren around?
[00:53:57] awjr: is having trouble adding me to a service group
[00:54:05] it tells him he's not part of it, even though I can see that he is
[00:54:05] you beat me to it YuviPanda
[00:54:10] 'local-bingle'
[00:54:13] :)
[00:54:21] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaServiceGroup&action=addmember&projectname=tools&servicegroupname=local-bingle&returnto=Special%3ANovaProject
[00:54:32] results in 'Not in service group - you must be a member of this service group to perform this action'
[00:55:01] even though i am in the group (i set it up...)
[00:56:44] Tried the usual log out/log in for session failure?
[00:56:57] awjr: ^ log out and back in?
[00:57:18] harumph
[00:57:45] ok, tried it, still doesn't work :(
[00:59:17] awjr: I guess we caught the admins in a bad timezone :)
[00:59:41] oh well
[00:59:55] awjr: but become works for you
[00:59:57] so that's good
[01:00:03] yeah so i can at least get the tool set up
[01:00:07] indeed
[01:00:34] awjr: I'm going to be around for a while now anyway, poke me if you hit snags :)
[01:00:58] thanks YuviPanda, i need to go soon so i'll probably not get totally done tonight
[01:01:03] okay :)
[01:01:09] enjoy your weekend, awjr
[01:23:58] zz_YuviPanda: bingle now running on tool labs :)
[08:55:39] awjr_away: :)
[10:38:43] i have a question (i know i need to read the manual but it's kind of urgent): how do i get the top-edited articles for a certain wikipedia? does anyone have a ready script, or can someone explain quickly what the best query would look like? i understand that the edit count for articles is cached somewhere
[10:43:05] how can i run mysql in labs? I ran this but an error was returned
[10:43:07] ladsgroup@tools-login:~/pywikipedia$ mysql nlwiki_p rezanl.txt
[10:43:09] ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
[10:49:30] Amir1: try $ sql nlwiki_p rezanl.txt
[10:49:32] Amir1: you're missing the connect line
[10:49:43] ok
[10:49:53] sql is a special wrapper around mysql which takes care of authentication and picking the right host
[10:50:04] legoktm: thank you
[10:50:15] np :)
[10:50:17] why are you still awake :P
[10:50:31] i was about to go to sleep xP
[10:51:42] Amir1: are you creating some query to create a list of pages that will have to be refreshed periodically?
[10:51:56] fale: no
[10:52:01] Amir1: ok :)
[10:52:14] I'm importing articles to wikidata
[10:52:24] :D
[11:54:33] * YuviPanda considers setting up some form of map reduce environment on betalabs
[11:54:36] err
[11:54:37] toollabs
[11:54:38] not betalabs
[11:54:39] grr
[12:28:56] YuviPanda: why map reduce and not generic parallel computation stuff?
[12:29:45] YuviPanda: SGE make *grin*
[12:30:05] of course, this exists. It's the /other/ qmake.
[12:33:51] If I have to find all the pages that include a certain string (or regex), I can use three methods: download all pages from the wiki and parse them one by one (surely deprecated), download a dump and run a pywikipedia/replace.py script, or use a mysql query. Which one of the last two is the best?
[12:36:01] valhallasw: primarily for things like Pig or Hive
[12:36:10] valhallasw: lets you write 'queries' in an SQLish way, but far more powerful
[12:36:21] what fale asked for could be done that way, for example ;)
[12:36:27] fale: dump + scan
[12:36:35] fale: dumps are already on tool labs, so you can just scan them
[12:37:28] YuviPanda: in the mysql db there is the text of every page... so even a mysql query can do the job ;)
[12:37:47] fale: afaik page text is not replicated?
[12:38:16] valhallasw: uhm, I haven't checked, but I think it's there
[12:38:21] it's not, IIRC
[12:38:35] IIRC the cluster doesn't even store the text in mysql :D
[12:38:38] fale: for wikimedia wikis, page text is not stored in the text table
[12:38:42] er, page table
[12:38:49] revision.
[12:39:01] * valhallasw reboots brain
[12:39:38] valhallasw: something like this would be something that sets up nicely on Hadoop, for example
[12:40:15] YuviPanda: ah, ok.
[12:40:42] All public wikis will be replicated to the LabsDB servers, with private user data redacted.
[12:43:17] fale: no revision text tho
[12:43:45] YuviPanda: I see... so the WikiTech docs have to be fixed
[12:44:17] fale: yeah. the reasons were technical and legal
[12:44:56] I can see the tech reasons, but legal... not really. On TS the full text was replicated, IIRC
[12:45:35] yeah, something about revdeletes not going through i think
[12:45:46] unsure tho
[12:47:18] YuviPanda: I see, thanks :). So there is only one way to retrieve this kind of data: using dumps
[12:47:30] yup
[12:47:50] fale: addtravel was working on 'dumpscan' explicitly for this
[12:49:12] YuviPanda: nice, I'm thinking about expanding lists to support .py fixes files (for creating lists of pages matching a certain regex)
[12:49:24] ah
[12:49:26] nice
[12:50:22] YuviPanda: right now lists only supports mysql queries, but if those are not capable of looking for strings of text... I think the .py extension is needed
[12:50:23] valhallasw: http://www.mappian.com/blog/hadoop/using-hadoop-to-analyze-the-full-wikipedia-dump-files-using-wikihadoop/ :)
[12:50:56] however
[12:50:59] 'We are running it on a three node mini-cluster with quad-core machines and the job takes about 14 days to parse the entire English Wikipedia dump files.'
[12:51:45] YuviPanda: not really efficient
[12:52:04] YuviPanda: consider that they are talking about a full-history dump, I would say
[12:53:52] fale: they seem to be actually considering all diffs, so that's even more than full-history
[12:53:56] so... not bad
[12:54:11] i guess without history it'll be much less
[12:54:41] YuviPanda: the version they are using is 5.5TB... I think the latest-version-only dump is way less
[12:56:22] YuviPanda: it's 9.2GB bz2 compressed, so less than 100GB
[12:56:28] *200
[12:56:40] yeah, so should be much faster
[12:56:51] we should play with it, methinks :D
[12:58:00] YuviPanda: also, hadoop performance increases linearly with the number of nodes ;)
[12:58:16] yeah and if we set things up on puppet, should be trivial to spawn new nodes
[12:59:07] * fale is not really clear on what puppet is and does
[13:01:31] New review: coren; "Protection against library injection is actually provided for by ld.so at runtime (it will not honor..." [labs/toollabs] (master) - https://gerrit.wikimedia.org/r/71112
[13:02:53] New review: coren; "Still ok after rebase" [labs/toollabs] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/70733
[13:02:53] Change merged: coren; [labs/toollabs] (master) - https://gerrit.wikimedia.org/r/70733
[13:04:07] New review: coren; "Also still good after rebase." [labs/toollabs] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/70734
[13:04:07] Change merged: coren; [labs/toollabs] (master) - https://gerrit.wikimedia.org/r/70734
[13:07:24] fale: also I'm not sure about using Hadoop here, since it might be too large a hammer for this particular nail
[13:09:44] YuviPanda: btw: if you want good real-time performance, hbase can be a good solution ;)
[13:10:03] yeah but the question is how do we map the dumps into the system in an easy way
[13:10:06] * YuviPanda looks into HBase
[13:10:25] i'm just exploring things.
[13:10:29] YuviPanda: surely a calculation has to be done weighing the costs against the results
[13:10:51] fale: in terms of tool reliability, having some form of logging / monitoring would be more important than this
[13:11:18] HBase doesn't seem that useful.
[13:11:29] YuviPanda: hadoop has awesome results on a huge dataset (ours is large but I would not define it as huge) and with tens or - better - hundreds of nodes
[13:11:37] yeah
[13:11:43] but it runs with 3-4 nodes too
[13:11:48] I guess hadoop is overkill here
[13:12:26] YuviPanda: hadoop can even run on a single node... but then it has performance comparable to any other map-reduce shell script ;)
[13:12:32] indeed
[13:14:04] YuviPanda: I've used hadoop/hbase only for other kinds of data (financial data) so I can't say how fast it would handle wiki data
[13:14:34] in this case I think the 'cost' is going to be 'how much time will we spend on setting it up' and the results would be 'what new things can people do with it'?
[13:14:39] not sure about the answers to both
[13:14:50] setting up Redis was ridiculously justifiable :)
[13:15:31] YuviPanda: on the cost side, also consider the time to (eventually) rewrite rules that are already written for other systems (mysql/pywiki) for hadoop
[13:15:46] true
[13:16:57] YuviPanda: it can be interesting if enough nodes are available, since hadoop handles very well a situation in which the same data are used simultaneously by more than one request
[13:17:19] fale: indeed!
[13:17:25] data locality optimization, etc
[13:17:30] yep
[13:17:42] I guess we can set up 3-4 nodes easily on toolsbeta without Coren hitting me with a stick
[13:17:52] but they'll be slow, because VM
[13:18:04] YuviPanda: am I wrong, or is tool-labs based on amazon hw?
[13:18:19] fale: no, it's running in the foundation's data center
[13:18:23] fale: it's based on OpenStack
[13:18:30] YuviPanda: I see
[13:18:33] which has some of the capabilities of AWS, but not all
[13:19:03] is there a local HD or is it all SAN/NAS HD?
[13:19:29] fale: I *think* there's a local HD, but I'm not completely sure about the architecture
[13:19:37] I know that we have some stuff on Gluster, and that's 'going away' soon
[13:20:40] that could be a good point to start, I think: understanding the physical server structure
[13:21:43] indeed
[13:21:51] fale: Coren is also changing it right now
[13:21:51] so
[13:22:16] YuviPanda: Coren has physical access to the datacenter?
[13:22:46] he's sorting out what he called the current 'NFS mess', so...
[13:22:51] not the data center itself
[13:23:12] I see
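For reference, the plain 'dump + scan' approach fale and YuviPanda settled on above (no Hadoop needed) can be a short shell pipeline. This is only a sketch: the dump path and the pattern are placeholders, so check where the public dumps are actually mounted on tools-login before running it.

    # Minimal sketch: print the titles of pages whose wikitext matches a regex,
    # scanning a pages-articles dump directly. The path below is an assumed placeholder.
    DUMP=/public/datasets/public/nlwiki/latest/nlwiki-latest-pages-articles.xml.bz2
    PATTERN='some-regex-here'   # placeholder regex

    bzcat "$DUMP" | awk -v pat="$PATTERN" '
        /<title>/ { t = $0; sub(/.*<title>/, "", t); sub(/<\/title>.*/, "", t) }
        $0 ~ pat && t != "" && t != last { print t; last = t }
    '

Scanning a ~9.2GB bz2 pages-articles dump this way still takes a while, but nowhere near the 14 days quoted above for the full-history Hadoop job.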
[14:17:43] New review: Tim Landscheidt; "@Coren: Thanks. I'll still need to fix Architecture: and the Depends: stuff, so I keep at -1 for a ..." [labs/toollabs] (master) - https://gerrit.wikimedia.org/r/71112
[14:29:23] Coren: are you part of the ldap/ops group?
[14:29:27] * YuviPanda needs a small favor
[14:29:46] need to tick a checkbox (or something) in Gerrit :) Nothing major, I swear
[14:31:58] fale: http://lintool.github.io/Cloud9/docs/content/wikipedia.html
[15:16:34] anyone with a tools account who can help me with a small experiment?
[15:16:45] like, really, small!
[15:25:34] YuviPanda: sure
[15:25:38] !ask
[15:25:38] Hi, how can we help you? Just ask your question.
[15:25:40] ;-)
[15:26:20] valhallasw: can you read /home/yuvipanda/testing/hi
[15:26:20] ?
[15:26:27] from your account?
[15:26:35] and also from a tool account?
[15:26:41] should be able to, but I'm just double checking
[15:27:02] YuviPanda: no
[15:27:17] also not from nlwikibots
[15:27:22] hmm
[15:27:27] drwxrwxr-x 2 yuvipanda svn 15 Jun 29 15:26 testing
[15:27:37] drwx------ 13 yuvipanda svn 4096 Jun 29 15:27 /home/yuvipanda/
[15:27:40] but drwx------ 13 yuvipanda svn 4096 Jun 29 15:27 yuvipanda
[15:27:41] yeah
[15:27:47] set a+x
[15:28:00] read rights are not necessary, execute (enter directory) rights are
[15:28:23] hmm
[15:28:50] i'm a little wary of messing with perms on my home dir, doing it wrong and locking myself out :D
[15:29:03] :D
[15:29:08] chmod a+x /home/yuvipanda
[15:29:10] should be OK
[15:29:14] that's what I did :p
[15:29:26] well, okay ;)
[15:29:27] try now
[15:29:32] can you access?
[15:29:37] YuviPanda: yep
[15:30:01] valhallasw: but you still can't read, for example, /home/yuvipanda/.rnd
[15:30:05] but you can list it
[15:30:07] is that right?
[15:30:52] my (weak) permissions fu says that's correct, but it is 'weak'
[15:47:13] YuviPanda: no
[15:47:30] YuviPanda: no +r so no listing
[15:47:45] what do you see when you do an ls /home/yuvipanda
[15:47:45] ?
[15:47:48] YuviPanda: and no +r on /home/yuvipanda/.rnd so I can't read that
[15:47:58] ls: cannot open directory /home/yuvipanda: Permission denied
[15:48:07] but ls /home/yuvipanda/testing works
[15:48:16] so +x is traversal rather than listing permission
[15:48:24] yes
[15:48:32] +r is listing permission
[15:48:32] makes sense :D
[15:48:54] but I can explicitly ask about .rnd
[15:48:54] -rw------- 1 yuvipanda svn 1024 Jun 28 16:19 /home/yuvipanda/.rnd
[15:49:29] and I can read your .vimrc >:_
[15:49:45] but that was the point of the exercise to begin with iirc
[15:50:01] :)
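The permissions experiment above comes down to the difference between +r (listing) and +x (traversal) on a directory. A quick way to reproduce it, using the same paths as in the log (run the second half from another account; nothing here is specific to Tool Labs):

    # On the first account: make the home directory traversable but not listable.
    chmod a+x /home/yuvipanda            # others may enter, but not list, the home dir
    mkdir -p /home/yuvipanda/testing
    chmod a+rx /home/yuvipanda/testing   # others may both list and enter this subdirectory
    touch /home/yuvipanda/testing/hi

    # From a second account:
    ls /home/yuvipanda                   # fails: no +r on the directory, so no listing
    ls /home/yuvipanda/testing           # works: +x on the home dir allows traversal
    ls -l /home/yuvipanda/.rnd           # works: you can stat a name you already know
    cat /home/yuvipanda/.rnd             # fails: the file itself has no +r for others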
[15:50:10] valhallasw: that's one of the points, but that's not the sole point
[15:50:20] valhallasw: i'm testing out a way of making ipython notebook available to tool users
[15:50:27] with public permalinked shareable urls for the output
[15:50:46] oooooh.
[15:51:03] so by default their output will be private, under their userspace. If they choose to publish, it'll move to a properly permissioned folder in their userspace
[15:51:16] and then the tool will just read those and serve them with appropriate transformations :)
[15:51:43] I see
[15:52:05] valhallasw: so this... should work
[15:52:24] ipython brings with it security issues, so we can't easily open it up to random users (no matter how much I'd like that)
[15:52:33] but restricting it to tool users who already have a login seems okay
[15:52:45] for now it can run on -login, if that gets too much we can easily set up another system for it :)
[15:52:47] YuviPanda: why not just let tool users run ipynb themselves?
[15:53:09] valhallasw: yeah, that's what this will do. Will need to proxy via ssh tho
[15:53:17] it uses websockets, so we can't actually just serve them via apache
[15:53:22] right.
[15:53:37] i'm just writing a tiny script that'll automate the 'ssh in, setup ipy + deps, start server on appropriate port, and setup an ssh tunnel'
[15:53:52] so instead of telling people 'do x, y, z oh and then a, b, c' you just go 'curl
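The log cuts off here, but the tunnel being described is essentially standard ssh port forwarding. A rough sketch, with the caveat that the hostname and port are assumptions (8888 is just IPython's default) and the real script would presumably also handle installing IPython and its dependencies:

    # On the remote side (e.g. tools-login): start the notebook server headless.
    ipython notebook --no-browser --port=8888

    # On your own machine: forward a local port to it over ssh. The websocket
    # traffic goes through the same tunnel, which is why a plain apache proxy
    # wasn't an option above.
    ssh -N -L 8888:localhost:8888 you@tools-login.wmflabs.org

    # Then open http://localhost:8888/ in a local browser.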