[00:19:03] odder: that's a min, but not where the trend changes
[07:04:59] Need some help regarding the wiki dump.!
[07:05:04] Someone??
[07:08:26] apergos: ^
[07:08:52] \
[07:08:55] hi, what's up?
[07:09:11] apergos, hey.!
[07:09:31] what's your dump issue?
[07:10:33] i have downloaded the dump from a torrent given on the dump.wikipedia.org
[07:10:41] and its in an xml format
[07:10:50] right
[07:11:00] its an xml dump of en-recent articles
[07:11:12] what's the name of the dump file?
[07:11:28] (so I know which of several different content files we're talking about)
[07:12:03] enwiki-20130604-pages-articles
[07:12:15] ok great
[07:12:40] okay so now its a 44.2 GB extracted xml file.
[07:12:40] so how can I help?
[07:12:48] yes, these are pretty large.
[07:12:52] how can i read it.!
[07:13:04] what do you want to use it for?
[07:13:23] i had tried that mwxml2sql parser but could not get that to work.
[07:13:28] e.g. are you wanting to set up a local copy of Wikipedia on a MediaWiki installation?
[07:13:57] i have to do some statistical analysis of words used in the articles, their frequency and all.
[07:14:04] ah, what command did you run exactly and what was the problem, did it give you an error message?
[07:14:49] mwxml2sql is intended to turn that file into page/revision/text tables in sql, which would be suitable for importing into a MediaWiki installation; is this going to work for your needs?
[07:15:25] as i told u i have to do some statistical analysis.!
[07:15:39] yes
[07:15:50] so, which tool would u suggest me 2 use for that.!
[07:16:39] well our analytics team has some tools for generating various charts that show numbers of edits/pages/new editors etc per month but I don't know that they will have anything specifically for you
[07:16:44] it's doubtful
[07:17:22] hmmm...
[07:17:53] so how does that tool of urs work.
[07:18:22] does it generate things like frequency of a word in articles!
[07:18:23] ?
[07:18:29] no not at all
[07:18:51] its purpose is to convert the xml to sql so that the output can be fed into a MySQL database
[07:19:11] it's for making a local copy of Wikipedia or your favorite wiki
[07:19:19] or a local copy of some subset of the content
[07:19:36] if you want to do queries you would need to write some code for those
[07:20:00] lemme see, there is a list of research tools around somewhere (I still guess you'll wind up writing code), just a sec
[07:20:11] okay.
[07:21:47] I think writing a text analysis tool for XML files is out of scope for this Wikimedia channel, it's too generic :)
[07:22:15] :-)
[07:22:33] http://meta.wikimedia.org/wiki/Research:Resources#Research_Tools:_Statistics.2C_Visualization.2C_etc.
[07:22:36] there's some things here
[07:22:45] ... the code of the Analytics team which is free is available under https://git.wikimedia.org/repositories/ - search for "analytics"
[07:22:49] existing statistics here: https://meta.wikimedia.org/wiki/Statistics
[07:24:07] okay lemme check it out.!
[07:24:48] and one more page with an overview of datasets: https://meta.wikimedia.org/wiki/Research:Data
[07:25:01] that should give you some places to look, there's also the mailing list for researchers,
[07:25:17] if you get completely stuck. good luck!
[07:26:45] shanu1991: the file is an XML file
[07:26:56] shanu1991: you should make your own program to parse that and do the analysis you want
[07:27:09] I'm guessing that's the whole point of your assignment :-)
[07:34:46] paravoid, i guess ultimately i would have to.
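
(A minimal sketch of the kind of script being described above, i.e. "make your own program to parse that and do the analysis you want". It assumes Python 3 and the extracted enwiki-20130604-pages-articles XML file mentioned in the conversation; the tag handling follows the MediaWiki export schema and the word regex is only a placeholder, not a recommendation.)

    # Streaming word-frequency count over a pages-articles dump.
    # iterparse() streams the file, so the whole 44 GB never sits in memory.
    import re
    import xml.etree.ElementTree as ET
    from collections import Counter

    DUMP = "enwiki-20130604-pages-articles.xml"   # adjust path as needed
    WORD = re.compile(r"[A-Za-z']+")              # crude tokenizer, adjust as needed

    counts = Counter()
    for _, elem in ET.iterparse(DUMP, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]         # strip the export-schema namespace
        if tag == "text" and elem.text:
            counts.update(w.lower() for w in WORD.findall(elem.text))
        elif tag == "page":
            elem.clear()                          # drop finished pages to keep memory bounded

    for word, n in counts.most_common(25):
        print("%10d  %s" % (n, word))

(The text elements still contain wikitext markup, templates, links and so on, so real statistics would want some cleanup before counting; as noted in the channel, that part is plain text processing and has nothing to do with the dump format.)
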
[07:35:09] * shanu1991 checking for the available help.!
[07:35:13] out of curiosity and if I may ask, what exactly do you want to do and why?
[09:10:24] apergos, u thr?
[09:12:59] yes
[09:13:21] shashank_:
[09:13:43] ah what exactly do you need to get as far as statistics, and what project is this a part of?
[09:14:10] just a sec
[09:15:35] yes.!
[09:16:06] so i wanted to ask about this DBpedia.!
[09:16:15] what kind of db is it.~
[09:16:19] can u tell me indetail.
[09:16:29] *in detail
[09:16:58] also it says that it allows to make sophisticated queries.
[09:17:08] ah I haven't really played with it
[09:17:11] can we make the statistical ones there
[09:17:17] you'll need to talk with the people responsible for the project
[09:17:25] ah..
[09:19:51] apergos, what i actually want is given a length or regex find the most probable/most used word.!
[09:20:14] so can I ask, is this a homework assignment of some sort?
[09:21:54] :)
[09:21:57] * Aaron|home needs to sleep
[09:22:16] * apergos hits Aaron|home with a sleepbat
[09:22:23] *boff*
[09:23:04] yeah you can say that. but not totally
[09:23:32] hey shashank_
[09:23:36] ok, well what I would say then is you need to figure out how to write the regexp and the wrapper around it. it's not so hard, but we can't do it for you
[09:23:41] curious, is this a final year project of some sort? :)
[09:24:08] hey YuviPanda ! o/
[09:24:18] hello shanu199 :)
[09:25:01] apergos, i m not asking for that. :/
[09:25:38] so is there any irc for this DBpedia also?
[09:26:07] YuviPanda, not final year its actually some work that i m doing for a friend of mine.
[09:26:13] I imagine but I would have to google it like everyone else
[09:26:38] you don't need a full fledged xml parser you know
[09:26:54] oaky
[09:26:55] you just need a regexp to get the stuff out of the text tags
[09:26:58] *okay
[09:27:18] that's all for the xml file, the rest is whatever you do with plain content
[09:27:39] okay
[09:28:00] but do u have any tools for generating the statistics already written??
[09:28:05] I do not
[09:28:09] maybe there is one somewhere
[09:28:14] you will have to hunt, or just write one
[09:28:28] really I think you could write one faster, if you didn't find one already on those pages
[09:29:41] * shanu199 enjoying this search on wikipedia.! YEAH.!!!
[09:30:25] anyways word frequency tools are way outside the scope of this channel
[11:02:09] apergos,
[11:02:17] need ur help again.!!!
[11:02:57] let's hear it
[11:09:50] so now i am trying to make a local wikipedia repo
[11:10:01] can that xml dump help me soing it.!
[11:10:08] *doing
[11:11:10] i have tried that mwxml2sql
[11:11:39] as u had told that it is used for the sql database for local wikipedia
[11:11:51] apergos, u thr bissuy?
[11:12:30] 8buddy
[11:12:35] *buddy
[11:13:34] yes
[11:13:58] ah so there are instructions in the readme file in the git repo for how to use all those files
[11:14:12] there's also a couple of step by step descriptions (read them closely first) on um
[11:14:25] ya so i have used the -s and -m switch
[11:14:38] but then after that my cpu usage went to 100%
[11:14:40] http://meta.wikimedia.org/wiki/Data_dumps/Import_examples
[11:14:46] and nothing seems to be happening
[11:15:05] also i got the following error
[11:15:06] WHINE: (10) unknown attribute in text tag
[11:15:26] that's ok, one whine means it's going to skip that entry
[11:15:36] did you give it an output file?
[11:15:57] u mean text file??
[11:16:52] well a table prefix string
[11:17:04] no
[11:17:07] ok
[11:17:13] cause none is there.
[11:17:30] ?
[11:17:43] no prefix is there. or that i want.
[11:18:13] so in another terminal if you ls -lt in the directory where you are running the program, do you see any files that have page or revision in the name?
[11:18:46] lemme check
[11:19:17] * apergos wonders idly what the unknown tag is, oughta check that later
[11:20:03] no!!
[11:20:52] in fact can you open a bz ticket with the version of mwxml2sql you are using, the full error message, the command you ran (including the dump file you used) and assign it to me? cause I should whack that anyways
[11:21:12] ok, and I assume you only got the one whine, not piles of them..?
[11:22:59] ya only 1
[11:23:09] and some sql commands on the terminal too.
[11:23:58] did you give it a mysqlfile (-f) option?
[11:25:22] I should really make that mandatory since nothing else is reasonable there
[11:27:05] no
[11:27:13] ok well you definitely need to do that
[11:27:21] okay so i have to give a mysql file also
[11:27:25] !!
[11:27:32] * shashank_ damn.!
[11:27:42] mwxml2sql -s elwiktionary-blahblah-stub-articles.xml.gz -t elwiktionary-blahblah-pages-articles.xml.bz2 -f elwikt-pages-current-sql.gz -m 1.20
[11:27:50] this is a typical invocation of the command
[11:27:51] and i am not very familiar with mysql too.
[11:27:58] it's going to write mysql files out
[11:28:40] okay so then a file with the name alwikt-blahblahs-sql.gx would be written is it.!
[11:28:57] no
[11:29:15] have a look at the usage message (./mwxml2sql --help )
[11:29:26] -s is for the stubs, you've already done that
[11:29:38] -t is for the pages-articles, you've already got that
[11:29:48] -f is for the output file, it's going to write 3 or 4 of them
[11:30:06] but they will end in what you give to -f
[11:30:18] the .gz or .bz2 would mean that they will be written compressed
[11:30:23] that's really all there is to it
[11:30:28] see i have only written : mwxml2sql -s en.xml -m 1.21
[11:30:40] ok well you need the stubs file
[11:30:54] and then the pages-articles file should be give to -t (text)
[11:30:57] *ginen
[11:30:59] *given
[11:31:04] so wn.xml is the stub file (the downloaded dump file)
[11:31:14] ok without text content?
[11:31:17] *en.xml
[11:31:39] see i have downloaded only a single xml.bz2 file.!
[11:31:42] ok
[11:31:48] so do i need to download onemore.!
[11:31:50] so let's back up and look at these dump files
[11:31:51] *one more?
[11:31:52] yes you do
[11:32:25] oaky and then what should be the name for the -f file.!
[11:32:27] *okay
[11:32:27] lets look at the description page for tenwiki since it's less cluttered
[11:32:32] http://dumps.wikimedia.org/tenwiki/20130610/
[11:32:55] if you look at this you see that there is a 'stubs' file (three of them actually)
[11:33:02] which have the description "2013-06-10 00:11:23 done First-pass for page XML data dumps"
[11:33:11] and it explains below: These files contain no page text, only revision metadata.
[11:33:25] so you need the stub file corresponding to the content file you already downloaded
[11:33:38] you with me so far?
[11:34:45] (in which case don't bother with bugzilla, you were giving it a file with the wrong schema so forget that)
[11:34:57] yes
[11:35:00] i m with u
[11:35:18] ok
[11:35:29] so now the -f option is just 'what do I call the output files'
[11:35:44] you will wind up with 3 or 4 output files all of which will end in whatever you pass to -f
[11:35:54] if your name ends in .gz they will be gzipped
[11:36:03] if your name ends in bz2 they will be bzip2 compressed
[11:36:06] that's all there is
[11:36:17] try something random and check it in another terminal after a couple minutes
[11:36:21] you'll see
[11:36:36] so how do i recognize a stub file for the dump i have downloaded?
[11:36:56] so did you look at the tenwiki dump page?
[11:37:11] ya
[11:37:20] ok, do you see the three stub files?
[11:37:41] ya got that
[11:37:50] * shashank_ acted so lame.
[11:37:58] ok, do you see above those three pages- files with similar names?
[11:38:03] that's all there is to it
[11:38:23] you just don't want say the history stub and the pages-articles content dump, it won't work
[11:38:29] see what I mean?
[11:38:50] ya
[11:39:04] ok, have fun
[11:39:17] and one more thing how to get this output into my database?
[11:39:26] import???
[11:40:10] so the link I gave you above to meta has a couple of examples, read them carefully before you try it
[11:40:30] also please check the readme files in the directory with mwxml2sql
[11:41:24] of course I don't discuss enabling extensions or anything, that's outside the scope of these tools
[11:42:00] there's a mailing list you may want to get on: xmldatadumps-l
[11:42:29] where we make announcements about these sorts of tools, people ask questions, discuss issues, etc
[11:42:45] okay
[11:43:18] i have already gone through the ReadMe file but i think you should make a change about the mandatory option
[11:43:26] yep I have already made a note
[11:43:40] or should i raise a ticket for that :D
[11:43:43] nah
[11:43:49] it's not strictly a bug
[11:43:53] (can be of some help here) :p
[11:44:19] yeah, that's true but keeps u bugging around though.!
[11:44:21] :D
[11:44:31] thanks but no thanks :-P
[11:44:44] ok well I need to get back to regular work now, but do get on the mailing list
[11:45:15] there is a tool wpmirror that is supposed to make things simpler, I haven't really tested it out but you could ask about that on the list as well
[11:54:43] okay :)
[16:47:02] Hi. Does anyone know where I can find $wgRateLimits set for Special:Emailuser in wikipedia?
[16:47:13] (Is this information public?)
[16:53:12] dalba: https://noc.wikimedia.org/conf/ perhaps?
[16:53:15] yes
[16:53:24] it was reduced recently
[16:53:51] thank you! :)
[16:54:25] https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php dalba
[16:55:55] odder, That's exactly what I was looking for. Thank you very much.
[18:15:30] legoktm: do you know much about lua?
[18:15:44] Yes, why?
[18:15:45] quick question, If I get a ERR_READ_TIMEOUT from a squid when performing an action on wiki and as far as I can tell the action has completed can I probably presume it has completed correctly?
[18:16:03] addshore: yes
[18:16:07] good :> :P
[18:16:17] I thought so, just wanted to double check :)
[18:16:18] my bot reads "timeout" as success
[18:16:29] hehe :)
[18:16:53] Can it be used to put buttons in templates like http://enwp.org/User:Technical_13/SandBox/AFC_draft
[18:18:58] Erm, what?
[18:19:03] Why would you use lua for that?
[18:20:51] So that everyone sees the button regardless if they have JavaScript on.
[18:22:11] Or is that not how it works?
[18:30:16] T13|mobile: I don't see what lua has to do with it.
[18:32:06] Buttons are an html feature that can be done without js. I was hoping lua could make such a non-js button. Would make it easier for all afc submitters.
[19:37:42] Hi guys, Im lookin for a bot that could work through tor. I only found some "brainstorming" around httplib and http proxies. But nothing about bind the bot to 127.0.0.1:9050 or something like that. Do u guys know where I should look for my answers?
[19:45:37] m00nlit3__, what exactly do you want to do through Tor?
[19:45:46] You won't be able to edit
[19:46:11] It sounds like what you're having trouble with has nothing to do with wikimedia, you should probably ask for help from python or tor
[19:47:31] Krenair : I want to edit a wiki to keep my link not deleted
[19:47:51] ... through tor.
[19:47:53] um yeah, no
[19:47:57] lol
[19:48:22] Chances are it'll be removed anyway if it's not relevant
[19:48:33] Krenair : yes, its for a wiki through tor
[19:48:45] Why are you using Tor?
[19:48:47] that gets vandalized by bots
[19:50:14] Using TOR in some activities is highly recommended, thank you for your help. You are right, I would have asked more on python forum
[19:51:51] Cyberpower678: I figured out how to do full nicks on my droid to ping you....
[19:52:08] T13|mobile, cool.
[19:54:09] m00nlit3__, what activities exactly are you doing where tor is highly recommended?
[21:51:01] don't suppose anyone's seen hoo lately?
[21:51:27] we have
[21:51:33] you just have to be in here at the right time
[21:52:13] *lately*
[21:53:03] I saw him yesterday
[23:24:38] * Jasper_Deng pokes ashley
[23:25:01] Jasper_Deng: hi! what's up?
[23:25:14] when are we going to get global CheckUser?
[23:26:15] * Jasper_Deng believes ashley is working on http://www.mediawiki.org/wiki/Admin_tools_development/Global_CheckUser
[23:26:20] I wouldn't know :-) someone else might
[23:27:06] (I haven't been working on that project for a fair while)
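
(A minimal sketch of the "treat a read timeout as probable success, then verify" pattern addshore and legoktm discuss above, around 18:15. It assumes Python 3 with the requests library; the 503/504 handling, the API parameters and the content comparison in edit_is_visible() are illustrative guesses, not any particular bot framework's behaviour.)

    import requests

    def edit_with_timeout_check(session, api_url, params, timeout=60):
        """Try an edit; on a timeout (client side, or an error page from the
        cache layer), check whether the edit actually landed before treating
        it as a failure."""
        try:
            resp = session.post(api_url, data=params, timeout=timeout)
            if resp.status_code not in (503, 504):    # assumed squid/gateway timeout codes
                resp.raise_for_status()
                return resp.json()
        except requests.exceptions.ReadTimeout:
            pass
        # The request may have completed server side even though the response
        # never made it back through the cache layer.
        if edit_is_visible(session, api_url, params):
            return {"result": "assumed success after timeout"}
        raise RuntimeError("request timed out and the edit does not seem to have been saved")

    def edit_is_visible(session, api_url, params):
        """Illustrative check: fetch the latest revision of the page and
        compare it with the text we tried to save (old-style JSON format)."""
        r = session.get(api_url, params={
            "action": "query",
            "prop": "revisions",
            "titles": params.get("title", ""),
            "rvprop": "content",
            "rvlimit": 1,
            "format": "json",
        }, timeout=30)
        pages = r.json().get("query", {}).get("pages", {})
        revisions = next(iter(pages.values()), {}).get("revisions", [])
        current = revisions[0].get("*", "") if revisions else ""
        return current == params.get("text", "")

(Whether to re-check by content, by edit summary or by timestamp depends on what the bot actually did; the point is only that a gateway timeout is not proof the write failed.)
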