[01:17:50] Hey gang, I can't seem to access the "text" table via Quarry, any ideas?
[01:20:28] drewmutt: how are you trying to do it?
[01:20:40] DESCRIBE text
[01:20:42] (for example)
[01:20:52] Table 'enwiki_p.text' doesn't exist
[01:21:24] drewmutt: some of the tables might actually be views on quarry, i think
[01:21:37] does `select * from enwiki_p.text limit 1` not work?
[01:21:56] No go.
[01:23:19] I'm trying to find revisions where the content contains a key phrase, if that helps.
[01:23:32] huh. that's funny.
[01:23:36] Namely 'dank memes' ^_^
[01:23:46] drewmutt: oh, hm, i don't think you can actually do that though. the actual text is stored externally
[01:24:04] Ohhhh... That would explain the dumps I saw.
[01:24:05] drewmutt: the 'text' table only contains things like:
[01:24:06] +--------+--------------------+----------------+
[01:24:06] | old_id | old_text           | old_flags      |
[01:24:06] +--------+--------------------+----------------+
[01:24:06] |      1 | DB://rc1/5801939/0 | external,utf-8 |
[01:24:06] +--------+--------------------+----------------+
[01:24:36] the old_text just contains a reference to another mysql database on a different host. i don't think that's available on quarry
[01:24:58] if you need to search the text, i think you'll need to scan the xml dumps
[01:25:02] Well, poo. Any protips on how I can pull it off? Do I have to DL a dump, and do some silliness locally?
[01:25:09] !dumps
[01:25:22] (hm)
[01:25:23] For information on how to get dumps from Wikimedia Wikis, see http://meta.wikimedia.org/wiki/Data_dumps . For a how-to on importing dumps, see https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps .
[01:25:55] Is there a REST call for it perhaps?
[01:26:01] if you need to look at old revisions, i think scanning dumps is the only way :(
[01:26:11] (if you need only current pages, you could just search for it :) )
[01:26:21] drewmutt: REST call for what?
[01:26:37] To search revisions, I imagine not.
[01:26:46] probably not
[01:28:01] Butttt the data is somewhere in the WP universe. I mean, worst case I could scrape the diffs.
[01:29:44] I doubt that when you click a diff, it pulls from an XML dump somewhere (but, I've been wrong before).
[01:35:22] no, it pulls it from the external database
[01:35:33] which afaik isn't publicly mirrored
[01:36:52] * drewmutt pouts
[01:37:29] No problem, I'll go dig around in the dump then. Thanks!
[01:38:42] Oh, so how should I go about suggesting an update to the DB diagram (or doing it myself)?
[01:46:04] o.O
[01:47:20] I think updating them is a non-trivial process...
[01:47:24] Are they that far out of date?
[01:54:52] Reedy: If the text table isn't accessible, perhaps that should be noted?
[01:55:12] Cameron11598: You're everrryywhereee
[01:55:38] * Cameron11598 is always watching.
[01:55:43] <_<
[01:55:45] >_>
[01:55:46] Keep that in mind, drewmutt
[01:55:55] * drewmutt acts cool
[02:11:01] drewmutt: What are you trying to do?
[02:11:39] Do we have a labs diagram?
[02:11:43] Use the API?
[02:11:43] Yvette: Hi there, I want to pull up every revision to articles containing a key phrase.
[02:12:04] On the English Wikipedia or somewhere else?
[02:12:09] en
[02:12:24] Yeah, you need the big dumps.
[02:12:42] Okie :/
[02:12:50] I think they're already mounted somewhere, maybe.
[02:12:52] Let's see.
[02:15:22] drewmutt: Yeah, they're available on Tool Labs.
[02:15:35] You want the pages-meta-history dump files, I guess.
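
An aside on the API point raised above: individual revisions can be fetched through api.php even though the replicas only carry external-storage pointers; what you cannot do is search across all revisions that way. A rough sketch, assuming Python 3 with the requests library and a placeholder revision ID:

    import requests

    # Hypothetical example: fetch the wikitext of one revision by ID via api.php.
    # Single revisions are retrievable this way; searching every revision is not.
    API = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "123456789",      # placeholder revision ID
        "rvprop": "ids|timestamp|content",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    rev = page["revisions"][0]
    print(rev["revid"], rev["timestamp"])
    print(rev["*"][:200])  # wikitext; the content key differs on newer API format versions
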
[02:15:37] https://wikitech.wikimedia.org/wiki/Help:Shared_storage#.2Fpublic.2Fdumps
[02:16:30] mzmcbride@tools-bastion-03:/public/dumps/public/enwiki/20170101$ ls -lh enwiki-20170101-pages-meta-history8.xml-p001755616p001791079.bz2
[02:16:35] -rw-rw-r-- 1 abcdefg icinga 979M Jan 6 23:39 enwiki-20170101-pages-meta-history8.xml-p001755616p001791079.bz2
[02:17:23] Compare with "enwiki-20170101-pages-meta-history8.xml-p001755616p001791079.bz2 978.7 MB" at https://dumps.wikimedia.org/enwiki/20170101/
[02:17:47] So you can just read through those files.
[02:18:30] Probably take... two weeks to scan all of them?
[02:18:30] Maybe less.
[02:18:51] enwiki-20170101-pages-meta-current.xml.bz2 24.7 GB takes like 90 minutes, I think.
[02:20:30] Yvette: Two weeks? Even with C?
[02:20:47] Amazon estimation?
[02:20:59] Promise longer than expected for a happier customer.
[02:21:07] That's my usual excuse with bad estimates lately.
[02:21:28] drewmutt: It's mostly I/O speed, probably?
[02:22:07] Disk read and decompression time, I guess.
[02:22:28] And it matters whether you download all the dumps somewhere else first.
[02:22:37] Yvette: Makes sense. I wonder if there's a sneaky way I can do it searching via edit logs or something.
[02:22:43] (on the site)
[02:22:48] Since the bandwidth usage to download that many gigs is pretty hefty.
[02:23:13] drewmutt: I hope not. :-)
[02:23:24] https://en.wikipedia.org/wiki/Special:Statistics says there are about 869 million edits.
[02:23:50] Which is wrong, but in the same ballpark as the actual number of revisions.
[02:23:50] Can I cull for just the last year?
[02:24:03] You only care about edits made since 2016-01-01?
[02:24:17] I doubt doing this over HTTP will be faster.
[02:24:20] It's just too much data.
[02:24:32] * Yvette looks at revs per day.
[02:24:43] drewmutt: Keep in mind, these figures and files cover all Wikipedia pages.
[02:24:47] Not just articles/content pages.
[02:24:54] Most revisions are probably outside of articles.
[02:24:58] Lots of dumb bots making dumb edits.
[02:25:10] Dumb people too, ofc.
[02:25:21] ClueDumbBot pphh.
[02:25:33] Can I get just the article space ones?
[02:25:46] Nope.
[02:25:55] But you can filter the XML dump to certain namespaces.
[02:25:58] Or to certain date ranges.
[02:26:40] Ah, so I don't have to download the whole humdinger?
[02:26:58] Well, if I were doing this, I wouldn't download anything.
[02:27:02] I'd just run it on Tool Labs.
[02:27:07] Since they've already downloaded all the data.
[02:27:16] Can ya link me, plz?
[02:27:21] I did above!
[02:27:52] [21:15] https://wikitech.wikimedia.org/wiki/Help:Shared_storage#.2Fpublic.2Fdumps
[02:27:58] [21:16] mzmcbride@tools-bastion-03:/public/dumps/public/enwiki/20170101$ ls -lh enwiki-20170101-pages-meta-history8.xml-p001755616p001791079.bz2
[02:28:01] [21:16] -rw-rw-r-- 1 abcdefg icinga 979M Jan 6 23:39 enwiki-20170101-pages-meta-history8.xml-p001755616p001791079.bz2
[02:28:56] Oops, sorry, missed it.
[02:29:29] You don't (probably can't) decompress the files on disk.
[02:29:33] You can just read through them compressed.
[02:29:41] But it's a lot of disk reading and a lot of decompression.
[02:29:43] So it may take a while.
[02:29:47] And/or get interrupted, probably.
[02:29:53] Makes sense.
[02:30:20] I've done it with the smaller dumps many times.
[02:30:25] The current history, not the full history.
[02:30:51] I guess I don't understand the diff.
[02:31:19] Current history is the latest revision of every page.
[02:31:29] Oh, I gotcha.
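
To make the "read through them compressed" step concrete: a minimal sketch (not the dpaste script mentioned later) that streams one pages-meta-history .bz2 file from the shared dump mount and prints the title and revision ID of any article-namespace revision containing a phrase. The path and phrase are placeholders, and the schema URI is an assumption that changes between dump versions (check the <mediawiki> root element of the file):

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "/public/dumps/public/enwiki/20170101/enwiki-20170101-pages-meta-history8.xml-p001755616p001791079.bz2"
    PHRASE = "dank memes"
    MW = "{http://www.mediawiki.org/xml/export-0.10/}"

    with bz2.open(DUMP, "rb") as f:
        title, ns = None, None
        for _, elem in ET.iterparse(f):
            if elem.tag == MW + "title":
                title = elem.text
            elif elem.tag == MW + "ns":
                ns = elem.text
            elif elem.tag == MW + "revision":
                if ns == "0":  # article namespace only
                    rev_id = elem.findtext(MW + "id")
                    text = elem.findtext(MW + "text") or ""
                    if PHRASE in text:
                        print(title, rev_id)
                elem.clear()  # drop the revision subtree so memory stays flat
            elif elem.tag == MW + "page":
                elem.clear()

A date-range filter would work the same way, by checking each revision's <timestamp> element before looking at the text.
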
[02:31:29] Full history is all revisions of every page.
[02:31:40] Most people don't need/want the full histories.
[02:31:45] So how do bots do it?
[02:31:48] Just the current ("current") wikitext is enough.
[02:31:50] Do what?
[02:32:06] The XML dumps are always at least a week or two old.
[02:32:11] Such is the nature of producing large dumps.
[02:32:14] ClueBot and the like. Do they just churn through recent changes?
[02:32:47] Searching through real-time data is a different problem space than searching through historical data.
[02:32:47] One is huge and accumulates.
[02:32:47] The other is ephemeral.
[02:32:59] The short answer is that there's an IRC feed/stream of edits, though.
[02:33:33] Each wiki has a channel. Most anti-vandalism tools/bots that are real-time use that.
[02:33:33] You could also use api.php or the database replicas.
[02:33:33] Ah, same as Huggle..
[02:33:40] I mean, don't quote me on ClueBot's internals, but yeah.
[02:34:34] Fair enough.
[02:35:15] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Editing_trends/Raw_data/Revisions_per_day
[02:35:18] Is some older data.
[02:35:21] I guess I could update that.
[02:36:58] drewmutt: Anyway, what are you trying to do?
[02:37:06] You're still XYing me.
[02:37:23] XYing?
[02:37:39] The XY problem is asking about your attempted *solution* rather than your *actual problem*. http://meta.stackoverflow.com/a/66378
[02:37:59] Oh, sorry, sorry, I mentioned it earlier, but you may not have been in.
[02:38:13] [20:23] I'm trying to find revisions where the content contains a key phrase, if that helps.
[02:38:17] Hmm.
[02:38:24] [20:23] Namely 'dank memes' ^_^
[02:38:26] Oh, hah.
[02:38:44] Hehe, yeah, partially for my own amusement and to see trends in vandalism.
[02:38:45] Sorry, I missed that line.
[02:39:00] I'd skimmed most of the rest of it.
[02:39:14] MatmaRex is right about Quarry, the database replicas, and the text table.
[02:39:29] drewmutt: You're willing to write C for this?
[02:40:22] https://meta.wikimedia.org/wiki/Euphemisms <-- dumps.
[02:40:24] Yvette: Realistically not, but I could see other uses for wanting to sift through all the revisions, so it might be useful if it doesn't exist.
[02:40:40] You're the first person to ever want to search Wikipedia page histories.
[02:41:59] I'd be lying if I said I wasn't a little proud of that ^_^
[02:42:14] I'm kidding, of course.
[02:42:21] People have wanted similar tools since forever.
[02:42:34] Every researcher who looks at Wikipedia asks about this.
[02:42:46] People often want a variant on a per-page basis.
[02:43:01] A way to "blame" a page to see which revision introduced a particular string of text.
[02:43:07] Similar to how version control works.
[02:43:11] Yes, I have also wanted that.
[02:43:20] I used a blame tool that was around somewhere.
[02:43:23] Or sometimes people take it a step further and want to make every wiki page a Git repo.
[02:43:26] Or whatever.
[02:43:30] Yeah, some exist.
[02:43:50] Anyway, the dumps are on the file system and will work for this, if you're interesting.
[02:43:55] Interested, rather.
[02:44:09] I don't think finding "dank memes" in every Wikipedia revision is quite up to the level that I'd want to help out.
[02:45:02] Heh, I don't blame ya. I thought I could bang out a quick SQL for it, but much more investment than that, meh.
[02:45:29] https://dpaste.de/9M2q/raw is the ratchet script I used for enwiki-20161220-pages-meta-current.xml.bz2 the other week.
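
On the per-page "blame" idea: for a single page it can be approximated against api.php rather than the dumps, by walking the revision history oldest-first and stopping at the first revision whose wikitext contains the string. A rough sketch, assuming Python with requests; the page title and phrase are placeholders, and this gets slow for pages with long histories:

    import requests

    # Hypothetical single-page "blame": find the first revision containing a phrase.
    API = "https://en.wikipedia.org/w/api.php"
    TITLE = "Internet meme"   # placeholder page title
    PHRASE = "dank memes"

    params = {
        "action": "query",
        "prop": "revisions",
        "titles": TITLE,
        "rvprop": "ids|timestamp|content",
        "rvdir": "newer",     # oldest revisions first
        "rvlimit": 50,        # content requests are capped at small batches
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        for rev in page.get("revisions", []):
            if PHRASE in (rev.get("*") or ""):
                print("first appears in rev", rev["revid"], "at", rev["timestamp"])
                raise SystemExit
        if "continue" not in data:
            break
        params.update(data["continue"])
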
[02:45:43] pages-meta-history is probably different and would require some tweaks.
[02:45:56] Plus you gotta figure out how to glue/concatenate all these files together.
[02:46:10] Which pages-meta-current doesn't really require.
[02:46:33] Oh, I had one other thing (which is much less silly).
[02:47:44] base36 SHA-1. I mean really.
[02:48:18] Heh.. So I was using the api, and I believe I was getting "recentchanges" and there's a user exclude field..
[02:48:29] I can dig it up, unless you know what I'm talking about.
[02:48:39] I know everything.
[02:48:42] I run Wikimedia.
[02:49:21] > rcexcludeuser
[02:49:21] Don't list changes by this user.
[02:49:37] Yep, is there a way to pass multiple into that baby?
[02:49:49] https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brecentchanges
[02:49:53] Probably not.
[02:50:09] The more filtering you apply, either you get fewer results.
[02:50:20] Or it becomes a lot more expensive to provide X results.
[02:50:26] Since it requires more scanning.
[02:50:42] But if you want to just search recentchanges, that table is available on Tool Labs and via Quarry.
[02:50:54] So you could filter out many users in SQL.
[02:51:15] rc_user_text is the field for user names, including IP addresses. Or rc_user for user IDs.
[02:51:17] Yeah, I was doing it as a fork of RTRC.
[02:51:36] So it's already tied in via API.
[02:51:48] https://www.mediawiki.org/wiki/Manual:Recentchanges_table
[02:51:50] All right.
[02:52:31] Really, I want to exclude every user with the name "dank memes" (kidding kidding)
[03:37:24] drewmutt: don't feed the troll
[03:38:15] I know, I've normally been good with it. Can you tell the other guys too, plz?
[06:17:58] [18:45:03] Heh, I don't blame ya. I thought I could bang out a quick SQL for it, but much more investment than that, meh. <-- fwiw I think there are Wikipedia dumps imported into the public BigQuery stuff
[06:42:33] Is there public BigQuery stuff?
[06:51:40] Yvette: the NFS mount is slow, so it's much faster to pipe from the 7z. Less data to read and faster decompression makes for a decent throughput
[07:18:32] Sigh, when you can read files at 10 Mbps, no solution will be fast enough
[07:26:35] Nemo_bis: 7z doesn't use the NFS mount?
[07:32:18] "Less data to read and faster decompression"
[07:34:43] 116 GB vs. 755
[20:13:38] I guess I read your clauses as dependent.
[20:13:59] I know how to read bzip2 files, I don't really know how to deal with 7z files.
[20:14:03] 7zip?
[20:14:07] Whatever it is.
[20:14:26] Plus I guess the xml is divided into .1, .2, etc.?
[20:17:36] No, just the same XMLs 7zipped one by one
[20:17:49] You just do 7z e -so
[20:18:28] Labs NFS is so slow that I often compress even smaller text files with LZMA, so that I can grep them more quickly
[20:39:57] Heh.
[20:40:15] 7z e[xtract] -so[standard out] ?
[20:40:26] I guess that's easy enough to read in a script.
[20:40:47] I've been using Python to do the file opening and decompression, but a shell binary works just as well, presumably.
[20:41:03] I've used bzcat | in the past.
[20:41:31] Nemo_bis: Have you done a scan of enwiki full history before?
[20:42:13] Yvette: yes, but NFS seems to have gotten slower
[20:42:55] I wonder when Labs will move to AWS.
[20:43:25] When I really need to process some data, I download the dumps from your.org on some other server
[20:43:51] And the other server does the processing?
[20:44:10] Or even a home computer
[20:44:20] Sounds like a lot of local bandwidth needed?
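
The "7z e -so ... easy enough to read in a script" idea looks roughly like this in Python: run 7z as a subprocess with the extracted XML on stdout and hand that stream to whatever parser you are using (for instance the iterparse loop sketched earlier). The .7z path is a placeholder; 7z writes its banner and progress to stderr, so stdout carries only the decompressed data:

    import subprocess

    # Sketch: stream a .7z dump via `7z e -so` instead of opening it in Python.
    DUMP = "/public/dumps/public/enwiki/20170101/enwiki-20170101-pages-meta-history8.xml-p001755616p001791079.7z"  # placeholder
    PHRASE = "dank memes"

    proc = subprocess.Popen(
        ["7z", "e", "-so", DUMP],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # 7z's banner/progress goes to stderr
    )
    # proc.stdout is a file-like object; it could be passed to ET.iterparse()
    # exactly like the bz2 handle above. A crude line scan for illustration:
    for raw in proc.stdout:
        line = raw.decode("utf-8", errors="replace")
        if PHRASE in line:
            print(line.strip()[:120])
    proc.stdout.close()
    proc.wait()
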
[20:44:41] Dunno, I can download at 100 Mbps or more, which is 1-2 orders of magnitude more than I/O on Labs :)
[20:44:52] That's fast!
[20:44:59] Home residential connection?
[20:45:19] Yes... nowadays new contracts have 1000 Mbps
[20:45:36] Not around here, I don't think.
[20:45:55] Not for standard residential, anyway.
[21:03:46] I just upgraded my wiki from 1.24 to 1.28, and a mod I had developed to allow XenForo users to log in (and have an account made for them) using their XenForo credentials has broken because of large changes.
[21:03:59] What is the correct way to implement this in 1.28? I know there are serious changes to how auth works now.
[21:33:39] KiwiNull: https://www.mediawiki.org/wiki/Manual:SessionManager_and_AuthManager
[21:34:13] You want #mediawiki btw (but the people most experienced with AuthManager happen to be in this channel too)
[21:53:29] Nemo_bis, what's the best place or hook to set this auth manager variable?
[22:29:55] I updated https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Editing_trends/Raw_data/Revisions_per_day
[22:57:13] Speaking of LZMA, this is shocking: http://packages.ubuntu.com/yakkety/xz-utils
[22:57:53] Which part is shocking?
[22:58:04] 5.1
[22:58:19] I've been enjoying multi-threaded LZMA on Fedora for... years now?
[22:58:30] (now at 5.2.2)
[22:58:45] Latest is 5.2.3, I guess.
[23:00:06] where does it mention multi-threading?
[23:00:45] https://sourceforge.net/p/lzmautils/discussion/708858/thread/d37155d1/#d8af
[23:01:02] We've discussed this a few years ago on xmldatadumps-l
[23:03:35] 2012 feels just like yesterday.
[23:03:57] Platonides: https://en.wikipedia.org/wiki/Xz#Implementation
[23:05:11] Also, http://git.tukaani.org/?p=xz.git;a=blob;f=NEWS;hb=HEAD
[23:05:19] > Added support for multi-threaded compression.
[23:05:22] Under 5.2.0.
[23:09:48] I think I remember that discussion
[23:10:10] oh, they incorporated pxz into main xz
[23:55:16] there's a bit of everything
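
On the LZMA tangent: if the files you want to grep through are xz/LZMA-compressed (as in the "compress even smaller text files" approach mentioned earlier), Python's lzma module reads them the same way bz2 does, so the scanning loops above apply unchanged. A tiny sketch with a placeholder filename; note that the multi-threading added in xz 5.2.0 concerns the xz tool's compression side, not this read path:

    import lzma

    # Placeholder filename; any .xz text file works the same way.
    with lzma.open("some-wikitext-log.txt.xz", "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if "dank memes" in line:
                print(line.rstrip())
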