[18:17:40] for those of us who are Phabricator-illiterate, could someone summarize the status of what's not working?
[18:30:08] Rhododendrites_, someone has done the equivalent of suggesting a change be made on an article talk page. What's needed now is for that person or someone else to implement that change in the code itself, then have it tested and reviewed
[18:51:44] ok, thanks AntiComposite. Hopefully soon. It seems to have been broken for several days now
[19:28:05] Hello, what is the status of external parsers? https://www.mediawiki.org/wiki/Alternative_parsers is pretty sad to read :< How can I help?
[19:28:13] I am especially interested in wikitext-to-plain-text conversion. https://dizzylogic.com/wiki-parser looks cool, but it has no command-line interface AFAIK and is not maintained
[21:17:51] maxzor: the status is always the same, don't make them
[21:18:17] Nemo_bis, :<
[21:18:35] I am trying to fiddle with Postgres full-text search on big tables
[21:18:54] How is that related?
[21:19:01] importing the dumps, and moreover getting rid of wikitext, is unnecessarily painful
[21:19:13] Sounds like a good reason to not do it. :)
[21:19:27] What's your chief goal?
[21:20:05] For now, playing with a large dataset on Postgres. Later, maybe allowing simpler pipelining from WMF to Postgres
[21:20:09] for others
[21:20:58] Pipelining from WMF? What does that mean?
[21:21:21] from dumps.wikimedia.org
[21:21:35] For what purpose?
[21:21:52] It sounds like you only need a text corpus, any text corpus, to test some Postgres features, is that right?
[21:23:01] Because if so, maybe MOSES can help you: http://www.statmt.org/moses/?n=Moses.LinksToCorpora
[21:23:09] yes
[21:23:46] Wikimedia wikis are certainly not the easiest text corpus out there to use.
[21:24:00] Unless you need some very specific features which are not found elsewhere, just use something else.
[21:24:40] I find it sad that the biggest public corpus is not easier for people to work with.
[21:24:45] Thank you for the link
[21:25:30] The fact that wikitext is broken but works is not a reason to not fix it, IMO
[21:25:53] Unfortunately, wikitext was not designed, if I dare use such a word, to be parsed
[21:26:15] There's nothing sad about it. Wikitext is designed for its purpose.
[21:26:26] Secondary purposes are secondary.
[21:26:48] You cannot have something that is both infinitely flexible and very easy to use.
[21:27:11] I nevertheless stand by my adage. https://bash.toolforge.org/quip/AU7VVyPV6snAnmqnK_0N
[21:27:49] We do of course like the idea of providing plain-text dumps: https://phabricator.wikimedia.org/T161773
[21:28:45] But if you just need a few GB of text, whatever it is, to fill some DB, there are very good alternatives.
[21:29:29] If you have more specific needs like parallel corpora or structured lexical data, then it's another matter.
[21:29:30] Nemo_bis, I totally agree Wikipedia may not currently be the best for a corpus; heck, a few days ago I crawled some epub sites and converted them to plain text, and it was easier
[21:30:38] maxzor: of course it's easier. EPUB is HTML, so it's pretty much the easiest thing to convert into plain text.
[21:30:50] Have you considered starting from the HTML/ZIM dumps?
[21:31:25] Maybe even the 2008 dumps suffice. https://dumps.wikimedia.org/other/static_html_dumps/2008-03/
[21:32:28] If you want just English, there's https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_nopic_2019-12.zim
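
If you start instead from the regular pages-articles XML dump, the wikitext-to-plain-text step the thread keeps circling around could look roughly like the sketch below. It assumes the third-party mwxml and mwparserfromhell Python packages (neither is mentioned in the chat) and an illustrative dump filename; note that strip_code() simply drops templates, so anything they would render (infoboxes and the like) is lost.

# Hedged sketch: wikitext to plain text over a pages-articles XML dump.
# Assumes the third-party mwxml and mwparserfromhell packages
# (pip install mwxml mwparserfromhell); the dump filename is illustrative.
import bz2

import mwparserfromhell
import mwxml

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # illustrative, not from the chat

def plain_text_pages(dump_path):
    """Yield (title, plain_text) for main-namespace pages in the dump."""
    dump = mwxml.Dump.from_file(bz2.open(dump_path, "rt", encoding="utf-8"))
    for page in dump:
        if page.namespace != 0:
            continue
        for revision in page:  # pages-articles dumps carry one revision per page
            wikicode = mwparserfromhell.parse(revision.text or "")
            # strip_code() removes markup and templates; template output
            # does not appear in the result.
            yield page.title, wikicode.strip_code()

if __name__ == "__main__":
    for title, text in plain_text_pages(DUMP_PATH):
        print(title, len(text))

Parsing the ZIM/HTML dumps instead, as suggested above, sidesteps template expansion entirely at the cost of an HTML-to-text step.
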
[21:33:34] I am still new to the MariaDB model, but AFAIK infoboxes, which provide valuable relational data, are treated as NoSQL and molded into the text table's old_text column. Just some of the pesky things that come to mind and that hamper the project, IMO
[21:33:45] (The additional advantage of using the ZIM dumps is that if you find a bug you can fix it in Parsoid, without reinventing a parser.)
[21:34:25] I'm not sure what you're talking about. Infoboxes have no role whatsoever in the database schema.
[21:35:02] If you're now talking about extracting structured data from wikitext, or parts thereof like the infoboxes, it sounds like DBpedia: https://wiki.dbpedia.org/
[21:47:54] Nemo_bis, thank you for the links!
[21:49:21] There were attempts at formalizing a wikitext grammar; unfortunately this never became real: https://www.mediawiki.org/wiki/Parsoid/Roadmap/2014_15
[21:50:38] https://www.mediawiki.org/wiki/Wikitext_standard
[21:51:16] Pretty much archaeology; this should be a much higher priority IMO
[21:51:37] it's not an easy task
[21:51:46] 90% of such a grammar is easy
[21:51:51] then you start finding corner cases
[21:51:54] Spending a 70k grant on DBpedia is somewhat a symptom of technical debt
[21:52:00] used on hundreds of pages
[21:55:13] 70k is a relatively small amount compared to the many millions needed to make sense of wikitext.
[21:55:51] the biggest problem with parsing wikitext into another format is that the parser has to understand core wikitext, extensions, and HTML markup.
[21:55:53] :)
[21:57:02] which is why the actively maintained parsers parse into HTML or a syntax tree.
[22:16:17] What is your opinion on Markdown?
[22:17:40] Ah, I guess it is vastly insufficient for Wikipedia
[22:21:43] I'm currently thinking with the eyes of a DB guy. I would do it this way: store separately (A) the plain text, (B) the structure, and (C) the data connectivity. Keep wikitext for user editing, and Parsoid for HTML. You probably lose some flexibility, but your data is not sick :)
[22:22:11] if only you had a time machine :)
[22:28:59] I remember having rewritten the English UAV article 5 years ago, and I can still feel the pain of wikitext templates...
[22:32:05] Well, in 5 years things have changed quite a bit; nowadays you wouldn't need to know templates for most of the editing.
[22:32:49] As for offloading some jobs from wikitext, that's basically what Wikidata is about.
[22:33:13] TemplateStyles and Lua too, if you believe those are a large part of wikitext's complications.
[23:01:39] What is Wikidata's underlying database management system?
[23:01:47] Same as the rest of Wikimedia?
[23:12:51] yes, but with some additional tables for the Wikibase extension
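
Tying the thread back to maxzor's stated goal, here is a hedged sketch of Postgres full-text search over the extracted plain text. The database, table, and column names are illustrative rather than anything from the conversation; it assumes the psycopg2 driver and PostgreSQL 12 or later (for the generated tsvector column).

# Hedged sketch: full-text search over extracted plain text in Postgres.
# Database, table, and column names are illustrative; assumes psycopg2
# (pip install psycopg2-binary) and PostgreSQL 12+ for the generated column.
import psycopg2

conn = psycopg2.connect(dbname="corpus")  # assumed local database
cur = conn.cursor()

# One row per page: roughly the "(A) plain text" part of the split proposed at 22:21:43.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_text (
        page_id bigint PRIMARY KEY,
        title   text NOT NULL,
        body    text NOT NULL,
        tsv     tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
    )
""")
cur.execute("CREATE INDEX IF NOT EXISTS page_text_tsv_idx ON page_text USING gin (tsv)")
conn.commit()

# Rank pages matching a free-text query.
cur.execute("""
    SELECT title, ts_rank(tsv, query) AS rank
    FROM page_text, plainto_tsquery('english', %s) AS query
    WHERE tsv @@ query
    ORDER BY rank DESC
    LIMIT 10
""", ("unmanned aerial vehicle",))
for title, rank in cur.fetchall():
    print(f"{rank:.3f}  {title}")

cur.close()
conn.close()

A generated column keeps the tsvector in sync automatically; on older PostgreSQL versions, a trigger or an expression index on to_tsvector('english', body) does the same job.
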