[11:16:43] Hi! I'm about to go dig through why pywikibot is choking on a unicode escape sequence in a title on a wiki. For some reason the wiki I am working with is emitting a \ufffd in its API response, which seems bad, but I am ok with ignoring it for my purpose
[11:16:59] is there some trick to do this, or do I have to patch the bot framework?
[11:17:43] for context, what I'm trying to do is scrape a site, and the pymediawiki package was not really looking sufficiently powerful for my purposes as I'd have to write custom requests for most of the things I want to do
[11:22:32] In case it's of help, it's this wiki: https://cptdb.ca/wiki/index.php/Special:AllPages?from=&to=&namespace=14, but the problem page isn't appearing except via the API, I'm guessing as a redirect perhaps?
[11:22:50] the bad page is "Category:Agence Métropolitaine de Transport"
[11:27:09] (to reproduce, create the family for the site and a Site object, then do `list(site.allcategories())`)
[12:13:14] Graypup_: this is a question for #mediawiki . The most likely answer is some kind of misconfiguration on the wiki itself, are you a sysadmin of it? I see it was created in 2007 https://cptdb.ca/wiki/index.php?oldid=1
[12:15:24] Hm, looks like I might be missing it in https://archive.org/details/wikiteam ; let me see if I manage to dump it now
[12:16:49] It's not uncommon to have issues with unicode normalisation of titles
[12:17:15] I am not the sysadmin. I can try to poke them in a few days, and for now I've hacked pywikibot to ignore the broken articles. But there's some more fun brokenness with the API where the category members endpoint appears to always return an empty list, so I might just have to screen scrape this part for the minute :(
[12:25:26] Graypup_: what are you trying to do?
[12:29:53] Nemo_bis, machine learning project for class (I wish I could conscientiously object but it's a degree requirement), and I need a corpus. Thought it would be fun to train an image recognition model to recognize different bus models.
[12:30:45] Graypup_: so you just need the standard XML dump?
[12:30:49] hm, perhaps!
[12:31:03] https://github.com/WikiTeam/wikiteam can do it for you, --xmlrevisions will bypass titles
[12:31:44] If you want to do something serious in academic style, use my dataset https://archive.org/details/wikia_dump_20200214
[12:32:06] thanks! I will give that a shot.
[12:32:56] A predecessor of that dataset was used for https://doi.org/10.1111/jcom.12082 ; https://wiki.communitydata.science/Publications can give you some ideas
[12:36:41] Graypup_: if you end up using any of this I'd be curious to read what you end up writing, so I'd be happy if you archived it (under cc-by-sa or other free license ideally) at https://zenodo.org/ :)
[12:37:51] ah, it's just an assignment, not a paper. I don't think stuffing a bunch of photos of buses into an image recognition model and doing some transfer learning is publication worthy on its own
[12:42:47] Ah ok, you only need the images?
[12:43:10] The biggest wiki of that kind that I found is HK buses on Wikia, that's a hell of a ride to dump
[12:45:33] Hmm not sure it was this one https://archive.org/details/wiki-hkbuswikiacom
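(Note on the reproduction mentioned at 11:27:09: a minimal sketch, assuming a family file named `cptdb` has already been generated with pywikibot's generate_family_file.py — the family name and language code here are guesses, not confirmed in the log.)

```python
import pywikibot

# Assumes generate_family_file.py was pointed at https://cptdb.ca/wiki/ and
# produced a family called "cptdb"; adjust code/fam to match your setup.
site = pywikibot.Site(code='en', fam='cptdb')

try:
    cats = list(site.allcategories())
    print(len(cats), 'categories fetched')
except Exception as exc:
    # The title containing U+FFFD (the Unicode replacement character)
    # surfaces as an error while iterating the allcategories listing.
    print('allcategories iteration failed:', exc)
```

(Skipping individual bad titles from the caller's side isn't straightforward, since the error aborts the listing generator, which is presumably why patching pywikibot itself was needed here.)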
[12:51:52] Nemo_bis, I need the pages as a first step so I can get good labels for the images, at least for this wiki. It only has the fleet numbers on the images themselves, and I'd need to parse the tables off the pages to get the full details. These tables seem not to use a template, so I may be in for a wild ride parsing them. They seem to be in a very consistent format, thankfully.
[13:03:51] got it
[14:18:18] Hi all, I'd like to eventually migrate locator-tool from OAuth 1 / Python / Flask to OAuth 2 / JavaScript. Could someone please approve https://meta.wikimedia.beta.wmflabs.org/wiki/Special:OAuthListConsumers/view/19fa35277a8db0be5742d1b32c5f7083 for initial tests? Thanks in advance!
[14:44:55] done simon04
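(Note on the categorymembers issue from 12:17:15: a quick way to check whether the empty result comes from the wiki's API itself rather than from pywikibot. The api.php location below is assumed, not confirmed in the log.)

```python
import requests

# Assumed endpoint; check Special:Version on the wiki if api.php lives elsewhere.
API = 'https://cptdb.ca/wiki/api.php'

params = {
    'action': 'query',
    'list': 'categorymembers',
    'cmtitle': 'Category:Agence Métropolitaine de Transport',
    'cmlimit': 'max',
    'format': 'json',
}

resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()
members = resp.json().get('query', {}).get('categorymembers', [])
print(len(members), 'members:', [m['title'] for m in members])
```

(If this also comes back empty, the problem is on the wiki side — e.g. category links not populated — and screen scraping the category pages is a reasonable fallback.)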