[15:22:44] https://www.wikidata.org/wiki/Q36020#P1472 warns Commons link should exist, but it does exist!
[15:25:59] to the devs thx!
[19:38:29] Hi guys, did anyone try to parse the 780 GB JSON dump or am I doing something wrong? I can't even count the number of lines
[19:40:21] datasc: it's... pretty brutal cruising through it
[19:40:28] I don't even think I tried the full dump
[19:41:13] Yeah... the problem is that it contains a bunch of useless elements
[19:41:58] datasc: what I do is actually use sparql to get a base set of elements and then request other statements as necessary
[19:42:14] it's muuuuuch faster than trying to crawl the dump - and yes, I have limits in place to respect the TOS
[19:42:28] It could be a LOT faster if I didn't, but that's obviously terrible behavior
[19:42:59] (I also cache statements, so I don't request them over and over again)
[19:43:10] I didn't try the query service
[19:43:22] datasc: what kind of information are you digging for?
[19:44:36] Virtually all information about the world. Demographics, resources, political situations etc. For example I would first like to query all cities in the world
[19:44:49] Is that possible?
[19:45:33] Is Wikidata the right dataset for my usage? For example, I would like all cities that are still active in our current world
[19:45:45] ?
[19:46:20] Hmm, not sure about "still active" but I imagine you could get a pretty good sample set of cities. I don't know about "all cities," because *someone* would have to create that data
[19:47:21] Look up carthage and new york city, see what the statements say about the two
[19:48:16] of course, there are lots of cities called "carthage" these days
[19:49:58] Polysemy could be a problem but some errors don't bother me
[19:50:31] I'll just try to fetch all instances of city
[19:51:33] *nod* I'd use the query service - you're going to have much more manageable datasets that way
[19:51:49] If memory serves, the TOS is 6 requests at a time, so ...
[19:52:20] (I have been using that as a limit for a while now and do not get the "server busy" throttle, so I think that's applicable)
[19:52:52] I have a retry set up just in case I get the server busy response but it hasn't fired off for a long time
[19:53:26] 6 requests at a time? But is one big request valid?
[19:53:43] yes
[19:54:20] my process is to request all organizations (businesses) which gives a 240000-line response... I then trawl through those, looking for changes
[19:54:21] Ok so no problems, let's try
[19:54:47] Oh ok, yeah I don't need to look for changes for my problem
[19:55:23] How much time does it take for your request to complete?
[19:55:33] that one? 22-24 sec
[19:56:48] Ok, I'll make my request and give it a try, thanks for the help
[19:57:03] the full runtime is a lot longer, because I have to look for connected edges too with the TOS limit, but caching helps (there's a lot of common statements!)
[20:00:45] devs should probably try to make a better json dump, only with english labels, less information and more interesting instances. Or one dump per instance
[20:01:13] well, the labels aren't THAT much of the data...
[20:02:35] each property has a hash, which isn't useful for many applications
[20:12:18] datasc: I am familiar, yeah. I actually filter out the non-english labels myself too.
[21:32:44] Well I can't even reach 10k cities in one query
[23:54:38] Hi, I'm quite new to SPARQL and am trying to query a csv of cities with headers "city,country,population". However, I was only able to make the country display as "wd:Q12345". How do you make the query display the country's name? https://w.wiki/4vt
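Note on the last question: the usual way to get a name instead of "wd:Q12345" is to ask the query service's label service for a `?countryLabel` variable. The sketch below is one possible shape, not the query behind the linked short URL. It assumes Q515 (city), P17 (country) and P1082 (population) are the right identifiers for this purpose, matches only items typed directly as city (no subclasses), and uses hypothetical helper names (`build_cities_query`, `run_query`, `bindings_to_csv`). The User-Agent string is a placeholder you would replace with your own.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_cities_query(limit=1000):
    """Return a SPARQL query selecting city, country and population labels.

    Assumption: only direct instances of city (Q515) are matched, to keep
    the query light. The label service fills in every ?xLabel variable.
    """
    return f"""
SELECT ?cityLabel ?countryLabel ?population WHERE {{
  ?city wdt:P31 wd:Q515 ;
        wdt:P17 ?country ;
        wdt:P1082 ?population .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


def run_query(query, user_agent="example-bot/0.1 (placeholder contact)"):
    """Send the query to the public endpoint and return the result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def bindings_to_csv(bindings):
    """Turn SPARQL JSON bindings into CSV text with the requested headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["city", "country", "population"])
    for row in bindings:
        writer.writerow([
            row["cityLabel"]["value"],
            row["countryLabel"]["value"],
            row["population"]["value"],
        ])
    return buffer.getvalue()
```

Typical use would be `print(bindings_to_csv(run_query(build_cities_query(100))))`. A city with several population statements can come back as several rows, so some deduplication may still be needed.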
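Note on the workflow described earlier in the log (cap on parallel requests, retry on "server busy", statement cache): a minimal sketch of that pattern could look like the following. It is not the chatter's actual code. The limit of 6 is the figure quoted from memory in the chat and is not verified here, `ServerBusy` is a hypothetical exception that your own fetch function would raise when throttled, and `fetch_all` / `fetch_with_retry` are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Figure quoted from memory in the chat; treat it as an assumption.
MAX_PARALLEL = 6


class ServerBusy(Exception):
    """Hypothetical error raised by a fetch function when it is throttled."""


def fetch_with_retry(fetch, key, retries=3, delay=1.0):
    """Call fetch(key), backing off and retrying while the server is busy."""
    for attempt in range(retries + 1):
        try:
            return fetch(key)
        except ServerBusy:
            if attempt == retries:
                raise
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            time.sleep(delay * (2 ** attempt))


def fetch_all(keys, fetch, cache=None, max_parallel=MAX_PARALLEL,
              retries=3, delay=1.0):
    """Fetch every key at most once, never running more than max_parallel
    requests at a time, and reuse anything already in the cache."""
    cache = {} if cache is None else cache
    # dict.fromkeys drops duplicates while keeping the original order.
    missing = [k for k in dict.fromkeys(keys) if k not in cache]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(
            lambda k: fetch_with_retry(fetch, k, retries, delay), missing
        )
        for key, value in zip(missing, results):
            cache[key] = value
    return {k: cache[k] for k in keys}
```

Passing the same `cache` dict to later calls is what avoids requesting common statements over and over; persisting it to disk between runs would be a natural next step.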