[02:08:14] Nudin, it wouldn't be on my own computer of course :p. Basically, I've been using wikidata to test something for work, turns out it works very well, so now we want to deploy it to prod, but of course we don't want to hit the wikidata sparql endpoint with that...
[02:18:20] Would I need to be able to fit the whole wikidata dump in RAM to use these?
[02:27:14] hello, I have a question
[13:00:28] Nazral: no, you don't need to fit the whole dataset in memory. For example, the servers behind query.wikidata.org currently use ~600GB of disk space for the data and have 128GB of memory.
[13:06:46] nikki: I'm playing around a bit with the RDF URI thing now that we're finally in lod-cloud.
[13:06:52] Wait, the link https://lod-cloud.net/dataset/wikidata is broken now
[13:07:19] So someone did an update and Wikidata is gone again
[13:08:07] oh no! :'-(
[13:22:55] gehel: nice, and how many requests per day does it process, if that's not indiscreet?
[13:23:54] Nazral: around 50 req/s, but there is so much variance in the cost per request that this really does not mean much.
[13:24:10] have a look at https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1 if you want some metrics
[13:24:15] multichill: is that what the duplicate formatter urls were for, or was that just a mistake?
[13:24:42] Duplicate formatter urls? Do you have an example?
[13:25:29] gehel: thank you very much
[13:25:57] nikki: Oh wait, I think my search was set to case sensitive, so I might have missed some properties that already had it
[13:26:18] gehel: the different wdqs are different machines, correct?
[13:27:04] Nazral: no problem! Ping me if you need help. We have some documentation about setting up your own wdqs endpoint: https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation, https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual and https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation/Standalone are probably good starting points
[13:27:25] Nazral: "different wdqs", what do you mean?
[13:27:50] Nazral: What do you actually want to achieve? It all seems to be very meta
[13:27:59] Oh, in the graphs? Yes, they are the different servers
[13:28:38] multichill: just reproduce wikidata's sparql endpoint on our side so we don't hit the servers of wikimedia
[13:28:46] also note that we have different clusters (drop-down on the top left); the public endpoint is wdqs, but we have other clusters with less traffic
[13:28:49] multichill: on https://www.wikidata.org/wiki/Property:P1014 for example
[13:29:38] Yeah, my bad
[13:31:24] ok :)
[13:31:45] gehel: ok so I do have a couple more questions :p. If I remove many languages from the db (using the -l option of the munge script), will that improve performance when replying to queries?
[13:31:47] I was a bit confused by it, had to squint at my screen to check they weren't slightly different :P
[13:32:11] Something different: if you look at the source of pages like https://www.museodelprado.es/en/the-collection/art-work/death-of-the-virgin/6ebfe544-41dd-44ac-a217-d7ba24fc0d48 it seems to contain semantic data
[13:32:33] What format is that? Things like
[13:32:34] Nazral: it will definitely use less storage space; the impact on query cost / performance is probably trivial
[13:32:54] atm some of my queries take some minutes to complete and I would like to get that down to, say, 20s. I imagine more CPU is more important than more RAM to get there, right?
[13:33:20] (the hardest ones are the queries that try to look up "all people with some given property"
[13:33:24] )
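A minimal sketch of the client side once a standalone endpoint (per the Implementation/Standalone docs gehel links above) is running. The endpoint URL is the default from those docs, and the SPARQLWrapper package and the example query are illustrative choices, not anything from this discussion:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Assumed default SPARQL endpoint of a local standalone WDQS install;
    # adjust host/port/namespace to your own setup.
    ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX wd:  <http://www.wikidata.org/entity/>
        PREFIX wdt: <http://www.wikidata.org/prop/direct/>
        SELECT ?human WHERE { ?human wdt:P31 wd:Q5 } LIMIT 5
    """)

    # query() sends the request; convert() decodes the JSON response.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["human"]["value"])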
[13:33:41] or to rephrase: what Python library can I feed this to, to do something useful with it? :-)
[13:34:23] Nazral: Re-ordering the query might sometimes make it faster. So first the property and then the instance of human.
[13:34:24] Actually, to have them not time out, I restrict the range of the age of the people
[13:34:45] but ideally I don't want anything like this
[13:34:52] Basically the most restrictive triple first
[13:34:57] Nazral: I suspect that IO is probably the bottleneck, so faster (or more) disks might help.
[13:35:02] Ok
[13:35:11] isn't that RDFa?
[13:35:14] multichill: I can try it
[13:35:20] thanks
[13:35:28] Nazral: and there are probably ways to improve the query itself, but that's outside of my area of expertise
[13:35:45] the source says "XHTML+RDFa 1.0" so I guess so
[13:36:16] Nazral: I'm only guessing, CPU / RAM might help as well. The only way to know is to test...
[13:37:00] https://pastebin.com/YLv5tZbA obviously the %s get replaced by the entity code/language
[13:37:11] gehel: right :p
[13:37:22] Thanks for all the help!
[13:39:20] Nazral: you can have a look at the resource utilization of one of our servers on https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=wdqs1004&var-network=eno1
[13:40:43] Nice
[13:40:46] how many cores?
[13:41:29] 16 hyperthreaded cores
[13:42:05] I didn't imagine you received so many requests!
[13:43:23] Nazral: it all depends on what you compare it to :) At the Wikipedia level, this is a very small number of requests. But a fairly complex service to keep stable
[13:43:30] multichill: https://www.w3.org/2012/pyRdfa/ is able to extract stuff if I paste in the source code (doesn't work when I enter the url, for some reason); it seems to be based on a python package called "pyrdfa"
[13:44:57] gehel: I just assumed that not so many people would use this kind of service
[13:45:23] You worry me a bit saying that it's difficult to keep stable :p
[13:45:28] Nazral: there are a number of bots, which generate most of the traffic.
[13:46:01] Eh...
[13:46:31] The nature of WDQS is very similar to exposing a raw SQL endpoint to the internet. We allow anyone to run any kind of query, including complex and expensive ones.
[13:46:56] Running an internal endpoint, where you can predict the kind of traffic you are going to receive, is much easier.
[13:47:20] Good point
[13:50:37] nikki: Thanks for being my rubber duck. Something goes wrong with the https on the Prado website, so if I just download a page I can parse it with the standard rdflib library
[13:51:45] :)
[13:57:17] nikki: What kind of URI would you add to https://www.wikidata.org/wiki/Property:P5321 ?
[13:58:02] For the formatter we now have https://www.museodelprado.es/coleccion/artista/whateverwelinktoaddfortheslug/39568a17-81b5-4d6f-84fa-12db60780812
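Following up on the Prado/RDFa thread: a sketch of the download-then-parse approach multichill describes at [13:50:37]. It assumes an rdflib version that ships the pyRdfa-based "rdfa" parser (which also needs html5lib installed), and requests is just one convenient way to fetch the page:

    import requests
    from rdflib import Graph

    # Fetch the page ourselves, since feeding the URL straight to the
    # pyRdfa service reportedly fails on the https side.
    url = ("https://www.museodelprado.es/en/the-collection/art-work/"
           "death-of-the-virgin/6ebfe544-41dd-44ac-a217-d7ba24fc0d48")
    html = requests.get(url).text

    g = Graph()
    # publicID gives relative URIs in the RDFa a base to resolve against.
    g.parse(data=html, format="rdfa", publicID=url)

    # Dump every extracted triple.
    for subj, pred, obj in g:
        print(subj, pred, obj)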
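And to make multichill's "most restrictive triple first" advice at [13:34:52] concrete, here is a sketch of the two orderings for an "all people with some given property" query, using P5321 (the Prado property discussed just above) as an illustrative rare property. One caveat: Blazegraph's optimizer often reorders joins on its own, so per gehel, the only way to know is to test:

    PREFIXES = """
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    """

    # Slow shape: enumerate every human first, then check the rare property.
    SLOW = PREFIXES + """
    SELECT ?person ?id WHERE {
      ?person wdt:P31 wd:Q5 .    # huge: millions of humans
      ?person wdt:P5321 ?id .    # rare property checked last
    }
    """

    # Faster shape: start from the rare property, then confirm instance-of-human.
    FAST = PREFIXES + """
    SELECT ?person ?id WHERE {
      ?person wdt:P5321 ?id .    # most restrictive triple first
      ?person wdt:P31 wd:Q5 .    # cheap check on a small intermediate set
    }
    """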