[01:55:48] hi, is there a way to request two entries be merged? [03:06:06] hi, my sparql query returns a list of locations. Is there a way to find the CLOSEST WD entry to each one of them? [03:06:35] cc: SMalyshev :) [03:07:15] (without running a query once for every location) [07:34:30] yurik: I'm afraid there's not a good way without running queries once per location... [07:34:51] bummer [07:35:11] I found that service works with multiple locations, but i couldn't get it to limit to 1 per it [07:37:44] SMalyshev, what do you think about the idea about geo lookup based on the .map pages? [07:37:53] there is a volunteer who is looking into it right now [07:38:14] yurik: what exactly is the idea? [07:39:17] SMalyshev, https://phabricator.wikimedia.org/T179991 [07:39:28] i wonder if a function is better though [07:41:26] yurik: if somebody would implement this as patch to Blazegraph, I'm fine with adding it [07:41:42] downloading part is a bit concerning though [07:42:11] but since we do service callouts, I guess we can do the same for maps... [07:43:49] SMalyshev, do you think it would be better to do it as a service or as a function? [07:44:06] yurik: where is World_Countries_Outline.map on commons? [07:44:51] yurik: probably a service, function is something that converts value to value, generally (or multiple values to value) [07:45:01] this does not look functional [07:45:30] SMalyshev, https://commons.wikimedia.org/wiki/Data:Naturalearthdata.com/admin-0-countries-no-antarctica.map [07:46:14] why not functional - you give it geo point, and you get back a property from .map [07:46:21] its a lookup function [07:46:30] yurik: so I see there are features there in the map, which have "properties". Are those standardized? 
[07:46:37] yep [07:47:05] geojson standard specifies that each feature can have a properties object, with arbitrary content [07:47:32] WDQS could have a restriction of "simple values only", and ignore strange cases like subobjects and arrays [07:47:58] yurik: so the answer is "no" :) on the account of "arbitrary content" [07:48:08] arbitrary key-values :) [07:48:22] so yes - they are standard as a key-value lookup, but the content is object dependent [07:48:52] e.g. i can have iso country code attached to the countries objects, and you can have state codes attached to your state map [07:49:21] ok, I guess if features thing is part of geojson we can parse it [07:49:31] yep, exactly :) [07:49:54] and the service/lookup function can specify which key the user actually needs [07:50:36] but the lookup part would require digging in the guts of blazegraph... which I am not very familiar with [07:51:19] why? if we implement this as a custom function, we can simply use JTS geo library for the shape lookups [07:51:26] yurik: no, I don't think the function idea is going to fly.. functions are usually small and I don't think this should be a function. Service maybe [07:52:40] yurik: ah, you don't need to retrieve data only check if particular location is in or out? that may be done independently I guess [07:52:43] i tried to read up on the difference - what is the fundamental diff between the two? service seems to be too limiting for this, as the goal is to process a huge number of points very quickly (btw, the function can be made extremely fast in this case, because the map will be pre-parsed) [07:53:14] yurik: what you mean by "too limited"? [07:53:26] service is not limited by anything as far as I can tell [07:53:31] e.g.
you cannot do one per subquery as above [07:53:53] yurik: I'm afraid I do not understand what you just said [07:54:34] sec, first let's figure out the lookup part -- i think it's slightly more complicated than in/out -- given a point, you need to look at all of the geoshapes, and take the right property [07:54:50] so my steps would be to load the .map, parse it into an RTREE [07:55:10] and for each point, do a quick lookup in the rtree (should be very fast) [07:55:23] for each point that what? [07:55:34] that the rest of the query specifies [07:56:24] ?wd wdt:P625 ?loc . BIND(geof:maplookup('mapfile', ?loc, 'myproperty') as ?myprop) [07:56:46] I'm not sure how this would work though... if you have 10 invocations of the function at the same time, which one loads the map? which one builds the tree? [07:57:26] is there a way to hook into query parsing? [07:57:42] and preload the mapfile->rtree during that time? [07:57:48] functions that have side effects are not good... and I'm not very fond of functions doing long-running http requests either [07:57:56] that seems something that services do [07:58:34] yurik: query parsing won't help you. query parsing won't have data [07:59:13] i'm worried that services are not as flexible -- they don't follow the same patterns as the rest of the sparql query [07:59:23] and doing http on query parsing doesn't look like such a hot idea to me either. esp. as that runs in single thread (per request) as far as I can see [07:59:41] yurik: I still have no idea what you mean by "flexible" [07:59:57] yurik: service literally IS the same pattern [07:59:58] e.g. i couldn't get the service to give me one value per radius -- my initial question [08:00:10] the body of the service is graph pattern [08:00:29] so it's more of a subquery, which means it has to run first? [08:00:31] yurik: one value per radius? What you mean? [08:00:35] yep [08:00:46] given a list of locations, give me the nearest value to each [08:01:07] nearest value of what?
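A minimal sketch of the lookup steps outlined above (load the shapes once, then resolve each point to a property of the first shape that contains it), written in Java since Blazegraph is a Java codebase. Everything in it is hypothetical: MapLookup, addPolygon and lookup are illustrative names, not Blazegraph or WDQS APIs. It also swaps the R-tree for a bounding-box check plus a ray-casting point-in-polygon test over a plain list, and it ignores holes and multipolygons, so it only illustrates the shape of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: resolves a point to a property of the first shape containing it. */
class MapLookup {
    /** One parsed feature: outer ring as [lon, lat] pairs, its bounding box, its properties. */
    private static final class Feature {
        final double[][] ring;
        final Map<String, String> properties;
        double minLon = Double.POSITIVE_INFINITY, maxLon = Double.NEGATIVE_INFINITY;
        double minLat = Double.POSITIVE_INFINITY, maxLat = Double.NEGATIVE_INFINITY;

        Feature(double[][] ring, Map<String, String> properties) {
            this.ring = ring;
            this.properties = properties;
            for (double[] p : ring) {
                minLon = Math.min(minLon, p[0]);
                maxLon = Math.max(maxLon, p[0]);
                minLat = Math.min(minLat, p[1]);
                maxLat = Math.max(maxLat, p[1]);
            }
        }

        boolean contains(double lon, double lat) {
            // Cheap bounding-box rejection before the exact test.
            if (lon < minLon || lon > maxLon || lat < minLat || lat > maxLat) return false;
            // Ray casting: count the edges crossed by a ray going east from the point.
            boolean inside = false;
            for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
                double xi = ring[i][0], yi = ring[i][1];
                double xj = ring[j][0], yj = ring[j][1];
                if ((yi > lat) != (yj > lat)
                        && lon < (xj - xi) * (lat - yi) / (yj - yi) + xi) {
                    inside = !inside;
                }
            }
            return inside;
        }
    }

    private final List<Feature> features = new ArrayList<>();

    /** Registers a simple polygon (no holes) with its key-value properties. */
    void addPolygon(double[][] ring, Map<String, String> properties) {
        features.add(new Feature(ring, properties));
    }

    /** Returns the requested property of the first feature containing the point, if any. */
    Optional<String> lookup(double lon, double lat, String key) {
        for (Feature f : features) {
            if (f.contains(lon, lat)) return Optional.ofNullable(f.properties.get(key));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        MapLookup map = new MapLookup();
        map.addPolygon(new double[][] {{0, 0}, {10, 0}, {10, 10}, {0, 10}}, Map.of("iso", "AA"));
        map.addPolygon(new double[][] {{20, 0}, {30, 0}, {30, 10}, {20, 10}}, Map.of("iso", "BB"));
        System.out.println(map.lookup(5, 5, "iso"));   // point inside the first square
        System.out.println(map.lookup(15, 5, "iso"));  // point between the two squares
    }
}
```

The linear scan is only acceptable for a handful of shapes; for a map with many features the list would be replaced by a spatial index, which is where the R-tree mentioned above comes in.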
[08:01:27] btw, i tried to do it as a function, and function blew up https://phabricator.wikimedia.org/T180314 [08:01:46] nearest wd entity [08:01:59] ?wd wdt:P625 ?loc <--- this is not a good query [08:02:13] ? [08:02:13] you running iterator on literally every coordinate in the database [08:02:17] yes, i know [08:02:30] not good [08:02:34] but it blew up for a different reason [08:02:41] I know, still not good [08:02:43] and in my case i wasn't doing it on all values [08:02:56] i pasted it there as a simplified bug report [08:03:34] in this case, it is probably bad data [08:03:44] -290 does not look like valid coordinate value [08:04:03] yeah, but i think its two issues - the function shouldn't blow up either [08:04:17] i could have used it on a bad value myself, and the query shouldn't fail [08:04:29] that can be fixed, it probably should return undefined there... [08:04:48] maybe nan is better? [08:05:07] nan has very specific meaning... I don't think this case matches [08:05:34] undefined is also a special meaning :) nan is often used for errors me think [08:05:51] undefined means the value wasn't found. [08:05:58] for errors in floating point calculations. this issue has nothing to do with floating point [08:06:12] technically -290 is a valid data ;) [08:06:21] unless its missing the latitude [08:06:30] not as a coordinate [08:06:44] coordinates are -180..180 [08:07:07] no, that's a convention. Mathematically its unbounded, and should be reduced to -180..180 for calculations [08:07:31] look at leaflet for example - i could have any longitude [08:07:35] and it will still work [08:07:51] in wdqs, coordinates can be only -180..180 [08:08:15] and I do not plan to support anything else, not unless very strong use case shows up [08:08:16] i think that's a mistake - earth is round. 
If i add a few degrees, it should work regardless [08:09:05] i haven't seen a library yet that requires longitude bounds - check with pnorman - he is the GIS expert [08:09:50] I do not see any use in such coordinates so I am not going to support it, unless such use shows up [08:10:30] I have a lot of work to do without trying to support broken data, so... [08:10:58] oki, but please check with Paul, it is a fairly common thing [08:11:29] back to the service thing - so you think service has better facilities for initialization? [08:11:43] because other than that, its a very quick algo [08:13:24] yurik: for service, yes, you can definitely have initialization phase. That's kinda whole point, that's how federated call works - you make request, send it, parse results [08:13:47] service returns a set of bindings (i.e. iterator essentially) while function returns one value [08:13:55] that's the difference [08:14:38] can service start after the rest of the matching is done? I don't want the service to run through all the points in the DB, instead it should lookup just the needed values [08:15:18] given 100m points, i want to filter them by some complex query first, and later the small subset should be passed to the service for lookup [08:15:42] (i meant 100m subjects that have points and other data) [08:18:37] yurik: well, yes, you can tell it that or optimizer can do that automatically, but the last part is somewhat buggy [08:18:52] yurik: if you use hints, you can tell when to run service (first or last) [08:19:32] you can also tell which variables are needed for service which should cause optimizer to put service in proper place automatically [08:19:37] but that part is somewhat buggy [08:21:37] SMalyshev, could you write this up in the phab ticket - so that the other volunteer would see your suggestions? [08:22:45] yurik: which part? 
generally if you look up how other services are done it's there but writing full guide on implementing BG service probably would take some time :) [08:25:40] SMalyshev, just mention that you recommend this as a service, possibly mention the optimizer - https://phabricator.wikimedia.org/T179991 [08:26:00] this way we will have a linkable discussion there [08:26:36] maybe give a link to another similar service? (in terms of bindings to BD, not the actual lookup) [08:28:13] yurik: I am still not though - is ?location input or output parameter? [08:28:38] input [08:28:45] I feel like I still do not understand what this service will be doing [08:28:56] ok if it's input what is the output? [08:29:35] in the example it would have two outputs - ?wikidataId and ?isoCode, but we probably should have a better way to specify it [08:29:46] also, I am not sure if multi-value lookup is really needed [08:30:23] e.g. even though the feature's properties could have many key-values, i would think only one would be needed. But i could be wrong [08:30:32] the problem with services is that it is called on set of previous bindings, which you do not control. I.e. you have no control over how many times the service is called [08:30:48] or in which order [08:30:55] that's fine - each call is independent [08:31:11] if you have previous bindings. If service is called first, then it'd be called once of course because there's no point to do more [08:31:30] but if you have 100k incoming bindings, it can be called any number of times [08:31:38] with any subset of those bindings [08:31:41] the goal is basically to have a lookup function - given a location and a .map file, extract some property for the object that first matches [08:31:59] it could be done multiple times, etc [08:32:10] the problem is you don't have map file.
you have a string [08:32:47] right, internally it should be a synchronized caching system, with a promise [08:33:12] once the promise fulfills, (i'm not sure what java uses for promises - tasks in C#), the service is ready to work [08:33:45] java doesn't use anything [08:34:08] specific applications use different things, I have no idea what Blazegraph uses [08:34:41] also I'm not sure in function you will get any access to it... from what I see, function gets set of values and that's it [08:34:50] e.g. first call - tries to get the .map rtree from cache, if not there waits for the cache to populate with it. The other threads should see that rtree is being built for that .map, and wait for it too [08:35:06] service gets much bigger context so it probably can get http infrastructure etc. [08:35:11] right, so services it is [08:35:21] yurik: what you mean by "first call"? [08:35:38] the very first time service is called with the yet unseen .map value [08:36:05] all service instances should share an rtree cache, with .map as the key [08:36:14] so all other queries in all other threads would be waiting until this one gets the map? [08:36:39] all other queries that use the same service and need the same .map rtree - yes [08:36:41] that looks kinda dangerous... [08:37:08] why? it's the same pattern as any other external data download/federation/... [08:37:25] basically you don't want multiple threads to go fetch the same resource at the same time [08:37:32] federation is handled by Blazegraph, I don't have to manually synchronize it [08:38:05] and don't have to clean up if something goes wrong (e.g. somebody cancels in the middle of the request) [08:38:11] sure, this service would have to handle this. I would guess Java has some facilities for multithreaded cache, wouldn't it? [08:38:14] it's all handled by Blazegraph [08:38:29] even if you build your own service? [08:38:44] can't i use my own cache for rtrees?
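A minimal sketch of the shared cache described above, using only the standard Java library (java.util.concurrent has shipped CompletableFuture since Java 8, which plays the role of the promise). The class name SharedMapCache and the loader function are hypothetical, and whether a Blazegraph service could safely hold such shared state is exactly the open question in this discussion, so this shows only the caching pattern: the first caller for a given .map name starts the load, every other caller waits on the same future, and each wait has a timeout and can be interrupted without affecting the others.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: one in-flight load per map name, shared by all waiting threads. */
class SharedMapCache<V> {
    private final ConcurrentHashMap<String, CompletableFuture<V>> cache = new ConcurrentHashMap<>();
    private final Function<String, V> loader;

    /** The loader stands in for "download the .map and build the rtree". */
    SharedMapCache(Function<String, V> loader) {
        this.loader = loader;
    }

    V get(String mapName, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        // computeIfAbsent is atomic per key, so only one load is ever scheduled per map name.
        // The mapping function only schedules the work; the load runs on the common pool.
        CompletableFuture<V> future = cache.computeIfAbsent(
                mapName, name -> CompletableFuture.supplyAsync(() -> loader.apply(name)));
        try {
            // An interrupt or timeout ends this caller's wait only; the load itself continues.
            return future.get(timeout, unit);
        } catch (ExecutionException e) {
            // Drop a failed load so that a later call can retry instead of caching the error.
            cache.remove(mapName, future);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger loads = new AtomicInteger();
        SharedMapCache<String> cache = new SharedMapCache<>(name -> {
            loads.incrementAndGet();
            return "rtree for " + name;
        });
        System.out.println(cache.get("countries.map", 5, TimeUnit.SECONDS));
        System.out.println(cache.get("countries.map", 5, TimeUnit.SECONDS));
        System.out.println("loads: " + loads.get());
    }
}
```

The sketch leaves out eviction, memory limits and cancelling a load that nobody is waiting for any more; those would all need answers before anything like it could run inside a shared query service.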
[08:38:47] yeah, probably, but I am not feeling very comfortable having parallel system inside blazegraph... [08:39:38] i don't think it will be a big system - this is simply a cache object, that would rely on the standard Java libs [08:40:00] yurik: you can. but this cache has to be bulletproof against any multithreaded scenario blazegraph throws at you, including cancelling any operation at any time, killing any thread at any time, etc. etc. and still allow timeouts, memory limits, etc. to work as before [08:40:48] if that happened within one request, i'd be mostly ok with it, but coordinating between requests kinda makes me worry [08:41:05] understood. Can blazegraph abort-kill any thread at any time? Or does it do something more magical? [08:41:09] especially when we consider what if data lookup returns error, what if it takes too long, etc. etc. [08:41:33] right, i understand - has to be very simple and verifiable [08:41:45] yurik: Well, probably not kill (that's considered impolite in java world) but definitely interrupt. There are timeouts [08:42:12] SMalyshev, this system will have an extensive trial run on sophox first -- https://wiki.openstreetmap.org/wiki/Sophox [08:42:39] so we will have a good way to evaluate it in stressful env [08:43:02] btw, i figured out the issue i had - it was simply too large of insert/update queries at once [08:43:06] standard http client would probably deal with it ok since federation uses it, but your system then will have to deal with the fact that http lookup can be interrupted, and somehow also deal with what to do with threads waiting on that lookup [08:43:18] (which waits also should be interruptible of course) [08:45:04] fun challenge :) [08:45:58] i think it is still very valuable to pursue. I am mostly worried about the optimization part - I don't want the service to do a linear scan instead of using the result of another query [08:46:16] SMalyshev, do you think service params should be different somehow?
[08:46:36] i'm not very happy with them at the moment, but i can't think of a better pattern [08:48:12] yurik: well, for starters ?place wdt:P625 ?location . probably should not be there [08:48:30] it should be bd:serviceParam wikibase:location ?location or something like that [08:48:47] ?place shouldn't even be there, service has no business with it [08:48:50] i was modeling it on some other service i saw. Ok, will correct [08:49:15] yurik: probably on geosearch - which is somewhat misleading because it rewrites its pattern internally [08:49:37] but for your case you probably don't want that [08:50:56] yurik: one thing goes for you in services though - service has create and call stage. Service call is created only once. So there could be a moment you load the map. [08:51:17] then it's called on all incoming bindings, and that's where you do the lookups [08:53:18] SMalyshev, I updated the ticket. Please comment on it, i might not remember these things, and it would be good for the other volunteer to see it all [11:15:06] leszek_wmde: Help with https://gerrit.wikimedia.org/r/390414 is appreciated. I'm really lost with that. [11:15:19] I also wanted to poke amir about this one, but he is not in the IRC? [11:15:53] Thiemo_WMDE: oh yeah, I said in the daily I will have a look at thi [11:15:54] s [11:16:05] Superb. :-) [11:16:24] I'm afraid the patch that is already merged introduced an issue we are not aware of. [15:24:59] SMalyshev: I'm having the feeling that removing a description doesn't get reflected in search anymore. [17:26:56] Pff, abusefilter and API are a bad combination. Had to manually create https://www.wikidata.org/wiki/Q43087708 to see that an abuse filter rule was hit... [17:28:29] because of the label starting with “Image:”? 
[17:30:31] Yup, the api use throws some random save error [17:31:14] Magritte always trying to disrupt and now disrupting Wikidata ;-) https://collectie.rijksmuseumtwenthe.nl/zoeken-in-de-collectie/detail/id/7a694d6d-d284-5fb5-a976-1656a6545e00 [17:35:11] You'll probably screw up some maintenance queries if you start your work with "Category:" [18:09:46] sjoerddebruin: Haven't encountered that one yet I think. [18:10:01] Wow, we already have 91589 paintings with height/width [18:12:24] We have more people with heights than paintings though. [18:12:43] Oh, and at least 3,5k trees. [18:22:20] I guess in general we have more people than paintings :-) For paintings it's now around 40% [18:23:29] So it shows up in the entity suggestion thing now :-) [18:25:14] :) [18:28:30] sjoerddebruin: do you mean prefix search? [18:30:20] SMalyshev: I don't know all the exact terms, but it seems to fetch the last existing description if the description was emptied later. https://phabricator.wikimedia.org/T180382 [18:30:42] sjoerddebruin: fetch when you do what? [18:31:16] The descriptions used by wbsearchentities, seems to be a different index. [18:31:19] ah, I see, wbsearchentities. Ok, I'll check it [18:43:50] multichill: are you involved with painting and artworks in WD? [18:46:50] Uhm https://gerrit.wikimedia.org/r/390301?! [18:47:33] reosarevok: Yes, I am. Working on https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings [18:48:28] :) [18:48:59] See https://www.wikidata.org/wiki/Talk:Q21257263 - what's the current workflow for an etching (rather than a painting) of which a bunch of copies exist, in different museums? [18:49:18] Would that be one WD item? One per copy? Depends on the amount of copies? [18:51:06] reosarevok: That's why I picked paintings. Paintings don't really have this problem ;-) I would say pragmatic disambiguation/splitting up.
[18:51:22] So start with one item and split it up in the concept and the different physical copies when needed [18:51:35] Well, we want to collaborate with the local art museum, which mostly involves paintings :) [18:51:42] When needed is usually when you notice people using "applies to part" [18:51:46] But sometimes we end up with stuff like this which is trickier [18:52:26] Prints and photographs both have this issue. Statues too by the way, with multiple casts [18:52:28] So for example, if we want to add the copy in the collection of this specific local museum, would that make sense? And then should we have an item for the concept and the rest as what, (FRBR) manifestation of? [18:53:15] See for example https://www.wikidata.org/wiki/Q724377 [18:54:57] Ok :) [18:55:20] I'll probably pester you more when we actually start working with the museum, but this is good enough for now, thanks! [18:55:28] How many works are you planning to import? [18:55:46] Eventually, a lot [18:55:51] Please do read https://www.wikidata.org/wiki/Wikidata:Flemish_art_collections,_Wikidata_and_Linked_Open_Data [18:56:08] We decided to leave out some of the types to not flood Wikidata [18:56:49] Ooh, this does seem promising [18:56:53] I'll send it to the involved people [18:57:01] Wikidata is no data dumping ground. If you add 500.000 prints, you'll probably overload the community [18:57:05] And we'll definitely get in touch before mass-importing anything [18:57:19] AHA! [18:57:27] Eventually we might be able to handle that kind of numbers, but not now I think [18:57:49] I don't think it's *that* huge :) But yeah, this is mostly a first plan, so I'll bring this up with the people involved and talk about it! [18:58:21] (I am not that familiar with the collection yet, personally, so I need to learn more myself too) [18:58:54] SothoTalKer: oh no, here too? Can I never be free of you? 
:p [18:59:17] Maybe you can update https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Location/Estonia ? [18:59:45] See https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Location/Netherlands for an example [18:59:45] reosarevok: well i'm sorry. i love data :D [19:00:35] multichill: Tartu Art Museum is the one in question :) [19:00:44] https://www.wikidata.org/wiki/Special:Contributions/BotMultichill is what I'm indexing right now [19:00:45] Will add info there when I have more of it [19:02:14] Good to hear :-) [19:49:19] What are the terms of wikidata? [19:49:40] Say I wanted to create an RDF type for Podcast, then create links to them, with episodes and links to those [19:49:54] It sort of would form a dataset itself but also include some data like a description of them [19:50:02] Would this not be a wikidata thing though? [20:03:53] https://www.wikidata.org/w/index.php?title=Wikidata%3AArticle_placeholder&action=historysubmit&type=revision&diff=593494866&oldid=593493237 [20:04:57] sec^nd: https://www.wikidata.org/wiki/Wikidata:Data_access#Basic_important_things_to_know ? [22:16:38] SMalyshev: We ran into missing prefixes at https://www.wikidata.org/wiki/Topic:U1uymmnxodqm9x0m . What are the default prefixes these days? [22:29:26] multichill: hmm not sure I have the full list handy... what is missing? [22:33:02] I see you found the topic SMalyshev . Thanks for responding there [22:33:40] The query is at https://www.wikidata.org/w/index.php?title=Wikidata:WikiProject_sum_of_all_paintings/Collection/Rijksmuseum_Twenthe&action=edit [22:34:17] I do not see any prefix errors there [22:34:29] It works as a manual query to the query interface so I'm a bit puzzled too [22:35:08] Does the query interface insert some missing prefixes that are not there if you are hitting the sparql endpoint directly with a bot? 
[22:35:24] multichill: query endpoint does not insert anything [22:35:46] Then we just have to wait for Magnus to explain :-)