[02:36:04] SMalyshev, around? i was looking at the blazegraph wmf setup, and importing OSM. Would it be possible for me to create a simple file generator in some magical format that can be easily consumed by the existing loadData.sh? I would parse the OSM data into that format, and the rest of the plumbing would be reused
[02:36:56] this way i could host the WD data and OSM data on the same server
[03:56:12] yurik: loadData consumes gzipped TTL. It's certainly possible to use it on any such file. or just load a file manually into the Blazegraph workbench (localhost:9999 on any install)
[03:57:49] SMalyshev, thanks, is it the same as wikidata-20170503-truthy-BETA.nt.gz ?
[03:58:48] or is that an entirely different beast altogether?
[03:59:32] yurik: no, truthy has way less data, only direct statements
[04:00:40] SMalyshev, thx, is there documentation on the format of the import file? I could fake a similar format for OSM data
[04:01:27] yurik: you mean the TTL dump? sure, it's a standard RDF Turtle serialization: https://www.w3.org/TR/turtle/
[04:01:45] SMalyshev, no, the files that are generated by mange
[04:01:47] munge
[04:02:19] yurik: same thing. the munger just converts some things mentioned here: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_differences the format is the same
[04:02:39] ah, thanks :)
[04:02:53] if you load non-wikidata data, no need for the munger
[04:03:05] it's only because the wikidata dump is not exactly the same as the DB contents
[04:05:13] all that different ontology :(
[04:06:28] in RDF, we pay for simplicity of representation by having to know the complexities of ontology
[04:06:39] no free lunch :)
[04:09:02] SMalyshev, what do you think would be the best way to represent OSM data in a similar model? E.g. an entity R123 (relation # 123) with statement Tname%3Aen (tag name:en) and a string value "..."
[04:09:29] yurik: don't they already have some representation?
[04:10:04] SMalyshev, where? they are stored in a compact PBF file. I will be extracting them and generating the TTL dump
[04:10:12] in linked geodata?
[04:10:58] the one i found yesterday? It seems they do much smarter work on that data, whereas I need the most rudimentary raw access to all the data
[04:11:06] yurik: http://linkedgeodata.org/RDFMapping - isn't that what you need?
[04:12:15] not exactly - they add all these smarts during the import, which may or may not work, e.g. if the value is in the wrong format: https://github.com/GeoKnow/LinkedGeoData/blob/master/linkedgeodata-core/src/main/resources/org/aksw/linkedgeodata/sql/Mappings.sql
[04:12:27] i don't want to have domain knowledge inside the parser
[04:13:00] all i know is that there is a node, a way, and a relation, and each of them could have zero or more tags (key-value string pairs)
[04:13:33] so what are the keys? Aren't they those in Mappings.sql?
[04:13:46] nope
[04:13:50] https://taginfo.openstreetmap.org/keys
[04:14:00] there are 65k keys, and they keep growing (sadly)
[04:14:14] most of them are outliers
[04:15:04] keys are just like values - they can be typed in without any restrictions
[04:15:58] values I don't care about
[04:16:12] it's keys mostly that worry me
[04:16:20] heh, they worry me too :)
[04:16:42] i would love to organize an effort to significantly reduce their count
[04:16:54] and to encourage key reuse
[04:16:56] so what is missing in the LinkedGeoData converters?
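
A minimal Turtle sketch of the raw tag-dump model yurik describes above (one subject per OSM object, one predicate per tag key, values as plain strings). The osmrel:/osmt: prefixes and the IDs/values here are made up for illustration, not taken from the log:

    @prefix osmrel: <https://www.openstreetmap.org/relation/> .  # hypothetical prefix for OSM relations
    @prefix osmt:   <https://wiki.openstreetmap.org/wiki/Key:> . # hypothetical prefix for OSM tag keys

    osmrel:123 osmt:name "Example place" ;
        osmt:name%3Aen "Example place" ;   # tag "name:en", with ':' percent-encoded in the local name
        osmt:wikidata "Q42" .              # still a plain string at this point in the discussion

Per the note above that loadData consumes gzipped TTL, a file of statements like this should load once gzipped, or it can be pasted into the Blazegraph workbench by hand.
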
[04:17:10] they do too much
[04:17:36] i want a much simpler copying of OSM data, but so that wikidata data exists in the same DB
[04:17:54] as it requires 100k entity merges
[04:18:12] (matches 100k OSM objects against wikidata)
[04:18:23] actually closer to 300k
[04:22:31] too much like how? I am just looking there and there are some complex things going on, and I'm afraid creating a parallel ontology is not the best idea
[04:22:41] also, linkedGeoData goes a very long route (it seems) - they export the data into pgsql, and afterwards reimport it
[04:22:43] if one already exists
[04:23:50] I am lazy, so I always prefer building on existing projects :)
[04:24:15] but maybe you can reuse at least the mappings... not sure how important they are - are they used by anyone?
[04:24:27] i might be misunderstanding the word ontology as used in RDF. it seems LGD uses a ?t lgdm:key ?k and ?t lgdm:value ?v ontology, which might work i guess, but it would mean that instead of having one entity with a number of statements, they have tons of links
[04:24:28] except for linkedgeodata of course.... not sure what their status is
[04:24:39] seems like they haven't done much lately
[04:24:55] sure, i am as lazy as you are :)
[04:25:05] yurik: let me see what their actual data looks like
[04:25:49] but they seem to have created a much more elaborate system that requires them to maintain "understanding" of the OSM data - e.g. an object with a certain tag combination results in them creating a building or a road entity
[04:26:05] whereas i don't want to make/maintain those rules
[04:26:20] yurik: that depends on what the goal of the mapping is
[04:26:29] we should have probably started with that :)
[04:26:44] right - my goal is to simply have queryable data of everything in OSM
[04:27:25] and if the "value" is a wikidata value, to make an autolink :)
[04:27:33] right, but queryable how? what kind of queries?
[04:29:11] "give me a list of all OSM objects whose 'wikidata' tag value points to a WP disambig page (entity whose instance-of/subclass-of is WP disambig page)"
[04:29:44] also take into account that some queries listed here: http://linkedgeodata.org/OSM we cannot do, since Blazegraph doesn't have a full geosparql implementation (no complex functions, only point search)
[04:30:13] yurik: ok, this is simple enough, what about other tags?
[04:30:35] they should simply be stored as strings, so that I can output them, or possibly do string matching on them
[04:31:53] there are a number of other tags that are also wikidata, e.g. a "subject:wikidata" tag might point to the person whose statue the OSM object represents
[04:33:18] yurik: if you don't need to do anything interesting except retrieval with them, you could just do something like osm:data [ osm:key "key" ; osm:value "value" ]. But that won't work well if you have many tags per object and only need a few of them each time probably
[04:33:40] but that's about all the preprocessing i want to do - to treat any tag that matches "(.*:)?wikidata" as POSSIBLY having a wikidata value
[04:33:44] sec
[04:34:15] well, that won't work because i could have multiple wikidata tags in the same object
[04:34:26] e.g. "wikidata" and "subject:wikidata" tags
[04:34:34] (no duplicate keys)
[04:35:11] yurik: for wikidata tags, i'd probably have a dedicated predicate
[04:35:26] predicate==property?
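
A sketch of how the disambig-page query described at 04:29 might look with the dedicated-predicate approach suggested at 04:35, assuming the tag value is stored as the actual wd: entity IRI (the "autolink" from 04:27). The osmt: namespace IRI is an assumption; wd:/wdt: are the real Wikidata prefixes, P31/P279 are instance-of/subclass-of, and Q4167410 is "Wikimedia disambiguation page":

    PREFIX osmt: <https://wiki.openstreetmap.org/wiki/Key:>   # hypothetical tag-key namespace
    PREFIX wd:   <http://www.wikidata.org/entity/>
    PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

    SELECT ?osm ?item WHERE {
      ?osm osmt:wikidata ?item .                 # dedicated predicate instead of a string tag
      ?item wdt:P31/wdt:P279* wd:Q4167410 .      # instance of / subclass of "Wikimedia disambiguation page"
    }

This only works because both the OSM triples and the Wikidata triples sit in the same Blazegraph instance, which is the whole point of the setup being discussed.
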
[04:35:35] like osm:wikidata or osm:wikidataSubject
[04:35:49] * yurik still learning the lingo
[04:35:52] predicate is the thing going between subject and object :)
[04:36:03] subject property value :-P
[04:36:12] the middle of the triple
[04:36:13] not in RDF :)
[04:36:17] sigh
[04:36:27] sorry I didn't invent the terms :)
[04:37:03] sure. In any case - i'm totally fine with osm:wikidata, but can we have those auto-generated based on all the tag values in the database?
[04:37:14] can we generate 65k predicates dynamically?
[04:37:16] during the import
[04:37:49] and we could invent some silly escaping technique so that the ':' symbol is easy to type
[04:37:53] because it is used all the time
[04:38:14] yurik: sure, there's no limit on the quantity of predicates. But if you're not querying them, using them as such may be complex, since a predicate has to be a URL
[04:38:18] most tags are either "a" or "a:b" or "a:b:c"
[04:38:57] well, i think most of the time people will query by the value, when they know the predicate
[04:39:09] and it's recommended to be a well-namespaced URL. So it can't be just "a" - it has to be "http://osm.org/ontology/osmData#a" or something like that
[04:39:12] although obviously there could be cases when it is "any predicate that matches a pattern"
[04:39:21] sure thing
[04:39:38] yurik: matching predicates to a pattern will be a problem
[04:39:40] the "a" could be like "OSM:a" in the query
[04:39:52] they are not indexed that way, so it's a full database scan
[04:39:53] it's not that critical TBH
[04:40:13] there are other ways to do that
[04:40:57] most of the queries imply that I know the exact name of the predicate
[04:43:12] Is it even possible to dynamically generate the list of those predicate URLs so that they are easier to use in the query?
[04:43:46] i guess i could simply dump the list of all found keys into a file, and blazegraph would pick it up
[04:43:47] ok, I guess you can generate predicates then, though it is not in good taste to have predicates which have no ontology description. if you look at https://www.wikidata.org/wiki/Special:EntityData/Q4242.ttl you'll see we define all predicates.
[04:44:14] yurik: technically it's no problem, it's just strings
[04:44:38] escaping may be tricky but I guess percent-encoding should work
[04:44:55] oooo, the secret url :)
[04:45:21] semantically some people would dislike it but maybe it's ok
[04:45:42] yurik: you mean entity data? not secret at all :)
[04:45:56] well, i don't want to dump the entire list of all 65k predicates with every request
[04:46:03] only the ones used by that entity
[04:46:17] yurik: surely, not every request. the question is a request for what?
[04:46:32] yurik: we don't dump all of them (unless it's a full dump :) only used ones
[04:46:41] ah, ok
[04:47:48] so how would i go about creating these TTLs then? I guess I would need to come up with an escaping function, and a script that would parse OSM data and generate two files - the list of predicates, and an import file with data (possibly broken into parts)
[04:50:03] each entity (subject?) would have a list of predicates (properties) with their values (objects?). Most of the time it will be a string, except in some cases it will be a wikidata Q entity
[04:50:31] sounds good so far
[04:51:51] can you write a sample of both files, so that I can match that format?
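
For the "any predicate that matches a pattern" case from 04:39, the query has to filter on the predicate IRI as a string, which is why it turns into a scan. A sketch, again assuming a hypothetical osmt: namespace for the generated tag predicates:

    SELECT ?osm ?p ?value WHERE {
      ?osm ?p ?value .
      FILTER( STRSTARTS( STR(?p), "https://wiki.openstreetmap.org/wiki/Key:" ) )   # only OSM tag predicates
      FILTER( STRENDS( STR(?p), "wikidata" ) )                                     # keys matching (.*:)?wikidata
    }

Blazegraph has to touch every triple to evaluate the FILTERs here, which is the "full database scan" caveat mentioned above; queries that name the predicate exactly avoid it.
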
[04:52:51] yurik: well, if you're interested in what ttl looks like, then just take the ttl of any wikidata entity - like https://www.wikidata.org/wiki/Special:EntityData/Q4242.ttl I pointed out
[04:53:13] that format is the same as I would generate?
[04:53:27] yes
[04:54:12] you'd need to define some prefixes for your specific predicates and subjects, and then just write statements as they go
[04:54:42] so I wouldn't need multiple files, each file would simply be prepended with all the prefixes?
[04:54:52] i mean each import file
[04:55:03] prefixes = list of all predicates in that file
[04:56:09] in short - if i have a blank blazegraph installation, can I take the Q4242 ttl dump from above and upload it as is?
[04:58:13] yurik: yes, you don't need multiple files
[04:58:43] yurik: you can, but if you don't run it through the munger it won't link with other data, and queries from the manual/examples would not work
[04:59:15] because @prefix wikibase: <http://wikiba.se/ontology-beta#> . has a beta suffix, and the wdqs database does not
[04:59:23] eventually this will be gone, but not yet
[05:00:16] so i really should look closer at the munger's output to see what i actually need to generate. Is there a way to view the data somehow? With wikidata, I have the actual Q pages. Is it possible to get an entire entity from blazegraph?
[05:00:53] as in SQL's select * from Table where ID=XXX
[05:01:04] yurik: not sure why it matters for you which data the munger generates... are you going to use the wikibase ontology?
[05:01:30] no, but i will need to understand how TTL is structured, and it's better to do it with data that i understand
[05:01:43] yurik: it is possible, select * where { wd:Q4242 ?x ?y}. Not sure why you need it though
[05:02:04] and i also need to link to wikidata's entities, using the WD ontology. I need it for debugging.
[05:02:29] thx, i think it's enough to get me started :)
[05:02:50] yurik: ttl is just triples. you link to WD entries with wd:Q123, using the wd prefix as @prefix wd: <http://www.wikidata.org/entity/> .
[05:02:53] let's see if i can get blazegraph running - i was seeing some CSR errors in the browser
[05:08:45] SMalyshev, do you get errors too? http://88.99.164.208:9999/blazegraph/#splash
[05:09:01] 88.99.164.208 refused to connect.
[05:09:17] SMalyshev, try http://88.99.164.208:9999
[05:09:22] I suspect it's localhost only
[05:09:38] oh, i think it got cached locally somehow
[05:09:39] thx
[05:09:41] what error do you get?
[05:10:08] i actually get the redirect to http://88.99.164.208:9999/blazegraph/#splash which seems to show ok, but doesn't work
[05:10:22] suspect it got locally cached after my previous attempts to run it
[05:10:26] sec, tunneling
[05:10:48] how do i expose read-access publicly?
[05:11:44] yurik: if you just expose GETs those are only reads
[05:11:58] check blazegraph's puppet module, it has nginx configs :)
[05:12:04] bleh
[05:12:13] or rather the wdqs puppet module
[05:12:17] i thought you might remember them off hand :(
[05:12:30] not really
[05:12:43] but you should be able to access 9999 directly, no need for nginx
[05:13:06] https://github.com/wikimedia/puppet/blob/production/modules/wdqs/templates/nginx.erb
[05:13:15] makes for a good demo though
[05:13:30] do i need nginx for it?
[05:13:40] not really
[05:13:43] can blazegraph itself serve some static content?
[05:13:57] but if you expose it to the internet then it's better to have a proxy
[05:14:13] unless you want everybody to have full access to it
[05:14:28] yurik: it can but it's annoyingly complicated. you need to put it inside a war... not worth it
[05:14:40] putting nginx in front is about 100 times easier
[05:14:49] sigh, i didn't want to install nginx ... never dealt with it
[05:15:07] you can use any webserver you like that can do proxying
[05:15:09] oki, will do it later, first thing is to start the wikidata import
[05:15:10] apache is fine
[05:15:23] same difference, it's been a while since i set up a webserver
[05:15:39] worry not, running a local import and a tunnel to localhost
[05:17:03] SMalyshev, btw, i was hoping for the nice query.wikimedia.org, and instead i see... hmm, i see a 404:
[05:17:05] No context on this server matched or handled this request.
[05:17:05] Contexts known to this server are:
[05:17:27] /bigdata ---> o.e.j.w.WebAppContext@52e677af{/bigdata,file:/tmp/jetty-localhost-9999-blazegraph-service-0.2.3-dist.war-_bigdata-any-1807374230115208853.dir/webapp/,AVAILABLE}{file:/home/yuri/service-0.2.3/blazegraph-service-0.2.3-dist.war}
[05:17:45] yurik: ahh wrong url
[05:17:56] go to /bigdata/
[05:17:58] i went to /
[05:18:45] yep, seems to work. I thought raw blazegraph now redirects '/' to '/blazegraph', not '/bigdata'
[05:21:13] SMalyshev, so how come i don't see the nice query.wikidata interface there? Is that a separate setup somewhere?
[05:21:41] yurik: yes, it's in the gui subdir and that requires a webserver to run
[05:21:51] meh
[05:21:58] easiest is to redirect / to the gui subdir
[05:31:46] SMalyshev, at what point will i start seeing results after querying { ?a ?b ?c . } LIMIT 10
[05:31:57] the importer has been running for the past 10 min
[05:32:13] yurik: are you importing the whole db?
[05:32:16] yep
[05:32:24] where are you running it?
[05:32:32] very beefy SSD server
[05:32:44] well, probably will take about a day or so
[05:33:03] even for the first results?
[05:33:15] depends... how are you running it?
[05:35:14] $ ./loadRestAPI.sh -n wdq -d `pwd`/data/split
[05:35:25] just like the docs say :)
[05:35:52] what's the space requirement for the whole WD ?
[05:36:25] hmmm not sure when that one commits
[05:36:28] maybe at the end
[05:36:34] whole db... let me see
[05:37:00] 175G currently
[05:37:51] if you use loadData.sh it will commit after each chunk
[05:38:00] may be a bit slower but you get results sooner
[05:38:09] might take some time... :)
[05:38:10] in any case, thanks for all the help! Hopefully something fun will come out of it ;)
[05:38:21] don't forget to vote in the election :)
[05:38:31] yes, loading the db takes time
[05:38:40] already did :)
[05:38:45] awesome :)
[05:39:29] i think it commits after each file
[05:39:37] because it's the rest api
[05:39:49] yeah but it's a weird rest api... not sure
[05:51:08] SMalyshev, why would i see wikidump-000000009.ttl.gz.fail.fail as one of the dumps?
[05:51:27] all others seem to be ok
[05:51:32] yurik: hmm that means loading failed for some reason. something should be in the logs
[05:51:45] ahh wait fail.fail...
[05:51:52] did you try to stop it in the middle?
[05:52:17] usually fail.fail happens when it fails on a file, and then you don't rename it back
[05:52:32] so it tries to guess the format of the file that ends in .fail, and fails again
[05:52:50] if you rerun it, you need to rename the .good and .fail files back
[05:52:58] I told you this api is weird :)
[05:53:03] but it's supposed to be faster
[05:54:22] SMalyshev, hehe, we should document that somewhere :)
[05:54:35] we are talking about the munger, right?
[05:54:40] that's what would rename it?
[05:54:57] actually, i thought i deleted all of the files before rerunning it
[05:55:21] yurik: no, about loadRestAPI.sh
[05:55:42] you shouldn't delete them, otherwise you have to re-munge
[05:55:48] yeah, i did cancel the restapi one too...
[05:56:12] so i should simply remove the .fail.fail part, and reupload just that one file
[05:56:15] yup, so that's what happens. but do not worry, you can ignore it for now and just load that file manually after it's done
[05:56:25] the order is absolutely unimportant
[05:56:28] yep, that's what i thought, thx!
[05:59:53] SMalyshev, i am looking at the import files, and they use properties without defining them, e.g. wdt:P1546 wd:Q2016568 ; -- is that because the namespace "wdt" is defined already, and the actual P1546 does not need to be?
[06:03:35] yurik: it will be
[06:03:55] yurik: probably in some other file, but wdt:P1546 will be defined
[06:04:12] but it doesn't have to be pre-defined before being used. Gotcha
[06:04:21] yurik: no, it doesn't have to be
[06:04:50] it has nothing to do with use, it's just a good linked data habit to have proper definitions for classes and properties
[06:04:58] predicates
[06:05:20] most tools don't *need* them, but some OWL tools might in theory
[06:05:37] i understand. Do i need a proper "wikibase:Dump a schema:Dataset , owl:Ontology ; ..." statement at the top?
[06:06:37] (and I absolutely love the complexity of the ranking system :))
[06:06:54] or whatever that thing is -- wds:Q22-4362a929-4ac4-9901-c597-a24de34b0146
[06:09:32] yurik: you don't really *need* it. It's part of the linked data headers which may be helpful
[06:09:45] yurik: wds: is the statement
[11:25:51] Aleksey_WMDE: I updated https://github.com/DataValues/Geo/pull/103 . That should be the last change needed before tagging the new version.
[11:27:25] Thiemo_WMDE: Do you want me to merge it or will you do it yourself? I've looked at it - seems fine.
[14:57:23] Hi! is there any way to use the wbgetentities API for a specific revid?
[15:00:43] Someone who can read this and patrol it? https://www.wikidata.org/w/index.php?title=Q364194&type=revision&diff=480584863&oldid=454235925
[16:36:05] sjoerddebruin: Any idea how to trigger the edit interface at https://petscan.wmflabs.org/?psid=978762 so I can mass-add P31 ?
[16:36:41] Other sources > Use wiki > Wikidata
[16:38:20] Right, I knew it was hiding somewhere in plain sight. Thanks :-)
[16:44:17] It's the most useful hidden button
[17:47:29] Average statements keeps growing. https://grafana.wikimedia.org/dashboard/db/wikidata-datamodel-statements?refresh=30m&orgId=1&panelId=4&fullscreen&from=now-6M&to=now
[17:49:03] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0]
[17:49:42] PROBLEM - High lag on wdqs2001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0]
[17:50:43] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1800.0]
[17:50:43] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1800.0]
[17:50:44] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 37.93% of data above the critical threshold [1800.0]
[17:51:02] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0]
[17:52:34] sjoerddebruin: I thought you were talking about the lag ;-)
[17:52:41] :O
[17:53:43] I'm just working on the 300± people without genders on nlwiki...
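
Picking up the 06:04 point that it is a good linked data habit to give predicates proper definitions, a minimal (optional) declaration for one of the generated OSM predicates might look like the sketch below; the osmt: namespace IRI is an assumption, owl:/rdfs: are the standard vocabularies:

    @prefix osmt: <https://wiki.openstreetmap.org/wiki/Key:> .   # hypothetical tag-key namespace
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    osmt:wikidata a owl:ObjectProperty ;
        rdfs:label "OSM wikidata tag"@en ;
        rdfs:comment "Links an OSM object to the Wikidata item it represents."@en .

As noted in the log, most tools do not need these definitions, so they could live in a separate small TTL file loaded alongside the data.
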
[17:56:42] all of the lag
[17:57:29] I found a big batch of paintings on the pt Wikipedia
[17:57:33] Adding missing info now
[18:05:40] revi: Ohhh, you understand Korean :-D Can you help out with https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Possible_paintings ? Much appreciated
[18:05:52] hmm
[18:06:01] sure, if you can wait a few hours
[18:06:07] (03:06 KST)
[18:06:34] No rush, I'm already happy I found someone who understands the language and is active on Wikidata
[18:08:22] Updating the (fixed?) Wikidata logo now... https://gerrit.wikimedia.org/r/#/c/350697/
[18:09:36] well.... if you looked up the list of admins by language (I thought there was one?) you would have found me
[18:09:40] anyway kk
[18:41:03] RECOVERY - High lag on wdqs2003 is OK: OK: Less than 30.00% above the threshold [600.0]
[18:42:03] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0]
[18:42:43] RECOVERY - High lag on wdqs2002 is OK: OK: Less than 30.00% above the threshold [600.0]
[18:42:43] RECOVERY - High lag on wdqs2001 is OK: OK: Less than 30.00% above the threshold [600.0]
[18:42:44] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[18:43:43] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[19:23:23] * yurik is confused why we have p: wdt: and ps: prefixes :( https://phabricator.wikimedia.org/T164782
[19:23:28] cc: SMalyshev ^
[19:23:40] apparently others ran into the same questions
[19:24:41] no rush of course :)
[19:24:55] yurik: these are different relationships... p: is between item and statement, ps: is between statement and value, wdt: is the "best" of the claims. See https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#/media/File:Rdf_mapping.svg
[19:24:57] * yurik is much more interested in all the fun bots SMalyshev was working on ;)
[19:25:16] also https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Prefixes_used
[20:46:54] https://www.wikidata.org/w/index.php?title=Special:Contributions/Pintoch&offset=&limit=500&target=Pintoch <- interesting
[20:54:34] "Determing the correct protection policy for a Wikidata project" ;) :)
[20:56:52] multichill: I wonder if BlazeGraph has a trie search function? ;)
[21:05:25] So next time I register some IP space with RIPE I find it in Wikidata?
[22:57:51] SMalyshev, https://wiki.openstreetmap.org/wiki/User:Yurik/rdf
[22:57:59] this is the RDF i am thinking of importing
[22:58:13] it seems that ":" is an ok symbol if it's not first
[22:58:35] (i mean - the very first one is used as a namespace-value separator)
[23:44:21] yurik: ok, one thing I'd advise is to put language tags on strings for which the language is known
[23:47:07] yurik: also, for osmt:wikipedia I'd rather have the actual wikipedia link. Depends on what it's used for, but if it points to the article, why not have the article URL?
[23:47:52] yurik: also, if you know some values are numbers, it may make sense to store them as numbers.
[23:48:01] e.g. osmt:population
[23:50:05] and osmt:capital might be a boolean? https://www.w3.org/TR/turtle/#booleans
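
A Turtle sketch applying the closing advice above (language tags, real article URLs, typed numbers, booleans). The osmt: predicate names come from the draft page linked at 22:57, but the prefix expansions, the relation ID, and the values here are illustrative assumptions:

    @prefix osmrel: <https://www.openstreetmap.org/relation/> .   # hypothetical prefix for OSM relations
    @prefix osmt:   <https://wiki.openstreetmap.org/wiki/Key:> .  # hypothetical expansion of the osmt: prefix
    @prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

    osmrel:62422 osmt:name "Berlin"@de ;                          # language-tagged where the language is known
        osmt:population "3600000"^^xsd:integer ;                  # typed number instead of a string
        osmt:capital true ;                                       # Turtle boolean literal
        osmt:wikipedia <https://de.wikipedia.org/wiki/Berlin> .   # the article URL rather than a "lang:Title" string
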