[13:29:23] Is the list of data types here complete: https://www.wikidata.org/wiki/Special:ListDatatypes
[13:39:18] voidDotClass: Depends on how you define complete. These are the data types currently available, but more will be added in the future
[13:40:43] multichill, can you or anyone please clarify about the times. the docs are ambiguous, saying it could be gregorian or julian. what's the difference, and how to tell which one it is?
[13:41:06] also, for wikibase-item, is the entityId always the id of the thing minus P / Q?
[13:41:50] are times always YYYY-MM-DD or can they also be YYYY-DD-MM ?
[13:52:51] voidDotClass: Easiest to explain with an example. Take a look at https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q2483000 and scroll to P569
[13:53:32] Time is always in the same format "time": "+1938-01-14T00:00:00Z", and the calendarmodel tells you if it's Gregorian or Julian
[13:54:48] What are you trying to do exactly?
[13:55:22] gotcha. thanks multichill . i'm in a hackathon, and i'm trying to load all of wikidata's json dump into elasticsearch , and then i'll attempt to build an NLP question / answer engine around it.
[13:55:30] the only issue is i need to standardize the data
[13:55:44] aude is also working on that
[13:55:47] cool
[13:56:05] voidDotClass: And check out this nice easter egg https://www.wikidata.org/wiki/Q2483000?action=cirrusdump
[13:56:54] multichill, so 1) can you tell me about entity ids? 2) about amounts, what's the purpose of the upper / lower bounds?
[13:57:13] Currently label/description/aliases (multilingual) and links are in elasticsearch. Statements (like strings) are not
[13:57:42] yeah, that's what i'm having trouble with too, i've loaded labels etc, the statements need to be standardized
[13:57:55] so i'm writing parsers for all the different data types
[13:58:24] Just use the Q and P id. What language are you using?
[13:58:29] i'll be happy to release the wikidata -> elasticsearch loader as open source when i'm done. java
[13:58:52] We already have something in java I believe, so you might be able to reuse the parsing for that
[13:59:03] multichill, https://github.com/aliakhtar/orak/blob/master/src/main/java/com/github/aliakhtar/orak/scripts/WikiDataToElasticSearch.java
[13:59:08] awesome, may i see please?
[13:59:09] https://github.com/filbertkm/wikidata-dump-parser
[13:59:15] That's aude btw
[13:59:40] (googling right now)
[14:00:19] voidDotClass: You need https://www.mediawiki.org/wiki/Wikidata_Toolkit
[14:01:43] thanks multichill , i'm just about done though, just need to finish handling these data types. i need to standardize them the way i need them for my search.
[14:02:04] https://github.com/Wikidata/Wikidata-Toolkit/tree/master/wdtk-datamodel/src/main/java/org/wikidata/wdtk/datamodel/implementation
[14:02:14] so for entity ids, where it says numeric id, if i add P or Q, i can locate the actual property, right?
[14:02:31] Probably yes
[14:03:03] and can i safely ignore the "Property" data type?
[14:03:08] i don't get its meaning
[14:03:13] https://www.wikidata.org/wiki/Special:ListDatatypes
[14:03:19] voidDotClass: To answer your earlier question, see https://github.com/Wikidata/Wikidata-Toolkit/blob/master/wdtk-datamodel/src/main/java/org/wikidata/wdtk/datamodel/implementation/TimeValueImpl.java#L39
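As a rough illustration of the points above (fixed time format, calendarmodel deciding Gregorian vs Julian), here is a minimal sketch in Java with Jackson of parsing a time datavalue. Q1985727 (proleptic Gregorian) and Q1985786 (proleptic Julian) are the two calendar model items Wikidata uses; beyond the JSON shown in the API example, the handling here is illustrative, not the Toolkit's implementation.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeValueSketch {
    // Calendar model items: Q1985727 = proleptic Gregorian, Q1985786 = proleptic Julian.
    private static final String JULIAN = "http://www.wikidata.org/entity/Q1985786";

    // The year can be signed and zero-padded (e.g. "+00000002001-12-31T00:00:00Z"),
    // so a lenient regex is safer than a fixed-width date parser.
    private static final Pattern TIME = Pattern.compile("([+-])0*(\\d+)-(\\d{2})-(\\d{2})T.*");

    public static void main(String[] args) throws Exception {
        String json = "{\"time\": \"+1938-01-14T00:00:00Z\", \"timezone\": 0,"
                + " \"precision\": 11,"  // precision 11 = day
                + " \"calendarmodel\": \"http://www.wikidata.org/entity/Q1985727\"}";
        JsonNode value = new ObjectMapper().readTree(json);

        boolean julian = JULIAN.equals(value.path("calendarmodel").asText());
        Matcher m = TIME.matcher(value.path("time").asText());
        if (m.matches()) {
            long year = Long.parseLong(m.group(2)) * ("-".equals(m.group(1)) ? -1 : 1);
            System.out.printf("%d-%s-%s (%s)%n", year, m.group(3), m.group(4),
                    julian ? "Julian" : "Gregorian");
        }
    }
}
```

The TimeValueImpl linked above does this properly, so reusing the Toolkit beats hand-rolling it.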
[14:03:40] Property datatype is only used for linking properties to other properties. You can just ignore that part
[14:04:08] See for example https://www.wikidata.org/wiki/Property:P1647
[14:04:46] tyvm
[14:05:32] Can I see an example of the 'Mathematical expression' type anywhere? e.g. on the api?
[14:05:36] to get an idea of the format
[14:08:21] Check out https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q41591 "value": "I = \\frac{V}{R}",
[14:08:28] Isn't that LaTeX?
[14:09:13] Right, "Literal data field for mathematical expressions, formula, equations and such, expressed in a variant of LaTeX." (from https://www.wikidata.org/wiki/Special:ListDatatypes )
[14:10:58] hmm, how do you display that in the browser?
[14:16:36] multichill, so for 'amount', is it always that either upper/lower bound is specified, OR amount is specified
[14:16:46] or can both be specified? it wouldn't make much sense if both are
[14:21:19] So it's in the form of LaTeX, it gets rendered to something and that somehow gets displayed by the math extension
[14:39:01] yeah
[14:39:12] multichill, and amount?
[14:58:24] voidDotClass: Was AFK. Where do you see amount? What type?
[15:11:35] multichill, https://www.wikidata.org/wiki/Special:ListDatatypes
[15:11:48] 'quantity' not amount i guess, https://www.wikidata.org/wiki/Special:ListProperties/quantity
[15:12:15] but still doesn't make sense to have lower, upper, AND amount. Makes sense to have EITHER lower/upper, OR amount
[15:14:02] Take for example width. That can be 50 cm and lower/upper are for the fault
[15:20:36] multichill, fault as in?
[15:20:46] 0.1 / 0.1 ?
[15:24:30] voidDotClass: When you do a measurement there is always a fault margin
[15:24:55] ah, got it. multichill , is amount decimal / float, or integer / long?
[15:25:14] I believe it's all floats
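To make the bounds concrete, a minimal Jackson sketch of reading a quantity datavalue, with field names (amount/unit/upperBound/lowerBound) as they appear in the API and the dump. One caveat to the "floats" guess above: the dump serializes amounts as signed decimal strings, so BigDecimal avoids rounding them through a double. Everything else here is illustrative.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.math.BigDecimal;

public class QuantitySketch {
    public static void main(String[] args) throws Exception {
        // A quantity datavalue carries the measured amount plus an optional
        // uncertainty interval (the "fault margin" mentioned above).
        String json = "{\"amount\": \"+50\", \"unit\": \"1\","
                + " \"upperBound\": \"+50.1\", \"lowerBound\": \"+49.9\"}";
        JsonNode value = new ObjectMapper().readTree(json);

        // Amounts are signed decimal strings; BigDecimal keeps full precision.
        BigDecimal amount = new BigDecimal(value.path("amount").asText());

        // Bounds, when present, are given alongside the amount, not instead of it.
        if (value.hasNonNull("upperBound") && value.hasNonNull("lowerBound")) {
            BigDecimal plus = new BigDecimal(value.path("upperBound").asText()).subtract(amount);
            BigDecimal minus = amount.subtract(new BigDecimal(value.path("lowerBound").asText()));
            System.out.println(amount + " +" + plus + " / -" + minus); // 50 +0.1 / -0.1
        } else {
            System.out.println(amount + " (no uncertainty given)");
        }
    }
}
```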
[15:54:10] nikki: How did you do the identifiers statistics?
[16:45:05] hey Lydia_WMDE :)
[17:02:11] sjoerddebruin: I updated https://tools.wmflabs.org/multichill/queries/wikidata/noclaims_nlwiki.txt . Only 6252 left (dropped from 29161)
[17:02:16] Are you working on that?
[17:02:39] Not much...
[17:02:59] Have you checked some previous items yet?
[17:04:24] No
[17:06:03] tseppelt: Scope of https://www.wikidata.org/wiki/Property:P981 changed. Could you update the German label and description?
[17:06:30] (I noticed you removed the *fuzzy*, that's a little translatewiki trick to indicate that things changed)
[17:08:36] multichill: okay done. I wasn't sure about the *fuzzy*. After I removed it I noticed that you added it in several languages. I was a bit too lazy to revert my edit ;)
[17:09:19] hey hoo
[17:09:26] multichill: please, I'm interested. :)
[17:09:39] I added some good obvious identifier properties hoo
[17:09:56] multichill: Ah, cool
[17:10:05] do addshor.e's page?
[17:11:16] yeah
[17:12:19] sjoerddebruin: Oooh, shoot
[17:14:38] MariaDB [wikidatawiki_p]> SELECT * FROM page_props WHERE pp_page='2006295' LIMIT 1; Empty set (0.00 sec)
[17:14:41] aargh
[17:14:52] The page_props table is still not fully populated
[17:15:15] hm... maybe it's filtered on tool labs?
[17:15:23] Nope, just did a purge
[17:15:29] 4 rows in set (0.00 sec)
[17:16:09] I'll file a task to track this
[17:16:23] Thanks
[17:18:11] Do you know what "wb-status" is?
[17:21:17] not offhand
[17:22:39] If you file a bug about this, I can potentially mass purge / links-update pages
[17:22:50] I just need to know about these things
[17:23:01] hoo: https://phabricator.wikimedia.org/T127655 :-)
[17:23:13] I'll see if I can run a database query to get a list
[17:23:27] thanks :)
[17:23:56] Lol, I already had https://tools.wmflabs.org/multichill/queries/wikidata/no_pageprops.sql :-)
[17:24:44] It's not completely correct, but good enough for the first batch
[17:26:58] If this is a huge / large-ish problem for you, I can surely fix this
[17:27:09] running the scripts server side is not an issue
[17:27:32] Last time I ran this, the list was 2476796 items, so server side purge sounds like a better plan ;-)
[17:27:45] I've purged about 500k items this week and killed all parser cache entries on WD :P
[17:28:41] Do you graph the page_props table size somewhere hoo? Is that easy to set up?
[17:29:01] Are you in the nda ldap group?
[17:29:25] yup
[17:30:40] https://tendril.wikimedia.org/report/table_status?host=db1058&schema=wikidata&table=&engine=&data=&index=
[17:30:54] not a graph, though
[17:31:32] Planned to make a nice grafana graph from that at some point soon
[17:31:43] https://phabricator.wikimedia.org/T68025 (kind of related)
[17:33:03] You could make some sort of check to see if the total number of items and the total number of page_props rows with wb-sitelinks is about the same (and for the other types too)
[17:33:12] Or graph both
[17:33:53] hoo: https://tools.wmflabs.org/multichill/queries/wikidata/no_pageprops.txt is the first million
[17:39:39] hoo: That will be an interesting table growth from 20M to about 66M :P
[17:40:12] multichill: YIKES
[17:46:18] Anyway hoo, would be nice if you could fire up a purge job
[17:47:46] I'll look at it tomorrow
[17:47:59] if it's too much of a size change, I'll have to talk back to jynus, though
[18:03:07] hoo: Ok. Keep an eye on the table. I'm slowly purging some items in batches of 50
[18:04:46] LOL and it's read-only
[18:05:37] db1049 is crying in pain :P
[18:06:25] ... and it's good again
[18:06:41] use maxlag = 2 or so maybe
[18:06:55] How can the table size be 18.9M on that server?
[18:07:10] And 20.9M on the other server?
[18:07:20] 54699441
[18:07:24] that's the index estimate
[18:07:41] it's probably no more than +- 15% off
[18:07:49] ah
[18:08:22] the labs view might hide additional data
[18:08:42] let me find a server that I can torture with an actual count
[18:10:00] Using a labsdb for that now
[18:10:10] but with a user that has permission on the underlying tables
[18:10:24] (and from production, so no, I'm not leaking login information)
[18:10:56] 44588140
[18:11:02] SELECT COUNT(*) FROM page_props;
[18:13:04] The view is unfiltered
[18:15:41] hoo: Is it really hurting now?
[18:16:30] looks fine now
[18:17:37] It's going from low to high: "Purging 50 items in batch 77 ending at Q610130"
[18:18:55] What are you using to purge right now?
[18:22:10] Funny consequences of Phabricator email headers https://stats.wikimedia.org/mail-lists/wikidata-bugs.html
[18:22:22] Just pywikibot. It's just 10 lines of code hoo ;-)
[18:22:45] repo.purgepages(purgelist, forcelinkupdate=1)
[18:22:50] lol https://stats.wikimedia.org/mail-lists/Liuxinyu970226.html
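For the consistency check multichill suggests at [17:33:03], a hypothetical JDBC sketch against a labs replica. The connection details are placeholders, and the namespace-0 assumption for items is mine; this is a one-off sanity check, not a production monitoring script.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PagePropsCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical replica connection; adjust host/credentials for your setup.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://wikidatawiki.labsdb/wikidatawiki_p", "user", "pass")) {

            // Items live in namespace 0; each should eventually have a
            // wb-claims (and wb-sitelinks, wb-status, ...) page_props row.
            long items = scalar(conn,
                    "SELECT COUNT(*) FROM page WHERE page_namespace = 0");
            long claims = scalar(conn,
                    "SELECT COUNT(*) FROM page_props WHERE pp_propname = 'wb-claims'");

            System.out.printf("items: %d, wb-claims rows: %d, missing: %d%n",
                    items, claims, items - claims);
        }
    }

    private static long scalar(Connection conn, String sql) throws Exception {
        try (PreparedStatement st = conn.prepareStatement(sql);
             ResultSet rs = st.executeQuery()) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```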
[18:34:21] "datavalue":{"value":"113","type":"string"},"datatype":"string"} [18:34:26] voidDotClass: ^ [18:34:35] ty [18:34:39] that's the datavalue of statement for a string property [18:34:41] hope that helps [18:34:59] hoo, i'm wondering about external-id. i verified urls too [18:35:16] external id and url can be entered like string [18:36:02] and monolingualtext? [18:37:32] {"snaktype":"value","property":"P1476","datavalue":{"value":{"text":"some italian","language":"it"},"type":"monolingualtext"},"datatype":"monolingualtext"} [18:37:42] that's an example for a monoligual text snak [18:37:58] https://www.wikidata.org/wiki/Special:EntityData/Q1234.json [18:38:14] Using those URLs, you can see the json representation of items and look for things like that [18:40:13] ty [18:44:36] any examples of a time with padded zeroes? [18:45:40] nvm, +00000002001-12-31T00:00:00Z [19:31:13] are numeric ids safe to be treated as ints, or do they need to be longs? [19:31:21] they're not over ~billion right [19:33:12] voidDotClass: not in near future [19:33:22] kool [19:33:30] what do you plan to do? [19:33:38] I'm just parsing the json dump [19:34:37] Can anyone confirm that for "wikibase-item", to get to the actual id of the entity, you just prefix P or Q to the numeric id? [19:49:02] voidDotClass: For properties and items, yes [19:49:17] for upcoming types of entities: Not necessarily [19:49:28] well, i just want to know about the current json dump [19:49:43] so as long as it works for that, great. :hoo [19:58:04] hoo: I think we are going to change that? [19:58:17] iirc I saw a ticket about that [19:58:41] change what? [19:58:55] change the serialization format of entityidvalue [19:59:02] to plain string [19:59:15] sure [19:59:20] but the old one will stick around [19:59:41] anything else would be a very high impact breaking change [20:00:04] I meant https://phabricator.wikimedia.org/T56085 [20:00:13] but not sure it covers output format [21:28:53] hi, dennyvrandecic [21:31:31] Amir1: You around? Did you still spend time on a classification system? I know you're working on scoring, but what happend with classification? ;-) [21:32:00] hey, how are you multichill ? [21:32:07] I'm working on the extension [21:32:23] Which one? [21:32:31] Extension:ORES [21:32:47] see wikidata-l or wikitech for announcement [21:33:14] but as a part time hobby I also work on buidling a GUI for Kian in matter of finding possible errors in Wikidata [21:33:21] I've seen it. That's scoring, right? [21:33:28] i.e. making kian to report in a better shape [21:33:41] multichill: no, [21:33:47] ORES scores [21:33:56] the extension highlight [21:34:02] https://www.mediawiki.org/wiki/Extension:ORES <- this is scoring, right? [21:34:17] no [21:34:29] it only stores scores [21:34:39] and use it for classification [21:34:41] http://mw-revscoring.wmflabs.org/wiki/Special:RecentChanges [21:34:49] (make an account with dummy password) [21:34:57] then enable it as a beta feature [21:35:04] or check the screen shot in mediawiki :D [21:35:14] The "Revision scoring as a service" made me think it was about scoring. Stupid me [21:35:30] multichill: hoo: i am actually worried about increasing the size of the entity per page table considerably [21:35:33] multichill: https://www.mediawiki.org/wiki/File:ORES_extension_screenshot.png [21:35:35] very worried [21:35:38] huh? [21:35:43] Is anyone mass creating? 
[21:28:53] hi, dennyvrandecic
[21:31:31] Amir1: You around? Did you still spend time on a classification system? I know you're working on scoring, but what happened with classification? ;-)
[21:32:00] hey, how are you multichill ?
[21:32:07] I'm working on the extension
[21:32:23] Which one?
[21:32:31] Extension:ORES
[21:32:47] see wikidata-l or wikitech for the announcement
[21:33:14] but as a part time hobby I also work on building a GUI for Kian for finding possible errors in Wikidata
[21:33:21] I've seen it. That's scoring, right?
[21:33:28] i.e. making kian report in a better shape
[21:33:41] multichill: no
[21:33:47] ORES scores
[21:33:56] the extension highlights
[21:34:02] https://www.mediawiki.org/wiki/Extension:ORES <- this is scoring, right?
[21:34:17] no
[21:34:29] it only stores scores
[21:34:39] and uses them for classification
[21:34:41] http://mw-revscoring.wmflabs.org/wiki/Special:RecentChanges
[21:34:49] (make an account with a dummy password)
[21:34:57] then enable it as a beta feature
[21:35:04] or check the screenshot on mediawiki :D
[21:35:14] The "Revision scoring as a service" made me think it was about scoring. Stupid me
[21:35:30] multichill: hoo: i am actually worried about increasing the size of the entity per page table considerably
[21:35:33] multichill: https://www.mediawiki.org/wiki/File:ORES_extension_screenshot.png
[21:35:35] very worried
[21:35:38] huh?
[21:35:43] Is anyone mass creating?
[21:35:49] Lydia_WMDE: So good we're not working on that one ;-)
[21:36:01] multichill: sorry, the other one
[21:36:03] page_props will get bigger
[21:36:04] through your purging
[21:36:07] yes
[21:36:09] ah, that
[21:36:09] yes
[21:36:25] Large tables aren't a problem per se
[21:36:35] it totally depends on what its planned use is
[21:36:42] are we sure this is ok and not going to blow up in our face?
[21:36:50] but that table is primarily for lookups by page, so I don't expect a huge problem
[21:36:55] especially with such a large increase?
[21:37:04] We have some stuff trying to query + sort on that table
[21:37:13] but I don't think that even works right now
[21:37:26] jaime told me that s4 or s5 (the one that wikidata is in) has some storage issues
[21:37:34] but it'll be fixed soon
[21:37:35] s5 is ok-ish
[21:37:51] we have 300gb free or so on the smallest slave
[21:38:00] (you can check that in grafana)
[21:38:17] the amazing grafana
[21:38:26] And probably > 1.5 TiB on the larger 2.8 TiB slaves
[21:39:02] ok so we are cool with this considerable increase?
[21:39:47] It will happen over time anyway. I'm doing it now because the old way of retrieving claimless items has gotten so slow that queries get killed off
[21:40:07] multichill: I would be happy to do some classification too :D
[21:40:08] speaking of revision scoring, right now http://labels.wmflabs.org/gadget/loader.js appears dead :)
[21:40:22] SMalyshev: let me check
[21:40:28] multichill: i understand it'll happen anyway but if we know that a huge increase is coming and that it will cause issues then we should work on the fix first
[21:40:33] that is why i am asking
[21:40:40] thank you for telling :)
[21:40:51] Amir1: My use case would probably be new articles without any statements
[21:41:27] hmm
[21:41:44] Take for example https://en.wikipedia.org/wiki/Old_Central_Fire_Station_%28North_Little_Rock,_Arkansas%29 (just created) and assume someone creates a Wikidata item for that
[21:41:46] can you give me a feed of new claimless items per wiki (only big wikis)?
[21:42:17] SMalyshev: I told halfak to check
[21:42:26] great
[21:42:40] I'm trying to fix the issue by myself but I doubt I have a great chance
[21:43:33] looks like the whole of labels.wmflabs.org is in trouble... http://isup.me/labels.wmflabs.org
[21:43:44] Would starting with just Dutch Wikipedia work for you Amir1? I could set up a query on the Wikidata recent changes to retrieve newly created items without statements and a sitelink to the Dutch Wikipedia
[21:44:11] that would be cool
[21:44:19] I hope it doesn't take really long
[21:44:31] It's on the recentchanges table so it should be really fast
[21:44:46] queries of wikidata tables usually take a painfully long time
[21:45:13] but joining with wikibase-related tables
[21:45:24] because you need to get the claimless ones
[21:45:42] You missed the page_props discussion Amir1 ;-)
[21:46:08] page_props contains "wb-claims"
[21:46:10] if it's page_props, awesome
[21:46:43] I can make it even simpler. New pages, that have at least 1 sitelink, but no claims
[21:46:46] Would that work?
[21:48:58] if the feed is not too big, yeah
[21:49:15] but if it's big we need to limit it to nl, en, de, fr
[21:50:29] SMalyshev: revscoring instances all have a permission error
[21:50:49] I see
[21:51:01] I told the labs people, they told us it should fix itself soon
[21:51:11] if not, we will file a bug
[21:52:14] labs.....
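A guess at the shape of that "at least 1 sitelink, but no claims" query, wrapped in JDBC like the earlier sketch. The column names come from the standard MediaWiki recentchanges/page_props schema, but the join logic is only one reading of the discussion; multichill's actual .sql, linked just below, is the authoritative version.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClaimlessNewItems {
    // New items (namespace 0) that already have a sitelink but no claims yet,
    // based on the wb-sitelinks / wb-claims page_props counters.
    private static final String SQL =
        "SELECT rc_title FROM recentchanges "
        + "JOIN page_props AS sl ON rc_cur_id = sl.pp_page "
        + "  AND sl.pp_propname = 'wb-sitelinks' AND sl.pp_value >= 1 "
        + "LEFT JOIN page_props AS cl ON rc_cur_id = cl.pp_page "
        + "  AND cl.pp_propname = 'wb-claims' AND cl.pp_value >= 1 "
        + "WHERE rc_namespace = 0 AND rc_new = 1 AND cl.pp_page IS NULL "
        + "LIMIT 100";

    public static void main(String[] args) throws Exception {
        // Hypothetical replica connection, as before.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://wikidatawiki.labsdb/wikidatawiki_p", "user", "pass");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(SQL)) {
            while (rs.next()) {
                System.out.println(rs.getString("rc_title")); // e.g. Q2483000
            }
        }
    }
}
```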
[22:01:12] Amir1: https://tools.wmflabs.org/multichill/queries/wikidata/new_items_with_sitelink_without_claim.sql and https://tools.wmflabs.org/multichill/queries/wikidata/new_items_with_sitelink_without_claim.txt
[22:02:29] hmm
[22:02:34] that looks not bad
[22:02:57] I'll keep it and run some experiments on it soon
[22:03:01] Query takes less than a second
[22:03:07] wow
[22:03:09] awesome
[22:03:58] Bumped the limit to 10.000 items and still 5 seconds
[22:04:20] That should keep you busy, right? ;-)
[22:05:05] Amir1: Would be really cool if you process it and somehow offer it to users to approve
[22:05:14] Is this a human? Is the sport football? Etc.
[22:05:28] yeah
[22:05:41] And just ignore anything that contains a ":"
[22:05:42] I can feed it to the game
[22:05:55] No need to bother humans with categories/templates/other meta junk
[22:05:58] but people don't play it
[22:06:45] it will keep me busy for a while
[22:07:50] I'm also trying to make Kian much more sophisticated, for better accuracy, etc.
[22:08:35] You could grab the article introduction and offer it to the user
[22:09:21] The API has a function for that
[22:15:11] the game had it
[22:15:19] I don't know why magnus took it
[22:15:41] Took what?
[22:16:39] took that part out of the game
[22:17:25] have you played the kian game in magnus' system? (distributed games)
[22:19:11] I don't think I have
[22:20:09] Amir1: Going to bed, the login server I was working on got rebooted
[22:20:34] okay
[22:20:36] sleep tight
[22:20:47] you will have it soon
[22:30:38] I'm leaving for today o/
[22:30:55] I hope not to get paged overnight because of the purges :P