[00:11:11] zhuyifei1999_: the trailing slash is needed
[00:13:51] I see. dunno why iOS irccloud client ignores the slash
[01:59:20] gehel: Thanks for fixing that! I'll come back to this on Monday, and maybe recreate these instances to not be stretch if those other errors turn out to be hard to fix
[10:29:57] hello. does anyone know if it is possible to optimize the performance of mysql tools db tables created by users?
[10:34:45] i mean the "optimize" query
[11:44:36] mmecor: you mean the “OPTIMIZE TABLE” query?
[11:45:07] I'm not sure ^, do you not have rights to run it?
[11:45:16] yup. because i'm seeing that they are not as fast as other tables (replicas, for instance). for a simple query with just one inner join, it takes ages
[11:45:31] did you add indexes?
[11:45:50] zhuyifei1999_: i do! i realized later i can do the optimize, now it's been one hour since i submitted the query 'optimize table'
[11:46:28] yes, i did introduce them
[11:46:37] hmm
[11:46:43] here's the query:
[11:47:02] if optimize table is broken for some reason I guess you can file a bug
[11:47:04] query = 'SELECT * FROM s53619__wcdo.wd_geolocated_property INNER JOIN s53619__wcdo.wd_sitelinks ON wd_sitelinks.qitem=wd_geolocated_property.qitem WHERE wd_sitelinks.langcode="'+languagecode+'wiki";'
[11:47:22] what’s the EXPLAIN?
[11:47:34] zhuyifei1999_: i have no idea if it is broken, i didn't think it was needed. i was just looking for something to help it :)
[11:47:50] explain? haven't used it
[11:47:53] I might spot some bottlenecks in the EXPLAIN
[11:48:03] let me check this explain thing
[11:48:20] for your user table, you should be able to just EXPLAIN SELECT ...
[11:48:41] same query with an explain before?
[11:48:42] tables on replicas are more complicated
[11:48:46] yeah
[11:49:25] hey, mariadb reacted
[11:49:29] MariaDB [s53619__wcdo]> optimize table wd_sitelinks;
[11:49:29] Stage: 1 of 1 'altering table' 99.8% of stage done
[11:49:34] it should produce a ‘table’ with lots of technical information on how the query is executed
[11:49:48] should i wait until the optimize finishes?
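(For reference: the EXPLAIN zhuyifei1999_ asks for can be run from the same Python script the query string above comes from. A minimal sketch, assuming pymysql, the usual ToolsDB host name and Toolforge credentials file, and a parameterized langcode; none of those details appear in the log.)

    import os
    import pymysql

    languagecode = 'ca'  # cawiki is one of the languages tried later in the log

    # The host name and credentials file are assumptions about the usual
    # Toolforge/ToolsDB setup, not something shown in the log.
    conn = pymysql.connect(
        host='tools.db.svc.eqiad.wmflabs',
        db='s53619__wcdo',
        read_default_file=os.path.expanduser('~/replica.my.cnf'),
        charset='utf8mb4',
    )

    sql = (
        'EXPLAIN SELECT * FROM wd_geolocated_property '
        'INNER JOIN wd_sitelinks ON wd_sitelinks.qitem = wd_geolocated_property.qitem '
        'WHERE wd_sitelinks.langcode = %s'
    )

    with conn.cursor() as cur:
        # Binding the langcode as a %s parameter replaces the string
        # concatenation used in the original query string.
        cur.execute(sql, (languagecode + 'wiki',))
        for row in cur.fetchall():
            print(row)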
[11:50:21] i did send the optimize query an hour ago for one of the two tables in the inner join
[11:50:25] up to you
[11:50:31] perhaps i should do it to the other too
[11:50:54] zhuyifei1999_: but i mean, it may be counterproductive to query while the table is self-sorting, or i don't know what it does with the optimize
[11:51:27] idk whether running queries while the optimization is ongoing has bad consequences
[11:52:04] let's give it some min
[11:52:09] k
[11:52:11] Stage: 1 of 1 'altering table' 100% of stage done
[11:55:11] +---------------------------+----------+----------+-------------------------------------------------------------------+
[11:55:11] | Table | Op | Msg_type | Msg_text |
[11:55:11] +---------------------------+----------+----------+-------------------------------------------------------------------+
[11:55:13] | s53619__wcdo.wd_sitelinks | optimize | note | Table does not support optimize, doing recreate + analyze instead |
[11:55:15] | s53619__wcdo.wd_sitelinks | optimize | status | OK |
[11:55:17] +---------------------------+----------+----------+-------------------------------------------------------------------+
[11:55:19] 2 rows in set (1 hour 25 min 52.69 sec)
[11:55:27] not very successful
[11:55:32] it seems
[11:55:59] now i'm trying the 'explain'
[11:56:05] k
[11:56:39] 1 hour 25 min wow
[11:59:28] https://pastebin.com/bVNYBvxu
[12:00:22] it scans a million rows in wd_geolocated_property
[12:00:48] yes
[12:00:57] wd_sitelinks has 58313009 rows
[12:01:23] and wd_geolocated_property has
[12:01:29] (waiting for the count query)
[12:04:18] probably a million
[12:04:22] yes
[12:04:24] around that
[12:04:30] (it is still doing the count)
[12:04:44] their link is the qitem, and i want to filter by the sitelinks langcode
[12:05:10] i first had this db working locally with a sqlite3 file and it flew
[12:05:24] there was no delay in response time
[12:05:38] but with mysql it takes ages
[12:05:58] so what you are doing is like all geolocated items for a given language code?
[12:06:21] yes
[12:06:29] it's one of the things i do
[12:06:53] i parse the wikidata dumps for many many properties, put them into different tables
[12:06:58] and then i work with the data
[12:07:29] the problem is that if it struggles with the geolocated table it will struggle with the other properties tables
[12:07:51] have you considered adding an index to wd_geolocated_property.qitem
[12:08:03] i think i had, wait a sec
[12:08:09] and select only indexed columns
[12:09:01] the EXPLAIN output says it’s doing a full (ALL) scan and not using any indexes for that table
[12:09:13] qitem is primary key in both tables
[12:10:27] could you check if selecting only qitem produces anything different?
[12:10:46] (trying to make it not do a full table scan)
[12:10:52] https://pastebin.com/XnXM4uSp
[12:11:42] btw: select count(*) from wd_geolocated_property: 5744959 (8 min 20.09 sec)
[12:12:23] hmm 4 million rows not indexed?
[12:12:36] trying: SELECT * FROM wd_geolocated_property INNER JOIN s53619__wcdo.wd_sitelinks ON wd_sitelinks.qitem=wd_geolocated_property.qitem;
[12:12:57] ?
[12:13:18] zhuyifei1999_: they should be indexed. isn't the index the primary key qitem?
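(The two checks suggested here — listing the indexes and EXPLAINing a qitem-only version of the join — are roughly what the pastebins in the next exchange show. A minimal sketch under the same pymysql assumptions as above:)

    import os
    import pymysql

    conn = pymysql.connect(host='tools.db.svc.eqiad.wmflabs', db='s53619__wcdo',
                           read_default_file=os.path.expanduser('~/replica.my.cnf'))

    with conn.cursor() as cur:
        # The indexes on the table (roughly what the index pastebin below contains):
        cur.execute('SHOW INDEX FROM wd_geolocated_property')
        print(cur.fetchall())

        # EXPLAIN of the join selecting only the indexed qitem column, so the
        # optimizer can use the primary key instead of a full table scan:
        cur.execute(
            'EXPLAIN SELECT wd_geolocated_property.qitem '
            'FROM wd_geolocated_property '
            'INNER JOIN wd_sitelinks ON wd_sitelinks.qitem = wd_geolocated_property.qitem '
            'WHERE wd_sitelinks.langcode = %s',
            ('cawiki',),
        )
        print(cur.fetchall())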
[12:13:25] https://pastebin.com/XnXM4uSp
[12:13:29] here the indexes
[12:13:41] btree type it says
[12:13:50] I mean, could you replace * by qitem so other columns do not get selected
[12:13:57] ok
[12:14:13] and see the explain output
[12:14:30] if that produces a better result
[12:14:44] i thought that * was faster because it didn't need to filter
[12:14:55] trying the explain before, ok
[12:15:26] https://pastebin.com/v3yyh75R
[12:16:13] yep it’s now using the primary key
[12:16:35] but still 1 million rows to scan through :(
[12:16:54] does the query run faster now?
[12:17:04] it is stuck
[12:17:24] hey, wait a sec
[12:17:26] SELECT * FROM s53619__wcdo.wd_geolocated_property INNER JOIN s53619__wcdo.wd_sitelinks ON wd_sitelinks.qitem=wd_geolocated_property.qitem WHERE wd_sitelinks.langcode="cawiki";
[12:17:30] oh about *, it forces the server to access the table for non-indexed columns
[12:17:31] 172833 rows in set (9 min 34.73 sec)
[12:17:52] that's better than before. but what did we change? the optimize query?
[12:17:59] maybe
[12:18:12] i mean, before it used to take 2 hours or more
[12:18:18] i left it for the night
[12:18:28] after watching it for 1h
[12:18:53] i'm not sure what shortened the time to 9 min
[12:19:06] but i only did the optimize thing
[12:19:19] it might have helped
[12:19:28] maybe a lot
[12:20:16] it's important, this needs to scale to the 288 langs (i was just trying catalan and guarani now to have an idea)
[12:20:36] so perhaps right after creating the mysql table from the dump i should do the two optimizes
[12:20:41] one per table
[12:21:31] Are you using the dump for every wiki? Or only for wikidata? mmecor
[12:22:07] it is the wikidata dump, i sort different property sets into different tables
[12:22:19] I'm not nearly as proficient as zhuyifei1999_ in query optimization, btw
[12:22:19] then i run the table filtering with sitelinks.langcode for every language
[12:22:30] Sure, I understand now
[12:23:40] wd_sitelinks is the key table. wd_language_properties, wd_country_properties, wd_geolocated_property, wd_location_properties, wd_has_part_properties, wd_part_of_properties, wd_created_by_properties, wd_lived_there_properties
[12:23:49] are the tables i play with iteratively
[12:23:54] well, geolocated is straightforward
[12:24:07] but others need to be processed several times
[12:24:31] and sitelinks as i said is the filter that helps me run only the times that exist in a language
[12:24:41] times no, items*
[12:25:29] are they only 1-column tables?
[12:25:48] (containing qitem only)
[12:25:56] https://pastebin.com/NLRXY18e
[12:26:18] few columns: qitem, the property that relates them to the table, and the qitem2 that is the target of the property
[12:29:00] this, in a sqlite3 file, was easy to read as many times as needed. but with mysql it is impossible even the first time, it seems
[12:29:11] uh, if the table data is needed as well, it’s gonna be hard to optimize it. (unindexed columns are needed in any case; can’t rely on index only). yeah optimize table is the only thing I can think of right now
[12:29:50] idk how sqlite can manage its speed
[12:30:30] sqlite is bad for concurrency but it is faster. reading from disk is also faster. that was my idea... to create a cleaner version of the wikidata dump
[12:30:37] with some indexes
[12:30:45] to be able to iterate it several times
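(A sketch of that local sqlite3 approach: a cleaner, indexed subset of the dump kept as a single file and iterated as often as needed. The column names follow the descriptions in the log; the exact schema is in the pastebin, and the composite index on (langcode, qitem) is an assumption about what helps the filter, not something stated in the conversation.)

    import sqlite3

    # Hypothetical local file holding the cleaned-up subset of the dump.
    conn = sqlite3.connect('wcdo.sqlite3')
    cur = conn.cursor()

    # Columns as described in the log: qitem, the property, and the target qitem2;
    # wd_sitelinks additionally carries langcode and page_title.
    cur.executescript('''
        CREATE TABLE IF NOT EXISTS wd_sitelinks (qitem TEXT, langcode TEXT, page_title TEXT);
        CREATE TABLE IF NOT EXISTS wd_geolocated_property (qitem TEXT, property TEXT, qitem2 TEXT);
        CREATE INDEX IF NOT EXISTS idx_sitelinks_lang ON wd_sitelinks (langcode, qitem);
        CREATE INDEX IF NOT EXISTS idx_geolocated_qitem ON wd_geolocated_property (qitem);
    ''')

    # The same join as on ToolsDB, filtered by language:
    cur.execute('''
        SELECT g.qitem
        FROM wd_geolocated_property AS g
        JOIN wd_sitelinks AS s ON s.qitem = g.qitem
        WHERE s.langcode = ?
    ''', ('cawiki',))
    print(len(cur.fetchall()))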
[12:31:30] not selecting page_title might help so it won’t access the sitelinks table, only its index
[12:31:41] but that won’t have much effect
[12:32:08] SELECT wd_geolocated_property.qitem FROM wd_geolocated_property INNER JOIN s53619__wcdo.wd_sitelinks ON wd_sitelinks.qitem=wd_geolocated_property.qitem;
[12:32:18] 13 min 0.09 sec
[12:32:36] this is the entire table though
[12:32:55] zhuyifei1999_: yes, but i might get the page_title = qitem equivalence from somewhere else.
[12:33:26] and in any case, this is a plain retrieval with the geolocation property. with the other properties it might be impossible.
[12:34:14] well, reading over NFS will be the slowest thing on toolforge. if you really want to use sqlite, and if the database is not too large, I’d recommend copying the database to some temp folder under /tmp and accessing that (and make sure you delete the file)
[12:34:30] um
[12:34:32] ok, now we know more, i can explain more about how the situation is
[12:35:02] zhuyifei1999_: he has requested a VPS project for this
[12:35:11] oh
[12:35:36] If we approve it, I think it'd be better to go back to sqlite
[12:35:44] well, then it should get super fast and large instance local storage :)
[12:36:23] I think and.rew can make sure it goes on a virt host with good SSD
[12:36:58] this is becoming a necessity as i am experimenting with the different bottlenecks
[12:37:47] now we know :)
[12:38:08] yeah if sqlite works better then great
[12:40:14] :)
[12:42:16] zhuyifei1999_: I still would like to move the dump parsing to wikireplicas querying, as it will be repeated at least every two weeks
[12:43:32] sigh, I wish dataset_p was real
[12:44:36] https://phabricator.wikimedia.org/T173513
[12:46:22] ^ less messy importing, most likely, and usable on quarry
[12:46:54] I think that isn't the problem for this project
[12:48:41] what I mean is, if the database is moved to tools-db, how much benefit would it bring?
[12:49:30] what database?
[12:49:41] the temp parsing table for WCDO?
[12:49:53] this wcdo one
[12:49:57] yeah
[12:50:12] it's temporary, only needed for a while
[12:50:30] but quite sizeable, at 6GB
[12:51:14] besides being needed for WCDO, it doesn't bring much benefit to other projects that I can see
[12:51:24] hmm
[12:51:59] the huge resource- and time-consuming step that I would like to see optimized is how that table is made
[12:53:35] atm the /public/dumps/public/wikidatawiki/entities/latest-all.json.gz is read line by line to make that table
[12:54:35] how would you do it otherwise?
[12:55:07] that is the big question
[12:55:21] that file, atm, is 29GB, and likely to keep growing
[12:55:59] queries to wikireplicas (once the data can be found there) would probably be faster to derive the subset data
[12:56:53] yeah, most likely. /me is unfamiliar with wikibase tables
[12:57:14] (though it will still be a resource intensive task, checking every wikidata item will never be a trivial task)
[12:58:01] but filtering is a task that mysql is better at than reading a file and choosing parts of it
[13:02:35] sorry, i was away having lunch
[13:02:55] chicocvenancio: i'm not sure i understood how transforming the dump into mysql could be done
[13:03:13] i think it is the same nature of the data that doesn't allow it. for each item you need the property and the target item.
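(The dump-reading step described at 12:53:35 — /public/dumps/public/wikidatawiki/entities/latest-all.json.gz read line by line, pulling the properties and target items out of each entity — looks roughly like this. The choice of P625, coordinate location, as the geolocation property, and counting rows instead of inserting them, are illustrative assumptions, not mmecor's actual script.)

    import gzip
    import json

    DUMP = '/public/dumps/public/wikidatawiki/entities/latest-all.json.gz'

    geolocated = 0
    sitelink_rows = 0
    with gzip.open(DUMP, 'rt', encoding='utf-8') as dump:
        for line in dump:
            line = line.rstrip().rstrip(',')
            if line in ('[', ']', ''):
                continue  # the dump is one big JSON array, one entity per line
            item = json.loads(line)
            claims = item.get('claims', {})
            # P625 stands in here for the geolocation properties; the real script
            # sorts many property sets into their own tables
            # (wd_geolocated_property, wd_language_properties, ...).
            if 'P625' in claims:
                geolocated += 1  # would insert item['id'] into wd_geolocated_property
            # each sitelink is a (qitem, langcode, page_title) row for wd_sitelinks
            sitelink_rows += len(item.get('sitelinks', {}))
    print(geolocated, sitelink_rows)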
[13:03:46] we would end up replicating the sparql endpoint
[13:04:52] and in the end, since i need to check all the wikidata items (because there are many properties i take into account), it would be slower than iterating the dump
[13:06:54] if the size of the vps disk is key, then it could be possible to just store the dump+sqlite3 and the generated dataset. 50GB working space for peak times. then, at every generated dataset, move it somewhere else.
[13:07:59] you should be able to get 60GB working space easily with an m1.large instance
[13:09:03] I wish I knew how sparql works
[13:09:38] sparql is good for precise queries on little data. you can't ask many things at the same time or the endpoint dies
[13:33:20] mmecor: it is not about transforming the dump into mysql
[13:33:27] the dump was mysql
[13:33:57] mmm i'm not understanding
[13:34:15] all wikibase has, to generate the site itself, is the db
[13:34:30] (and code, but the data is in mysql)
[13:35:01] so ALL the data in the dump must be in the production DB
[13:35:08] wikidata runs in a wikibase mediawiki
[13:35:12] yes
[13:35:14] i see
[13:35:34] the wikireplicas should have all public data
[13:35:45] is the data indexed and can it be retrieved?
[13:35:49] so if it is present in the dump, it should be present in the replicas
[13:36:16] we are talking about the wikidata dump
[13:36:25] i don't use any other dump than the wikidata one
[13:36:41] for wikipedia language editions i use the replicas
[13:36:54] yes, that is the step I'm talking about
[13:37:18] as I mentioned in the task, this is not urgent
[13:37:21] to index all the data from wikidata into a mysql db?
[13:37:38] the properties?
[13:37:49] it is indexed somewhere in the db
[13:37:55] has to be
[13:38:04] i would imagine it there, but not indexed or not queryable
[13:38:10] as plain text for each page
[13:39:06] there would need to be a table with properties, similar to the 'outlinks' table
[13:39:16] then we need to discuss if such an index should or should not be created
[13:39:18] !log rcm Oxygen: Package updates
[13:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:39:23] !log rcm CAC: Package updates
[13:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:39:24] origin, target, property
[13:39:32] !log rcm Tin: Package updates
[13:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:39:37] !log rcm Xenon: Package updates
[13:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:39:41] !log rcm Neon: Package updates
[13:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:39:52] !log rcm Xenon: Phab update
[13:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:40:20] !log rcm Neon: Restarting webservice
[13:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:41:09] !log rcm All hosts: apt-get autoremove
[13:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
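(For completeness, the /tmp approach zhuyifei1999_ suggested at 12:34:14 — copy the sqlite3 file off NFS to node-local disk, query the copy, and delete it afterwards — is only a few lines; the file names here are hypothetical.)

    import os
    import shutil
    import sqlite3
    import tempfile

    # Hypothetical path: the master copy of the sqlite3 file lives in the tool's
    # NFS-backed home directory.
    nfs_copy = os.path.expanduser('~/wcdo.sqlite3')

    tmp_dir = tempfile.mkdtemp(prefix='wcdo-')       # node-local disk, not NFS
    local_copy = os.path.join(tmp_dir, 'wcdo.sqlite3')

    shutil.copy(nfs_copy, local_copy)                # one sequential read off NFS
    try:
        conn = sqlite3.connect(local_copy)
        # ... run the per-language queries against the fast local copy here ...
        conn.close()
    finally:
        shutil.rmtree(tmp_dir)                       # delete the file, as suggested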