[20:59:45] Urbanecm: ping
[20:59:54] Platonides: pong
[20:59:57] what's up?
[21:00:22] I was seeing that you had https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=11456251 pending, I think it can be directly closed?
[21:01:03] not exactly
[21:01:11] it was reverted, see the page
[21:01:28] oh :(
[21:49:59] Hi, I've been working on a Rust parser for the SQL dump files, and noticed that the cat_pages field in English Wiktionary's category table is sometimes negative. For all of these rows, the category page doesn't exist and has never been created. I wondered if anyone has noticed this and knows why it might be.
[21:54:10] (The negative cat_pages part, not the page not existing.)
[21:56:44] do you have an entry to search?
[21:57:53] What do you mean?
[21:58:01] 369285 entries in the category table
[21:58:07] 21 with cat_pages < 0
[21:58:25] Right, I did that query on toolforge.
[21:59:29] I'm guessing it's some drift
[21:59:36] In the same way it under/over counts
[21:59:47] well, it's an undercount
[21:59:55] it's an odd one
[22:00:01] it's purposely allowed
[22:00:11] >These are signed to make underflow more obvious.
[22:00:41] Ahh, I see.
[22:01:12] I think there's an open task about running recountCategories
[22:01:27] So the fields representing cat_pages, cat_subcats, and cat_files need to be signed in my Rust library.
[22:01:31] * Reedy looks
[22:01:37] Yeah
[22:02:12] I made the structs a while back, and now I wonder how carefully I checked whether the table definitions said unsigned.
[22:02:34] As opposed to using unsigned where it was logical.
[22:06:38] If I were smart I'd parse tables.sql to figure this out... but that's a lot of work...
[22:08:14] the database schema is moving to a more structured format
[22:08:22] but will take a while before it's all done
[22:15:49] Hmm, what sort of format?
[22:16:34] json!
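The point about signed counter fields translates directly into the Rust struct: `cat_pages`, `cat_subcats`, and `cat_files` should be `i32` rather than `u32`, so that a drifted counter decodes as a small negative number instead of wrapping to something near `u32::MAX`. A minimal sketch (the struct name and derives are my own; field names follow the `category` table):

```rust
/// One row of MediaWiki's `category` table as it appears in the SQL dumps.
/// The count fields are deliberately signed in the schema so that counter
/// underflow shows up as a negative value rather than wrapping.
#[derive(Debug, Clone, PartialEq)]
struct Category {
    cat_id: u32,
    cat_title: String,
    cat_pages: i32,
    cat_subcats: i32,
    cat_files: i32,
}

fn main() {
    // A drifted row like the 21 observed on English Wiktionary
    // (the id and title here are hypothetical):
    let row = Category {
        cat_id: 12345,
        cat_title: "Example".to_string(),
        cat_pages: -1, // visible underflow, not a wrapped unsigned value
        cat_subcats: 0,
        cat_files: 0,
    };
    assert!(row.cat_pages < 0);
    println!("{:?}", row);
}
```

Had the field been `u32`, the same dump value would either fail to parse or silently become a huge positive count, which is exactly the "make underflow more obvious" property the schema comment describes.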
[22:16:35] Another oddity was imagelinks, which has a different field order in tables.sql and in the imagelinks.sql dump file.
[22:17:03] https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables.json
[22:17:04] Excellent! :) That would make my life a lot easier.
[22:17:13] It will happen occasionally depending on how schema changes were applied
[22:19:31] (huh, one `notNull` among all the `notnull`s)
[22:20:20] Let's fix that
[22:21:04] Hmm, will the format distinguish UTF-8 and other binary fields (for instance sortkeys, which have sometimes been truncated in the middle of a byte sequence)?
[22:21:42] For Rust, that is important because those would translate to two different types.
[22:23:00] I think that is set at the database level...
[22:23:49] Oh, cl_sortkey can still be truncated UTF-8, while newly inserted cl_sortkey_prefix is supposed to be UTF-8.
[22:24:19] +2ed the change
[22:27:08] Well, I get that UTF-8 isn't enforced by the database, but still some fields are going to always be UTF-8, assuming no errors, right?
[22:28:14] I think all modern textual fields will have utf-8
[22:28:28] except some really old page revisions
[22:28:35] which wouldn't be exposed, anyway
[22:30:01] I guess it would be a bad idea to officially document this in the schema unless it's guaranteed to be enforced somehow...
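Once the schema is available as JSON, deriving the Rust field types mechanically becomes feasible. A sketch of the mapping step, assuming an abstract-schema column carries a type name plus an unsigned flag (the exact type names and flag layout in the real tables.json should be checked against the file itself):

```rust
/// Map an abstract-schema column description to a Rust type name.
/// The type names ("integer", "bigint", "binary", "blob") and the
/// separate `unsigned` flag are assumptions modeled on the discussion
/// above, not the verified tables.json vocabulary.
fn rust_type(sql_type: &str, unsigned: bool) -> &'static str {
    match (sql_type, unsigned) {
        ("integer", true) => "u32",
        ("integer", false) => "i32", // e.g. cat_pages: signed on purpose
        ("bigint", true) => "u64",
        ("bigint", false) => "i64",
        // Binary columns may hold bytes that are not valid UTF-8:
        ("binary", _) | ("blob", _) => "Vec<u8>",
        // Textual columns expected to be UTF-8:
        _ => "String",
    }
}

fn main() {
    println!("cat_pages -> {}", rust_type("integer", false));
    println!("cat_id    -> {}", rust_type("integer", true));
    println!("sortkey   -> {}", rust_type("binary", false));
}
```

A generator reading tables.json could walk each table's column list and emit struct fields with these types, which would also catch field-order mismatches like the imagelinks one automatically.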
[22:30:29] as in you can't insert this or that value into a field if it's not valid UTF-8
[22:30:58] mediawiki supposts a few encodings
[22:31:03] *supports
[22:31:12] you can use utf8 or utf8mb4
[22:31:22] in which case the database would bark at non-utf8
[22:31:31] but you could also use a binary field
[22:31:41] or even a binary column disguised as latin1
[22:32:26] ahh
[22:33:24] utf8 being the one with only up to 3-byte code points if I remember right
[22:33:57] only the bmp
[22:34:14] mysql story
[22:34:24] yeah
[22:39:52] So, so far the only fields in SQL files from the dump that I represent as non-UTF-8 (raw bytes) are cl_sortkey and cl_sortkey_prefix. The rest are expected to be UTF-8. Any that I'm missing?
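The UTF-8 vs raw-bytes split maps cleanly onto Rust's type system: fields expected to be valid UTF-8 become `String`, while sortkey-style fields that may have been truncated mid-sequence stay `Vec<u8>`. A sketch with hypothetical field choices (only `cl_sortkey`'s byte type reflects the discussion; the rest of the struct is illustrative):

```rust
/// Illustrative slice of a categorylinks row. `cl_to` is expected to be
/// valid UTF-8; `cl_sortkey` may be cut off inside a multi-byte sequence,
/// so it must be kept as raw bytes.
#[derive(Debug)]
struct CategoryLink {
    cl_from: u32,
    cl_to: String,
    cl_sortkey: Vec<u8>,
}

fn main() {
    // "é" is 0xC3 0xA9 in UTF-8; truncating after 0xC3 leaves invalid UTF-8,
    // which is exactly why cl_sortkey cannot safely be a String.
    let truncated = vec![b'c', b'a', b'f', 0xC3];
    assert!(std::str::from_utf8(&truncated).is_err());

    let link = CategoryLink {
        cl_from: 1,
        cl_to: "Example".to_string(),
        cl_sortkey: truncated,
    };
    // For display, a lossy view substitutes U+FFFD for the broken tail:
    println!("{}", String::from_utf8_lossy(&link.cl_sortkey));
}
```

Decoding such a field with `String::from_utf8` would fail on real dump rows, so the raw-bytes representation is the only lossless choice for sortkeys.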