[00:18:15] Krinkle: not sure why that would be slow (maybe wrong index usage?)
[00:20:16] I assumed it's not that the query is slow. I'm assuming it's slow because we're sleeping 0.5s every 100 deletes, and we presumably have enough shards and data that it just takes that long if you add up the reasonably quick deletes + all those sleeps
[00:44:15] AaronSchulz: looks like I didn't write down why > was "easier to reason about" than >=. Thinking about it now since the last patch set kept the additional code path for a subset of operations where I guess the N ms isn't enough. I thought it was. But if it wasn't and we want to build on top of other writes on the same key within that time frame, perhaps we should just use >= in that case. Or throw an exception if we think it won't be used
[00:44:15] for any services that need sqlbag with multi-master config
[00:46:10] (Or help me remember/understand why it is better not to)
[00:48:16] Krinkle: query seems fast in terms of index use on prod...not sure why the replag happened when there wasn't sleep()
[00:48:39] AaronSchulz: no replag afaik
[00:48:51] we can try lowering it down again
[00:49:02] Oh you mean back then
[00:49:39] I assume the lag happened then due to too many delete queries to process quickly enough for some replicas?
[00:51:01] Although given a single thread issuing all deletes, and waiting for each delete to be applied on the master, I suppose a replica should generally be able to apply it just as fast
[00:51:15] It's not like there are multiple threads competing
[00:52:05] Maybe we can ask Jaime if he has a theory.
[00:52:44] Btw I liked how you handle incrWithInit by keeping the same token and sending increment to the DB server.
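A rough sketch of the arithmetic behind the 00:20:16 guess, showing how the sleeps can dominate the purge time even when each delete batch is fast. Only the 0.5s sleep and the batch of 100 come from the log. The function name, the row count, the per-batch delete time, and the reading of "every 100 deletes" as "every 100 deleted rows" are assumptions for illustration, not measurements.

```python
import math


def estimate_purge_seconds(total_rows, batch_size=100, sleep_s=0.5, delete_s=0.01):
    """Return (delete_seconds, sleep_seconds) for a batched purge.

    Assumes one sleep of `sleep_s` after every batch of `batch_size` rows,
    and a hypothetical fixed cost of `delete_s` seconds per batch.
    """
    if total_rows <= 0:
        return (0.0, 0.0)
    batches = math.ceil(total_rows / batch_size)
    return (batches * delete_s, batches * sleep_s)


if __name__ == "__main__":
    # Hypothetical: one million expired rows on a single shard.
    delete_total, sleep_total = estimate_purge_seconds(1_000_000)
    total = delete_total + sleep_total
    print(f"deletes: {delete_total:.0f}s, sleeps: {sleep_total:.0f}s, "
          f"total: {total / 60:.0f} min")
```

Under these made-up numbers the sleeps account for almost all of the run time, which fits the idea that the query itself is not the slow part. Halving `sleep_s` would roughly halve the total.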
[00:56:07] https://mariadb.com/kb/en/parallel-replication/#conservative-mode-of-in-order-parallel-replication might help
[00:57:57] though, like you said, the DELETEs were pre-batched and coming from one connection...so not sure how much it would help
[00:58:33] row-based replication wouldn't help if the scan is fast (and we need statement-based anyway)
[01:21:30] Okay, there's a lot of information there. I think this means there's an option someone could have chosen to use, that opts-out of ordered replication. Is that right? Or is this describing interval strategies all which Mariadb may decide to use? If the latter, would that affect us? I would think there are "normal" options that would willfully replicate statements limited to a primary key out of order unless the option is known to be
[01:21:30] unsafe for most normal web apps and something very specialised that a third party DBA could decide to enable and expect a working outcome.
[01:21:45] I don't have a good sense of where this sits on that spectrum
[01:22:37] ... would think there are *not* "normal" ...
[01:25:31] ugh, too many typos. Let me try that again.
[01:28:23] The sense I get from that page is that there exist options that opt out of ordered replication. What I'm not sure of is whether these are options a sysadmin/DBA would choose, or whether it describes internal strategies MariaDB may decide to use in different scenarios. If the latter, I would assume it does not break things like what we're thinking of doing, e.g. simple primary key operations; replicating them out of order seems rather hostile and
[01:28:23] broken. If the former, how "normal" are these in your opinion? Is it the kind of thing a sysadmin might choose and would normally work with existing apps, and so MediaWiki could/should perhaps (continue to?) "support" that? Or are they the kind of options that a sysadmin might lean into as a deep optimisation strategy based on specialised knowledge for narrow cases where it is safe?
[01:56:23] Krinkle: there is also https://phabricator.wikimedia.org/T85266
[02:11:52] AaronSchulz: I've read the mariadb page a bit more thoroughly now, but I still don't see anything worrisome there. Even the aggressive and out-of-order sections look surprisingly reasonable, but even so they are non-default and come with notable caveats that I think mean nobody would just happen to have these enabled and expect general web apps to just work.
[02:12:15] is your thinking to support these as an opt-in future optimisation for WMF?
[04:24:27] Krinkle: mostly just wondering why key purging was slow...maybe it's not anymore
[22:18:41] AaronSchulz: ah sorry, I thought that link was in reply to sqlbagostuff modtoken work
[23:47:44] Krinkle: I hope it's not an issue there (we also have to decide on script vs randomized purges)