[07:25:52] godog: thank you !
[15:58:23] phedenskog: what problem are you having with Jekyll? I've always found it just works straight up.
[15:59:21] I'm not a fan of 11ty due to its undisciplined dependency graph. I hope upstream improves that. Day to day it's almost the same in terms of markdown and layout files
[19:32:24] AaronSchulz (or anyone in perf team really), can i / we get your input on https://phabricator.wikimedia.org/T259855#6398435 wrt how we should handle retries when rev fetches fail?
[19:50:56] subbu|lunch: page views and edits are both started based on a replica query for rev-latest by title afaik. Upon submission the base revision should be passed so as to allow edit conflict detection and avoid overwriting edits. Once within that pipeline you deal with revIds, not titles. revIds are fetched from a replica first, but if not found, we query the master; this is of course if and only if the chain started with a known revID, otherwise
[19:50:56] there is no reason to assume the master will have it. Also, there is ChronologyProtector, which makes it very hard to not get a revID from the replica if you made it yourself, as we will explicitly wait for the replica in that case.
[19:51:15] which part of this is not honoured in DT and/or is suspected to not work correctly?
[19:54:24] (and if the master doesn't have it, and/or we decided not to check; then that results in an exception)
[19:54:27] should*
[19:54:41] sounds like maybe something is unhelpfully trying to proceed regardless?
[19:55:33] Krinkle, it would be helpful if you left a comment on the phab task. But, TLDR, Tim felt that there was a scenario where we might get a revid if we tried against master explicitly ... scott references that in his long comment.
[19:55:59] so, i suppose the question for you all is: is that true? if so, how should we retry?
[19:56:30] subbu|lunch: we generally don't do retries, but there are two closely related things we do do: 1) if upon sending a db query it turns out the db connection was lost, then if it's a safe select query, we reconnect to another replica and send it again, and 2) in RevisionStore we fall back to master if a replica doesn't have a rev ID that we expect to exist.
[19:56:58] Yeah, that's how RevStore has always worked for probably 10y I guess.
[19:57:04] Is this code path bypassing that?
[19:57:10] we are dealing with scenario 2 .. revision store ... when parsoid uses getRevisionById().
[19:58:42] https://gist.githubusercontent.com/subbuss/11eb7810b72eab3d84be67be997530a0/raw/4280a20c3b419c6655e222282a2d05bbb70926b8/gistfile1.txt
[19:59:55] this is also a serious replication lag scenario ... Scott found logstash entries that indicate 60 hours, unless he is misinterpreting the logstash entry .. probably related to db server pooling / maintenance work
[20:00:04] all info in the phab task.
[20:00:36] The lag time reported in logstash is meaningless
[20:00:46] there's known problems with that
[20:01:01] ok :)
[20:02:27] Hey all, I have a question about whether something should go thru a performance review or not. Can I drop a few lines here or is there a better channel for this?
[20:03:14] subbu|lunch: so what's the observable issue right now? I don't remember us limiting the master fallback to POST requests, but that sounds plausible indeed. In general it doesn't make much sense for a request to have a known rev ID that 1) isn't its own (ChronologyProtector addresses that) and 2) isn't far enough into the past to be public and replicated
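A minimal sketch of the replica-first lookup with an explicit master retry that the discussion above refers to. It uses the real RevisionStore::getRevisionById() and IDBAccessObject::READ_LATEST APIs, but the wrapper function and its condition are hypothetical, and it does not reproduce RevisionStore's own internal (conditional) fallback logic.

```php
<?php
// Hedged sketch only: a caller-side illustration of "read from a replica
// first, and only ask the master when the revision is expected to exist"
// (e.g. because this request created it). Not RevisionStore's actual
// internal fallback, which is conditional as discussed above.
use MediaWiki\MediaWikiServices;

function lookupRevision( int $revId, bool $expectedToExist ) {
	$store = MediaWikiServices::getInstance()->getRevisionStore();

	// Normal path: read from a replica.
	$rev = $store->getRevisionById( $revId );

	if ( $rev === null && $expectedToExist ) {
		// Explicit master read; without a reason to expect the revision
		// to exist, there is no point in hitting the master.
		$rev = $store->getRevisionById( $revId, IDBAccessObject::READ_LATEST );
	}

	return $rev; // may still be null; see the rest of the discussion
}
```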
[20:03:35] so, parsoid is currently not checking the return value of getRevisionById() for null.
[20:03:58] so, we need to figure out what Parsoid should do when it gets null from revision store (which is the source of corruption in that phab task).
[20:04:05] subbu|lunch: but assuming that it is indeed now limited to POST, and that your request is not a POST, it would throw an exception, right? So how could it cause corruption?
[20:04:15] do we retry (since) or give up and return a http 4xx or 5xx to the client?
[20:04:52] if it doesn't check it, does that not cause a fatal later with a method call on null?
[20:04:58] corruption is from parsoid-specific details / mechanics .. but one of the causes is Parsoid buggily not checking the return value of getRevisionById() and assuming the revision exists.
[20:05:07] right.
[20:05:19] and later code incorrectly falls back to a different revision (the latest revision), and serializing against that causes corruption.
[20:05:37] yeah, I would expect it to behave the same way as if a bogus 'foobar' or other really high/unknown rev ID was passed in, 4xx seems appropriate indeed.
[20:05:38] ( i suppose I should stop pretending i am still eating lunch )
[20:06:24] ok .. so, this would be a case where, while the revid actually exists (because DT fetched it), Parsoid would return a 4xx because of db issues.
[20:06:31] and so you are saying, don't retry.
[20:06:41] Note that MW also discourages edits on the edit page if, after trying N different replicas, it could not find any with low replication lag.
[20:07:12] Yeah, there's nothing to retry anyway, short of bypassing rdbms you can't get a conn with less lag
[20:07:32] rdbms already retries multiple replicas to ensure a conn with lag below the 10s threshold.
[20:07:43] if it can't, then that should be considered final and not bypassed.
[20:07:53] i don't fully grok this, but if you leave a summary of your recommendation on the phab task, that will help us take next steps.
[20:18:40] subbu: done, let me know if that covers it
[20:28:06] thanks, that is good enough for us. we'll not retry and just return a 4xx error, and hopefully the user will retry their edit and hope for better luck.
[20:38:44] subbu: it's probably worth measuring this with a counter and an alert somewhere, perhaps not specifically this but the broader case of "rejected save attempt" or something like that.
[20:39:10] oh yes, we'll log them.
[20:44:37] Krinkle: pickReaderIndex() even tries to find a db with <= 1 sec of additional lag since any recent ChronProt update before settling on ones with more lag (<= 6 sec or so)
[20:45:23] (mostly an optimisation to avoid blocking in doWait() given the recent CP replica positions to check)
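A hedged sketch of the behaviour agreed on above, not actual Parsoid/DiscussionTools code: check getRevisionById() for null, count the rejection, and return a client error instead of silently falling back to the latest revision. The counter name and the respondWithError() helper are illustrative placeholders.

```php
<?php
// Hedged sketch of the outcome of this discussion, not actual Parsoid code.
// If the revision cannot be loaded, do not retry and do not substitute the
// latest revision; count the rejection and report a 4xx so the user can
// retry their edit.
use MediaWiki\MediaWikiServices;

function getRevisionOrFail( int $revId ) {
	$services = MediaWikiServices::getInstance();
	$rev = $services->getRevisionStore()->getRevisionById( $revId );

	if ( $rev === null ) {
		// Counter for alerting on "rejected save attempt"-style failures,
		// as suggested at 20:38:44. The metric name is made up.
		$services->getStatsdDataFactory()->increment( 'edit.rejected_save_attempt' );

		// respondWithError() is a hypothetical stand-in for however the
		// endpoint turns errors into an HTTP response (a 4xx per the chat).
		respondWithError( 404, "Revision $revId could not be loaded" );
		return null;
	}

	return $rev;
}
```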