[20:07:28] ugh, phpunit firejail failures are annoying when running as www-data [20:56:56] Krinkle, the part I had not fully understood what Tim was saying and which Scott remphasized in his phab comment is the fact that "hasOrMadeRecentMasterChanges" will always be false for Parsoid .. so we'll never retry master in scenarios of db lag. [20:57:37] anyway, check https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/621621 since Scott added code for retries and I figure someone from perf team should chime in as well. I'll add AaronSchulz to reviewer list as well. [20:58:55] subbu: ack, I assumed these were requests as part of DT saving an edit, hence a post request. [20:59:24] which I suppose is still the case, but these are, I guess, either child http reqs or preflight checks before the actual post [20:59:41] and thus don't work the way MW is written to work when receiving an edit. [21:00:20] so, it is part of DT saving an edit, but that request is spread across a http api request to restbase / parsoid. [21:00:29] so, not the same request and so different db connections. [21:00:46] I'm still a bit skeptical on how it is that you get a handle on a rev ID based on a title but then later in the same user session not be able to find that same rev ID. ChronolyProtector should generally make that impossible. [21:00:48] so, yes, child http reqs. [21:00:54] s/skeptical/puzzled by [21:01:35] Krinkle, yes, so, yes .. we don't know either and generally put it on the db lag problem ... but that seems like the only possible explanation in combo with parsoid bug that triggers this corruption. [21:01:46] see scott's long discussion and matmarex's comment and then scott's response there. [21:02:00] whatever be the case, it expose a bug in parsoid wrt failures on rev fetches. [21:02:09] does DT pass the rev id to parsoid, or does parsoid get it on its own from elsewhere? [21:02:16] dt passes it [21:02:22] does it pass on CP data? [21:02:28] what is cp? [21:02:34] (and does parsoid honour that?) [21:02:37] ChronologyProtector [21:02:44] oh, no, cp data. [21:02:57] parsoid doesn't get cp data from dt. [21:03:23] the REST API that it uses doesn't have allowance for that. [21:03:36] in that case, my guess would be that DT is perhaps querying the rev id from the master and/or got it from a very up to date replica, passes it on to parsoid without CP data, and then parsoid gets a random replica that is fairly up to date but slightlty more lagged. [21:03:56] right, I suppose the new REST api would overcome that issue as well [21:03:58] would/could [21:04:21] so, the bug replication also requires DT edit request having gotten a stale revid (i.e not latest, but one before) in the first place. [21:05:26] subbu: using _LATEST seems fine here, although ideally this woud behave like POST, which is, we try replica first and fallback to master (once). I don't know if _LATEST does that or if that goes straight to master.either way that is something we can fix. [21:05:31] this bug requires a couple of exceptional conditions (but both dependent on DB replica lag) to trigger the parsoid bug. [21:06:01] stale rev id does not require lag afaik. it just means someone started and saved their edit after you started it. [21:06:16] I recall encountering this or a similar bug a few months ago in Beta with DT as well by using two browsers. [21:06:24] and it would eat the outer/second edit. [21:06:25] in this case there is 20 mins gap between them ... so, that is why we went with lag. [21:07:04] that much lag would mean DBA have been woken up and depooled the server already and likely also woken up half the SRE department with it as that seems highly improbably in prod. [21:07:15] :-) ok [21:07:38] so, we are still left to explain how this whole thing happened .. but, scott surmised via SAL log that some depooling / repooling of servers was going on. [21:08:34] but, anyway, you really should read that comment since it is a puzzle and we spent 2-3 days reading code and trying different things and this is the only scenario that we arrived at. but, independent of that we have a parsoid bug which we'll fix. [21:08:37] Note that READ_LATEST will not unconditionally query the master, it just means it will consider a master fallback [21:08:51] so the double call scott suggested (I think?) is not needed [21:09:21] actually, no, scott is right. [21:09:24] bit flags are confusing [21:09:46] i read the code and afaict read_latest uses master as primary and no fallback. [21:10:06] I guess we coudl add a flag for what I described basicaly saying "act like post, include a master fallback becuase the ID is believed to exist" [21:10:09] yeah [21:10:48] e.g. turn ( && hasOrMadeRecentMasterChanges ) into ( && ( hasOrMadeRecentMasterChanges || FLAG_KNOWN ) ) or some such [21:10:59] WFM. that is along the lines of what tim was suggesting that we check with you all about. [21:11:08] but for now what scott describes would accomplish the same [21:11:36] ok. so, then, i am going to +2 the patch once he fixes my other review comments without needing a +1 from you all. :) [21:12:09] and once you add a flag if you think it i snecessary (we are also adding logging to see how often this happens), we can change that code. [21:12:34] in any event, this woudl likely add master load especially if the revID is not user provided at the start of editing but rather programmatically at the last miniute by DT server-side (which I suspect is the case, because it's very smart about avoiding edit conflicts, which means it's generally always based on the latest edit). Which, being a POST request, menas it's not unlikely for it to use the master, and thus likely replicas don't [21:12:34] have it. [21:12:57] That should really be addressed with CP, I think. [21:13:47] so that parsoid's fallback will be rare enough that either the master load is almost never actually happening (only when CP fails once a week for 1 user), or that we can simply turn it into an error because it's rare enough. either way. [21:13:57] ok. [21:14:30] is there a roadmap towards parsoid being called without restbase in the middle? [21:14:55] and/or for DT to invoke Parsoid directly programmatically as a MW service? [21:15:21] https://phabricator.wikimedia.org/T250500 [21:15:32] and depends on cpt [21:15:51] and right now have indicated that end-of-q3 would be a good target time for to get there. [21:16:09] Calling out to an external service during a POST request to obtain information derived from something the POST handler got from the master doesn't seem very viable as it generally just mean you spend most/all of your req latency waiting for CP. [21:16:17] and i am supposed to fill out https://phabricator.wikimedia.org/T259485 for that which i'll do early next week. [21:16:36] Krinkle, one step at a time integrating parsoid into core. :-) [21:16:39] in the old editPage path it kinda just works becuase it's all in the same process so the data is shared. [21:16:52] right [21:16:54] we wish it were easier and quick ... but, we are getting there. :) [21:17:08] is this one of those calls that currently benefits from caching? [21:17:41] i am going to quickly swithc to a qn. Tpt was asking me about the extension api ... and come back here after i finish with that. [21:17:49] k [21:57:37] Krinkle, ok, here now again. [21:59:37] so, calling parsoid programmatically is also something that will happen as part of the integration. needs some vrs work .. we thought we would do that as part of the LTS 1.35 release, but didn't get to it. [22:00:31] it is currently also blocked by the php 7.2 gc bug that leads us to turn off GC in parsoid. but, when parsoid gets integrated into another request, that is probably a dicey thing to do. [22:01:11] so, it seem till we move to php 7.3 we are stuck with having parsoid still being invoked via HTTP instead of via internal library calls, so we can isolate the GC-bug within the parsoid boundary. [22:37:15] subbu: ah yeah, that makes it difficult to do inprocess currently