[00:06:51] <AaronSchulz>	 Krinkle: also, are we really doing onhost for wancache soon?
[00:07:06] <Krinkle>	 AaronSchulz: well, it's up to us I suppose.
[00:07:40] <Krinkle>	 but yeah, once we've confirmed PC is working fine and maybe one more double check on the settings, I suppose we could start enabling it for functional testing e.g. on group0 wikis.
[00:11:25] <AaronSchulz>	 Krinkle: any luck with https://phabricator.wikimedia.org/T268308 ?
[00:11:49] <Krinkle>	 where we go from there, I'm not sure. we have wancache latency metrics already which are fairly low-level and presumably include any memc latency in a fairly visible way, e.g. SqlBlobStore miss latency. but we could do more. we could e.g. do a benchmark from CLI on an app server for 10 commonly used keys and for 10 absent keys, and hammer the local mcrouter with and without on-host prefix for those and see how they perform.
[00:12:05] <Krinkle>	 what else should we consider?
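A minimal sketch of the CLI benchmark described above, assuming the local mcrouter speaks the memcached text protocol. The address and the on-host routing prefix are placeholders that do not come from the log, and `memc_get`, `with_prefix`, `bench` and `compare` are hypothetical helper names.

```python
import socket
import statistics
import time
from typing import Callable, Dict, Iterable, List, Optional

# Placeholders, NOT taken from the log: adjust to the real setup.
MCROUTER_ADDR = ("127.0.0.1", 11213)
ONHOST_PREFIX = "/hypothetical/onhost/"


def memc_get(stream, key: str) -> Optional[bytes]:
    """Send one text-protocol `get`; return the value, or None on a miss."""
    stream.write(f"get {key}\r\n".encode())
    stream.flush()
    value = None
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("server closed the connection")
        if line.startswith(b"VALUE "):
            # Header is: VALUE <key> <flags> <bytes>
            size = int(line.split()[3])
            value = stream.read(size)
            stream.read(2)  # consume the trailing \r\n after the data
        elif line == b"END\r\n":
            return value
        else:
            raise RuntimeError(line.decode(errors="replace").strip())


def with_prefix(prefix: str, keys: Iterable[str]) -> List[str]:
    """Prepend a routing prefix to every key."""
    return [prefix + key for key in keys]


def bench(get: Callable[[str], object], keys: Iterable[str], rounds: int = 1000) -> Dict[str, float]:
    """Time `get` for each key, `rounds` times over, and summarise latency."""
    keys = list(keys)
    samples = []
    for _ in range(rounds):
        for key in keys:
            start = time.perf_counter()
            get(key)
            samples.append(time.perf_counter() - start)
    samples.sort()
    p99_index = min(len(samples) - 1, int(len(samples) * 0.99))
    return {
        "n": len(samples),
        "p50_ms": statistics.median(samples) * 1000,
        "p99_ms": samples[p99_index] * 1000,
    }


def compare(hot_keys: List[str], absent_keys: List[str], rounds: int = 1000) -> Dict[str, Dict[str, float]]:
    """Benchmark hits and misses, with and without the on-host prefix."""
    results = {}
    with socket.create_connection(MCROUTER_ADDR) as sock:
        stream = sock.makefile("rwb")
        get = lambda key: memc_get(stream, key)
        for label, keys in (("hit", hot_keys), ("miss", absent_keys)):
            results[label] = bench(get, keys, rounds)
            results[label + "+onhost"] = bench(get, with_prefix(ONHOST_PREFIX, keys), rounds)
    return results
```

Calling `compare()` with 10 commonly used keys and 10 absent keys gives four latency summaries to set side by side. It reuses one connection, so it measures steady-state latency rather than connection setup.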
[00:20:30] * AaronSchulz is just thinking about purges
[00:22:33] <Krinkle>	 right, those will take up to 10 seconds to apply, not unlike the tolerance we have cross-dc
[00:22:40] <Krinkle>	 although nowhere near as high of course
[00:23:08] <Krinkle>	 two overlapping sliding windows, worst case scenario <=10s
[00:24:41] <Krinkle>	 and as part of general tolerance toward db lag, on the pattern that we don't query cache when we can't query replica
[00:27:08] <Krinkle>	 AaronSchulz: is there a kind of test, trial or study you think we should do as part of this? e.g. in isolation somewhere, or on a test wiki etc.
[00:42:38] <Krinkle>	 AaronSchulz: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/643383/1/includes/libs/CSSMin.php
[00:57:19] <AaronSchulz>	 Krinkle: not sure, it would need more MW coding for when to use the prefix
[00:57:58] * AaronSchulz is not comfortable with using that generally for wancache
[00:58:26] <Krinkle>	 AaronSchulz: not sure I follow. I would assume we'd enable it for WANCache: v: only of course, if that isn't implemented yet (I thought we did) we'll need that of course.
[00:58:53] <Krinkle>	 not for the rest of raw memc BagO and non-v wancache, that much is clear
[00:59:12] <Krinkle>	 per-wiki can be done in wmf-config
[00:59:18] <AaronSchulz>	 I mean even just for WANCache✌️
[00:59:42] <AaronSchulz>	 heh
[01:01:51] <Krinkle>	 I know that WANCache-v is the vast majority of all traffic, but I thought we had already covered that in meetings with SRE to be the plan. I don't recall any specific ways within the WANCache contract that it wouldn't work, but anyways, any edge cases or unknowns we want to test for, should be written down at https://phabricator.wikimedia.org/T244340
[01:03:14] <Krinkle>	 I think overall we already suffer from too many options, so to the extent possible, at least mid-long term I don't think we should make this opt-in or opt-out. But it depends of course on what issues might or might not happen otherwise. I think we can solve this generally as part of the contract, feeling optimistic :)
[01:05:15] <AaronSchulz>	 the routing prefix would have to be used only for "v" keys by MW (prefixroute does not work since coalesceKeys uses suffixes)
[01:05:36] <AaronSchulz>	 WANCache could do that, but it's not in the code now
[01:06:37] <Krinkle>	 ah, right, the commit I was thinking of did it directly in BagO for parser cache.
[01:06:59] <Krinkle>	 yeah, WAN does already have support for adding mcrouter prefixes in some cases, but not yet for this one
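A rough sketch of the point about suffixes: if the key type sits at the end of the key, a router that matches on key prefixes cannot single out the "v" keys, so the caller has to decide per key whether to add the routing prefix. The key shape (`WANCache:<name>|#|<type>`) and the prefix string are assumptions made for illustration, and `route_key` is a hypothetical helper, not existing WANCache code.

```python
# Assumed for illustration only: coalesced keys end in "|#|<type char>",
# and the on-host route is selected by a key prefix.
ONHOST_PREFIX = "/hypothetical/onhost/"
VALUE_SUFFIX = "|#|v"


def route_key(key: str, prefix: str = ONHOST_PREFIX) -> str:
    """Add the routing prefix to value ("v") keys only; leave others alone."""
    if key.endswith(VALUE_SUFFIX):
        return prefix + key
    return key


if __name__ == "__main__":
    for key in ("WANCache:page:42|#|v", "WANCache:page:42|#|t"):
        print(route_key(key))
```

Only the first key gains the prefix. The second, which shares the same stem but has a different type suffix, keeps going through the default route.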
[01:45:12] <AaronSchulz>	 ah, Roan already +2 CSSMin
[01:45:37] <RoanKattouw>	 Yeah although the patch isn't working locally for me for some reason
[01:45:50] <RoanKattouw>	 I'm still getting the weird https:/w/extensions/... stuff
[01:50:15] <RoanKattouw>	 (while working on something unrelated I kept hitting this bug, my images aren't showing up)
[01:56:33] <Krinkle>	  RoanKattouw maybe a caching thing?
[01:56:49] <Krinkle>	 it's apcu cached by hash which I didn't bump
[01:57:45] <RoanKattouw>	 Oh I had the image path wrong lol
[01:57:48] <RoanKattouw>	 Needed another ../
[01:59:53] <Krinkle>	 looks like the apcu / MemoizedCallable is no longer used where I thought it was
[02:00:04] <Krinkle>	 only in WikiModule now
[02:00:17] <Krinkle>	 I guess it wasn't worth the overhead anymore 🤷
[02:02:54] * AaronSchulz finally gets https://phabricator.wikimedia.org/T193565
[02:04:58] <AaronSchulz>	 in theory, a similar thing could affect lots of random stuff...basically a Two Generals problem
[02:06:00] <Krinkle>	 I've put my money on wikibase somehow causing this
[02:11:51] <Krinkle>	 although it seems that apart from the installer, there is literally nothing in prod using changeDomain/selectDB/ or otherwise USE queries
[02:11:52] <Krinkle>	 https://codesearch.wmcloud.org/deployed/?q=(%5CbUSE%5Cb%20%5B%5EOI%5D%7CselectDomain%7CselectDB%7CdoSelectDomain)&i=nope&files=php%24&repos=
[02:12:06] <AaronSchulz>	 I think ExcimerTimer can fire the WMFTimeoutException callback at very inconvenient places...e.g. after "USE db" finishes but before Database updates the domain field
[02:12:15] <Krinkle>	 so that kind of suggests it has got to be an rdbms bug or external corruption in PHP or TCP
[02:13:35] <Krinkle>	 Hm.. yeah, execution continuing after a timeout could leave state in weird ways
[02:14:21] <Krinkle>	 I don't recall if we see this e.g. from deferred updates after a high-level try-catch rendering an error page
[02:14:31] <Krinkle>	 the last few on the task though seem to be from regular pre-send code
[02:15:31] <AaronSchulz>	 it's from deferredupdates, which currently catches the timeout and keeps going with other updates
[02:15:32] <Krinkle>	 and if it's an issue local to a specific db connection, then presumably it should be able to find any other db on the same host during the point where this exception happens?
[02:16:14] <Krinkle>	 oh, I misread your phab comment
[02:16:25] <Krinkle>	 that's not from the error the task is about but about the correlated timeout
[02:16:27] <Krinkle>	 I see
[02:17:05] <Krinkle>	 yeah, the ones where the query fails, seem to be from deferreds
[02:17:58] <Krinkle>	 maybe it needs some kind of neutral state where we say before dispatching the USE that the dbcon is now unusable / presumed dead / not on any DB.
[02:19:31] <AaronSchulz>	 run() still calls doPostOutputShutdown() after catching stuff
[02:19:43] <Krinkle>	 AaronSchulz: to confirm you suspect that pre-send something is connecting to s7 to a non-centralauth table, then something else pre-send is reusing that connection and selecting from centralauth, times out before the USE completes, and then post-send something uses that connection a third time, with a second centralauth query, and we find we already selected it so we just use it again?
[02:19:45] <AaronSchulz>	 Krinkle: something like that, yes
[02:19:57] <AaronSchulz>	 lots of things in theory could need that
[02:20:56] <AaronSchulz>	 I think USE does finish, the thread wakes up, then the timer fires and doSelectDomain() is left incomplete
[02:21:24] <AaronSchulz>	 if it thinks it failed, then it's still listed in LoadBalancer as free for use under the old domain
[02:22:54] <AaronSchulz>	 centralauth -> metawiki -> misused as centralauth again
[02:23:11] <AaronSchulz>	 $wgGlobalUserPageDBname is probably metawiki, right?
[02:24:21] <AaronSchulz>	 Hence, << table 'metawiki.globaluser' doesn't exist >> since it was meant to be on a handle with centralauth selected (centralauth.globaluser)
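The sequence discussed above can be reproduced in a toy model. This is a sketch under simplifying assumptions, not the rdbms code: `FakeServer`, `Handle`, `SafeHandle` and `query_as` are hypothetical names, and the timeout is simulated with an `interrupt` flag that raises right after the server has applied `USE` but before the client updates its cached domain. `SafeHandle` illustrates the "neutral state" idea raised earlier in the conversation.

```python
class Timeout(Exception):
    """Stands in for the asynchronous timeout exception."""


class FakeServer:
    """Server side of one connection: remembers which DB `USE` selected."""

    def __init__(self, tables):
        self.tables = tables  # maps db name -> set of table names
        self.selected = None

    def use(self, db):
        self.selected = db

    def select(self, table):
        if table not in self.tables[self.selected]:
            raise LookupError(f"table '{self.selected}.{table}' doesn't exist")
        return f"{self.selected}.{table}"


class Handle:
    """Client side: caches the domain it believes the server has selected."""

    def __init__(self, server, domain):
        self.server = server
        self.server.use(domain)
        self.domain = domain

    def select_domain(self, domain, interrupt=False):
        self.server.use(domain)  # the server has already switched here
        if interrupt:
            raise Timeout()  # timer fires before the field is updated
        self.domain = domain


class SafeHandle(Handle):
    """Variant that marks the handle as 'on no DB' before dispatching USE."""

    def select_domain(self, domain, interrupt=False):
        self.domain = None  # neutral state until the switch is confirmed
        super().select_domain(domain, interrupt)


def query_as(handle, domain, table):
    """Reuse the handle, issuing USE only if the cached domain differs."""
    if handle.domain != domain:
        handle.select_domain(domain)
    return handle.server.select(table)


if __name__ == "__main__":
    tables = {"centralauth": {"globaluser"}, "metawiki": {"page"}}
    for cls in (Handle, SafeHandle):
        handle = cls(FakeServer(tables), "centralauth")
        try:
            handle.select_domain("metawiki", interrupt=True)
        except Timeout:
            pass  # caught and ignored, as a deferred-update runner might
        try:
            print(cls.__name__, "->", query_as(handle, "centralauth", "globaluser"))
        except LookupError as err:
            print(cls.__name__, "->", err)
```

With the plain `Handle`, the cached domain still says `centralauth` while the server is on `metawiki`, so the reuse check skips `USE` and the query fails. With `SafeHandle`, the cached domain is `None` after the interruption, which forces a fresh `USE` on reuse.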
[02:37:27] <Krinkle>	 I see. Even worse/better :D
[02:37:34] <Krinkle>	 it does complete but outside our view
[02:37:39] <Krinkle>	 that makes sense yeah
[02:43:17] <AaronSchulz>	 reminds me of an interrupt handler project I did in C for a uni project ;)
[02:44:08] <AaronSchulz>	 having flags/queues set with manual checkpoints is one thing, but this is more annoying
[02:44:58] <AaronSchulz>	 Krinkle: probably that kind of exception should not be catchable? Is that easy to do?
[02:46:17] <Krinkle>	 AaronSchulz: depends on what you mean - you mean for MW to not continue more generally after a timeout? e.g. no localised/skin navigable error page and/or no deferred updates?
[02:46:42] <Krinkle>	 the wmf config handler could exit() at the most extreme end of possible options
[02:50:42] <AaronSchulz>	 I guess the practical problem that we have is DeferredUpdates still being tried. If core had an interface for interrupting exceptions/timeouts, then mediawiki-config could use it to signal that things should just wrap up.
[02:52:41] <AaronSchulz>	 Krinkle: if the extension/config introduces this kind of async stack preemption, then it should cooperate to manage it IMO
[02:55:12] <AaronSchulz>	 rollbackMasterChanges() is decently robust (within reason)...maybe it can have some extra cleanup checks
[02:56:28] <Krinkle>	 AaronSchulz: perhaps a static method on MediaWiki.php class that excimer callback can call (conditionally, if defined by then) to signal that MediaWiki::main should treat it the same as it does when e.g. rendering error pages fails (nested/repeat error) and thus do a quicker shutdown
[02:56:59] <Krinkle>	 ah right, if we can catch/handle it within rdbms that'd be even better for this specific case
[02:57:11] <Krinkle>	 e.g. to disable the dbloadbalance server or smth like that 
[02:57:16] <Krinkle>	 service*
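One way to picture the "signal that things should just wrap up" idea is a process-wide flag that the timer callback sets before throwing, and that the deferred-update loop checks before starting each update. This is only a sketch of the shape of such an interface: `RequestState`, `on_timer_fired` and `run_deferred_updates` are hypothetical names, not existing MediaWiki or Excimer APIs.

```python
class RequestTimeout(Exception):
    """Stands in for the timeout exception thrown by the timer callback."""


class RequestState:
    """Hypothetical stand-in for a static hook on the entry-point class."""

    _wrap_up = False

    @classmethod
    def signal_wrap_up(cls):
        cls._wrap_up = True

    @classmethod
    def should_wrap_up(cls):
        return cls._wrap_up

    @classmethod
    def reset(cls):
        cls._wrap_up = False


def on_timer_fired():
    """What the timer callback would do: flag the request, then throw."""
    RequestState.signal_wrap_up()
    raise RequestTimeout("request exceeded its time limit")


def run_deferred_updates(updates):
    """Run (name, callable) pairs; return the names that completed."""
    completed = []
    for name, update in updates:
        if RequestState.should_wrap_up():
            break  # a timeout was signalled: skip the remaining updates
        try:
            update()
            completed.append(name)
        except RequestTimeout:
            # Still caught here, but the flag check above ends the loop.
            continue
    return completed


if __name__ == "__main__":
    updates = [("a", lambda: None), ("b", on_timer_fired), ("c", lambda: None)]
    print(run_deferred_updates(updates))
```

The design choice is cooperative: whoever introduces the asynchronous interruption sets the flag, and long-running loops check it at safe points instead of carrying on with possibly inconsistent state.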
[05:38:34] <Krinkle>	 in other news, the dep graph for RL now fits onto one page for me
[05:38:34] <Krinkle>	 https://doc.wikimedia.org/mediawiki-core/master/php/classResourceLoader.html
[05:38:43] <Krinkle>	 mainly due to the local Hooks container
[05:38:55] <Krinkle>	 (previously pulled in 500+ hook interfaces)
[06:01:49] <Krinkle>	 gilles: I recall something about navtiming from team meeting, did that get resolved?
[06:01:53] * Krinkle signs off for the day
[08:44:21] <gilles>	 Krinkle: yes, andrew removed the mediawiki-config schema version override
[08:53:37] <gilles>	 coal graphs not rendering anymore on https://performance.wikimedia.org/ ...
[08:54:02] <gilles>	 getting empty results through the API call, it seems
[08:54:16] <gilles>	 I'll file a task, I presume it's related to the Python 3 migration
[08:55:39] <gilles>	 I'll look into it later today
[12:42:40] <gilles>	 https://fairinternetreport.com/research/usa-vs-europe-internet-speed-analysis
[16:37:44] <gilles>	 AaronSchulz: did you see https://phabricator.wikimedia.org/T266502#6648828 ?
[18:35:46] <Krinkle>	 Amsterdamn, joke or typo?
[18:35:52] <Krinkle>	 anyhow, great article :)
[20:03:01] <AaronSchulz>	 gilles: I'll see how hard those are to just patch
[20:33:34] <AaronSchulz>	 dpifke: want to CR https://gerrit.wikimedia.org/r/c/mediawiki/core/+/641574/ ?
[21:30:43] <AaronSchulz>	 Krinkle: anything else for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/589465 ?