[00:06:51] Krinkle: also, are we really doing onhost for wancache soon?
[00:07:06] AaronSchulz: well, it's up to us I suppose.
[00:07:40] but yeah, once we've confirmed PC is working fine and maybe one more double check on the settings, I suppose we could start enabling it for functional testing e.g. on group0 wikis.
[00:11:25] Krinkle: any luck with https://phabricator.wikimedia.org/T268308 ?
[00:11:49] where we go from there, I'm not sure. we have wancache latency metrics already which are fairly low-level and presumably include any memc latency in a fairly visible way, e.g. Sqlblobstore miss latency. but we could do more. we could e.g. do a benchmark from CLI on an app server for 10 commonly used keys and for 10 absent keys, and hammer the local mcrouter with and without on-host prefix for those and see how they perform.
[00:12:05] what else should we consider?
[00:20:30] * AaronSchulz is just thinking about purges
[00:22:33] right, those will take up to 10 seconds to apply, not unlike the tolerance we have cross-dc
[00:22:40] although nowhere near as high of course
[00:23:08] two overlapping sliding windows, worst case scenario <=10s
[00:24:41] and as part of general tolerance toward db lag, on the pattern that we don't query cache when we can't query replica
[00:27:08] AaronSchulz: is there a kind of test, trial or study you think we should do as part of this? e.g. in isolation somewhere, or on a test wiki etc.
[00:42:38] AaronSchulz: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/643383/1/includes/libs/CSSMin.php
[00:57:19] Krinkle: not sure, it would need more MW coding for when to use the prefix
[00:57:58] * AaronSchulz is not comfortable with using that generally for wancache
[00:58:26] AaronSchulz: not sure I follow. I would assume we'd enable it for WANCache: v: only of course, if that isn't implemented yet (I thought we did) we'll need that of course.
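A rough sketch of the CLI benchmark floated at 00:11:49: time gets for 10 present and 10 absent keys, with and without a routing prefix. Everything named here is an assumption for illustration: `bench_gets`, `fake_get` and `ONHOST_PREFIX` are hypothetical, the prefix string is a placeholder rather than the production mcrouter route, and the dict-backed `fake_get` only stands in for a real memcached client pointed at the local mcrouter.

```python
import statistics
import time
from typing import Callable, Dict, Optional, Sequence


def bench_gets(
    get: Callable[[str], Optional[bytes]],
    keys: Sequence[str],
    prefix: str = "",
    rounds: int = 200,
) -> Dict[str, float]:
    """Time get(prefix + key) for every key, repeated `rounds` times."""
    samples_ms = []
    hits = 0
    for _ in range(rounds):
        for key in keys:
            start = time.perf_counter()
            value = get(prefix + key)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
            if value is not None:
                hits += 1
    samples_ms.sort()
    p99_index = min(len(samples_ms) - 1, int(len(samples_ms) * 0.99))
    return {
        "requests": len(samples_ms),
        "hit_ratio": hits / len(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p99_ms": samples_ms[p99_index],
    }


# Placeholder prefix (assumption): NOT the real on-host route.
ONHOST_PREFIX = "/local/onhost/"

# In-memory stand-in for the cache; swap fake_get for a real client call.
_store = {f"present:{i}": b"x" for i in range(10)}


def fake_get(key: str) -> Optional[bytes]:
    # Strip the hypothetical routing prefix, as a router would before lookup.
    if key.startswith(ONHOST_PREFIX):
        key = key[len(ONHOST_PREFIX):]
    return _store.get(key)


present = [f"present:{i}" for i in range(10)]
absent = [f"absent:{i}" for i in range(10)]

if __name__ == "__main__":
    for label, keys in (("present", present), ("absent", absent)):
        for prefix in ("", ONHOST_PREFIX):
            print(label, repr(prefix), bench_gets(fake_get, keys, prefix))
```

Comparing the four result rows (present/absent, prefixed/unprefixed) is the point; against the fake client the numbers are meaningless, so only the shape of the harness carries over.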
[00:58:53] not for the rest of raw memc bago and non-v wancache, that much is clear
[00:59:12] per-wiki can be done in wmf-config
[00:59:18] I mean even just for WANCache✌️
[00:59:42] heh
[01:01:51] I know that WANCache-v is the vast majority of all traffic, but I thought we had already covered that in meetings with SRE to be the plan. I don't recall any specific ways within the WANCache contract that it wouldn't work, but anyways, any edge cases or unknowns we want to test for should be written down at https://phabricator.wikimedia.org/T244340
[01:03:14] I think overall we already suffer from too many options, so to the extent possible, at least mid-long term I don't think we should make this opt-in or opt-out. But it depends of course on what issues might or might not happen otherwise. I think we can solve this generally as part of the contract, feeling optimistic :)
[01:05:15] the routing prefix would have to be used only for "v" keys by MW (prefixroute does not work since coalesceKeys uses suffixes)
[01:05:36] WANCache could do that, but it's not in the code now
[01:06:37] ah, right, the commit I was thinking of did it directly in BagO for parser cache.
[01:06:59] yeah, WAN does already have support for adding mcrouter prefixes in some cases, but not yet for this one
[01:45:12] ah, Roan already +2 CSSMin
[01:45:37] Yeah although the patch isn't working locally for me for some reason
[01:45:50] I'm still getting the weird https:/w/extensions/... stuff
[01:50:15] (while working on something unrelated I kept hitting this bug, my images aren't showing up)
[01:56:33] RoanKattouw maybe a caching thing?
[01:56:49] it's apcu cached by hash which I didn't bump
[01:57:45] Oh I had the image path wrong lol
[01:57:48] Needed another ../
[01:59:53] looks like the apcu / MemoizedCallable is no longer used where I thought it was
[02:00:04] only in WikiModule now
[02:00:17] I guess it wasn't worth the overhead anymore 🤷
[02:02:54] * AaronSchulz finally gets https://phabricator.wikimedia.org/T193565
[02:04:58] in theory, a similar thing could affect lots of random stuff...basically a Two Generals problem
[02:06:00] I've put my money on wikibase somehow causing this
[02:11:51] although it seems that apart from the installer, there is literally nothing in prod using changeDomain/selectDB/ or otherwise USE queries
[02:11:52] https://codesearch.wmcloud.org/deployed/?q=(%5CbUSE%5Cb%20%5B%5EOI%5D%7CselectDomain%7CselectDB%7CdoSelectDomain)&i=nope&files=php%24&repos=
[02:12:06] I think ExcimerTimer can fire the WMFTimeoutException callback at very inconvenient places...e.g. after "USE db" finishes but before Database updates the domain field
[02:12:15] so that kind of suggests it has got to be an rdbms bug or external corruption in PHP or TCP
[02:13:35] Hm.. yeah, execution continuing after a timeout could leave state in weird ways
[02:14:21] I don't recall if we see this e.g. from deferred updates after a high-level try-catch rendering an error page
[02:14:31] the last few on the task though seem to be from regular pre-send code
[02:15:31] it's from deferredupdates, which currently catches the timeout and keeps going with other updates
[02:15:32] and if it's an issue local to a specific db connection, then presumably it should be able to find any other db on the same host during the point where this exception happens?
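A toy model of the race described at 02:12:06, where the timeout fires after the server has applied "USE db" but before the client records the new domain. It is only a sketch under stated assumptions: `FakeServer`, `Handle`, `TimeoutInterrupt` and `run_scenario` are hypothetical names, not rdbms classes, and the `guard` flag models one possible mitigation (marking the handle as being on no known database before dispatching USE), not existing MediaWiki behaviour.

```python
from typing import Optional


class TimeoutInterrupt(Exception):
    """Stands in for an asynchronous timeout such as WMFTimeoutException."""


class FakeServer:
    """Toy server: applies USE, then optionally lets the 'timer' fire."""

    def __init__(self, current_db: str) -> None:
        self.current_db = current_db
        self.fire_after_use = False

    def use(self, db: str) -> None:
        self.current_db = db  # the server-side switch completes...
        if self.fire_after_use:
            # ...but the client is interrupted before it can record it.
            raise TimeoutInterrupt()


class Handle:
    """Client-side connection handle that tracks which DB it believes is selected."""

    def __init__(self, server: FakeServer, guard: bool) -> None:
        self.server = server
        self.guard = guard
        self.domain: Optional[str] = server.current_db

    def select_domain(self, db: str) -> None:
        if self.guard:
            # Neutral state (assumed mitigation): not known to be on any DB.
            self.domain = None
        self.server.use(db)
        self.domain = db  # never reached when the interrupt fires

    def usable_for(self, db: str) -> bool:
        # A pool would hand this handle out for `db` only if this is True.
        return self.domain == db


def run_scenario(guard: bool) -> Handle:
    server = FakeServer("centralauth")
    handle = Handle(server, guard)
    server.fire_after_use = True
    try:
        handle.select_domain("metawiki")
    except TimeoutInterrupt:
        pass  # a caller that catches the timeout and keeps going
    return handle
```

Without the guard the handle still claims "centralauth" while the server is on "metawiki", so a later centralauth query would be sent to the wrong database. With the guard the handle reports no domain at all, which forces a fresh select or a discard before reuse.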
[02:16:14] oh, I misread your phab comment
[02:16:25] that's not from the error the task is about but about the correlated timeout
[02:16:27] I see
[02:17:05] yeah, the ones where the query fails seem to be from deferreds
[02:17:58] maybe it needs some kind of neutral state where we say before dispatching the USE that the dbcon is now unusable / presumed dead / not on any DB.
[02:19:31] run() still calls doPostOutputShutdown() after catching stuff
[02:19:43] AaronSchulz: to confirm you suspect that pre-send something is connecting to s7 to a non-centralauth table, then something else pre-send is reusing that connection and selecting from centralauth, times out before the USE completes, and then post-send something uses that connection a third time, with a second centralauth query, and we find we already selected it so we just use it again?
[02:19:45] Krinkle: something like that, yes
[02:19:57] lots of things in theory could need that
[02:20:56] I think USE does finish, the thread wakes up, then the timer fires and doSelectDomain() is left incomplete
[02:21:24] if it thinks it failed, then it's still listed in LoadBalancer as free for use under the old domain
[02:22:54] centralauth -> metawiki -> misused as centralauth again
[02:23:11] $wgGlobalUserPageDBname is probably metawiki, right?
[02:24:21] Hence, << table 'metawiki.globaluser' doesn't exist >> since it was meant to be on a handle with centralauth selected (centralauth.globaluser)
[02:37:27] I see. Even worse/better :D
[02:37:34] it does complete but outside our view
[02:37:39] that makes sense yeah
[02:43:17] reminds me of an interrupt handler project I did in C for a uni project ;)
[02:44:08] having flags/queues set with manual checkpoints is one thing, but this is more annoying
[02:44:58] Krinkle: probably that kind of exception should not be catchable? Is that easy to do?
[02:46:17] AaronSchulz: depends on what you mean - you mean for MW to not continue more generally after a timeout? e.g. no localised/skin navigable error page and/or no deferred updates?
[02:46:42] the wmf config handler could exit() at the most extreme end of possible options
[02:50:42] I guess the practical problem we have is DeferredUpdates still being tried. If core had an interface for interrupting exceptions/timeouts, then mediawiki-config could use it to signal that things should just wrap up.
[02:52:41] Krinkle: if the extension/config introduces this kind of async stack preemption, then it should cooperate to manage it IMO
[02:55:12] rollbackMasterChanges() is decently robust (within reason)...maybe it can have some extra cleanup checks
[02:56:28] AaronSchulz: perhaps a static method on the MediaWiki.php class that the excimer callback can call (conditionally, if defined by then) to signal that MediaWiki::main should treat it the same as it does when e.g. rendering error pages fails (nested/repeat error) and thus do a quicker shutdown
[02:56:59] ah right, if we can catch/handle it within rdbms that'd be even better for this specific case
[02:57:11] e.g. to disable the dbloadbalance server or smth like that
[02:57:16] service*
[05:38:34] in other news, the dep graph for RL now fits onto one page for me
[05:38:34] https://doc.wikimedia.org/mediawiki-core/master/php/classResourceLoader.html
[05:38:43] mainly due to the local Hooks container
[05:38:55] (previously pulled in 500+ hook interfaces)
[06:01:49] gilles: I recall something about navtiming from team meeting, did that get resolved?
[06:01:53] * Krinkle signs off for the day
[08:44:21] Krinkle: yes, andrew removed the mediawiki-config schema version override
[08:53:37] coal graphs not rendering anymore on https://performance.wikimedia.org/ ...
[08:54:02] getting empty results through the API call, it seems
[08:54:16] I'll file a task, I presume it's related to the Python 3 migration
[08:55:39] I'll look into it later today
[12:42:40] https://fairinternetreport.com/research/usa-vs-europe-internet-speed-analysis
[16:37:44] AaronSchulz: did you see https://phabricator.wikimedia.org/T266502#6648828 ?
[18:35:46] Amsterdamn, joke or typo?
[18:35:52] anyhow, great article :)
[20:03:01] gilles: I'll see how hard those are to just patch
[20:33:34] dpifke: want to CR https://gerrit.wikimedia.org/r/c/mediawiki/core/+/641574/ ?
[21:30:43] Krinkle: anything else for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/589465 ?
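A minimal sketch of the "signal that things should just wrap up" idea from the 02:50:42 to 02:56:28 exchange earlier in this log: a flag that a timeout callback could set, which the deferred-update runner checks before starting each remaining update. All names (`ShutdownSignal`, `run_deferred`) are hypothetical and none of this is MediaWiki's actual interface; the catch-and-continue in the loop only mirrors the behaviour described in the log.

```python
from typing import Callable, List, Sequence, Tuple


class ShutdownSignal:
    """Flag that a timeout callback could set to request a quick shutdown (hypothetical)."""

    def __init__(self) -> None:
        self.requested = False

    def request(self) -> None:
        self.requested = True


def run_deferred(
    updates: Sequence[Tuple[str, Callable[[], None]]],
    signal: ShutdownSignal,
) -> List[str]:
    """Run queued updates in order; stop starting new ones once shutdown is requested."""
    attempted = []
    for name, update in updates:
        if signal.requested:
            break  # wrap up instead of trying the remaining updates
        attempted.append(name)
        try:
            update()
        except Exception:
            # Catch-and-continue, as the log describes for deferred updates.
            pass
    return attempted


def demo() -> List[str]:
    signal = ShutdownSignal()

    def ok() -> None:
        pass

    def times_out() -> None:
        # Simulates the timer callback firing in the middle of an update.
        signal.request()
        raise RuntimeError("simulated timeout")

    return run_deferred([("a", ok), ("b", times_out), ("c", ok)], signal)
```

In `demo()` the third update is never started: the timeout in "b" is still caught, but the flag keeps the runner from carrying on with "c" on state that may no longer be consistent.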