[03:04:09] Krinkle: wow, that is awesome
[03:06:32] ori: we've got a number of unresolved regressions in terms of networking, tcp, and some html/css transfer issues; but the JS side of things has gotten significantly lighter and more optimised, in no small part due to browsers and js engines getting better I'm sure
[03:09:23] I haven't tried to correlate with this graph specifically, but first paint regressions of 2019 and 2020 correlate with T170567 and T264398 last I checked
[03:09:23] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398
[03:09:23] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567
[03:09:38] but it's good to know that at least at the median, it seems overall we're actually still better year over year
[03:10:02] is it possible to do the same graph for p95?
[03:10:04] but then again, that also ignores 50% of the data :)
[03:10:19] I think that's too noisy to do stacked, last time I tried that it didn't work very well.
[03:10:26] ah
[03:10:27] one minute is not enough
[03:10:40] for RUM data, p75 is the highest usable one generally
[03:10:48] and the default now for most of our panels and alerts
[03:11:16] once we're on prometheus we'll be able to use histograms and get estimated percentiles in a more meaningful way over larger periods of time with buckets etc.
[03:11:32] (yeah, we still haven't done that..)
[03:13:21] it takes some getting used to, letting go of exact numbers (with buckets the "percentile" people get in grafana is really just an estimate based on two bucket populations, kinda fake and staggered; better to use percentages, e.g. rather than trying to guess where the p95 is, you instead say with confidence that over X amount of time, we have Y % of data within the Z ms threshold)
[03:13:22] https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1
[03:13:39] that's where I've played with histograms for our backend timing, which is on prometheus now
[03:13:47] approaches it from a different perspective
[03:14:41] you made all these?
[03:15:07] this is so fucking cool
[03:16:21] yep :) - inspired by how chrome and archive.org approach their perf buckets on big data
[03:16:44] once I looked into why they do it so weirdly and don't get timing numbers out of it but percentages, a light went off
[03:17:26] and it also connects very well with how prometheus works etc, very aggregate-friendly and you don't lose any precision over time, other than the precision you lose in the first place from defining buckets etc.
[03:17:40] it's been 2y since I wrote this: https://timotijhof.net/posts/2018/measuring-wikipedia-page-load-times/
[03:17:47] but would be nice to get part 2 out there for navtiming
[03:21:55] you can see how effective this approach is in detecting small but significant changes in last week's false negative, when we temporarily switched some services from eqiad to codfw: https://phabricator.wikimedia.org/T278274#6939988 https://phabricator.wikimedia.org/F34183703
[03:22:13] the "quantiles" (aka percentile guesstimates) didn't really look like they changed much
[03:25:22] yes, this is brilliant, and a lot of it is new to me
[03:28:42] basically rather than trying to improve/reduce the value at the Nth percentile, you increase the % of stuff in one of your target buckets. has the benefit of also playing well with being budget/goal-oriented. And rather than e.g. having p25/p50/p75, you'd e.g. look at % of <100ms, <250ms and <1s or something like that.
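
To make the bucket-versus-percentile idea above concrete, here is a minimal sketch in Python. The bucket bounds, counts, and function names are invented for illustration, and estimated_percentile only mimics the spirit of interpolating a quantile from bucket populations (as a Grafana/Prometheus dashboard would); none of this is Wikimedia's actual dashboard code.

# A minimal sketch of the "percentage within threshold" view of a cumulative
# histogram, next to the kind of interpolated percentile a dashboard would show.
# Bucket bounds and counts are made up for illustration.

BUCKETS = [100, 250, 1000, float("inf")]  # upper bounds in ms (like Prometheus "le" labels)
COUNTS = [8200, 1100, 550, 150]           # observations that fell into each bucket


def pct_within(threshold_ms):
    """Share of observations at or below threshold_ms (a bucket bound)."""
    within = sum(c for bound, c in zip(BUCKETS, COUNTS) if bound <= threshold_ms)
    return 100.0 * within / sum(COUNTS)


def estimated_percentile(p):
    """Percentile estimated by linear interpolation inside one bucket;
    an estimate derived from bucket populations, not a measured value."""
    target = p / 100.0 * sum(COUNTS)
    seen, lower = 0, 0.0
    for bound, count in zip(BUCKETS, COUNTS):
        if seen + count >= target:
            return lower + (target - seen) / count * (bound - lower)
        seen += count
        lower = bound
    return BUCKETS[-1]


if __name__ == "__main__":
    print(f"{pct_within(250):.1f}% of requests were <= 250 ms")   # goal-oriented view
    print(f"estimated p95 ~= {estimated_percentile(95):.0f} ms")  # bucket-based guess

Run as a script, it prints the goal-oriented statement (93.0% of these made-up requests were <= 250 ms) next to a p95 that is only as precise as the bucket boundaries allow, which is the trade-off described above.
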
[03:28:42] those were later realizations however, my main drive was to get to a place where I can plot an hour/day/month and have it still be meaningful, which averages of minutely percentiles are not (or so I was told by analytics people after they patiently explained it to me)
[03:29:41] I haven't quite figured out how to make heatmaps useful. what I got there seems accurate, but it doesn't feel useful.
[03:30:06] my "old" approach is to plot percentiles together: https://grafana.wikimedia.org/d/000000085/save-timing?viewPanel=15&orgId=1
[03:31:19] which is a very crude approximation that I improved years ago, and still feels useful. I'm not sure what the equivalent of that would be with buckets. in theory heatmaps, but I'm not sure they're clear enough visually. oh well :)
[03:32:05] I've been staring at the wall time flame-graphs a lot lately
[03:32:34] which reminds me, do we know why DatabaseMysqli::mysqlConnect is 1-1.5% of index.php wall-time? that seems unreasonable
[03:33:41] is mediawiki using persistent connections or a pool?
[03:34:13] ori: nope
[03:35:47] I assume the reasons why are complex
[03:36:16] client-side load balancing based on statically configured weights of replicas and query groups (https://noc.wikimedia.org/dbconfig/eqiad.json - used to be wmf-config db.php, now in etcd), combined at runtime with current load and health, periodically checked and stored in a local apcu cache key.
[03:37:11] persistent connections, I'm told but don't know very well, are historically buggy in php (redis, memc and mysql alike); generally "avoid" territory it seems
[03:37:31] direct sharing is also tricky because of slow queries and transaction snapshots
[03:37:39] but pooling could in theory work
[03:37:55] so long as it is abstracted at the level of a specific replica, so MW still decides which one it wants each time.
[03:38:16] so that means every app server would have a pool of N connections for every replica of every db section.
[03:39:19] if it is true that under current load we'd have about N open at any given time anyway, that might be fine.
[03:41:18] and it'd need to be transparent to MW and somehow hand off to the pool and claim etc in a way that's fault tolerant and can "boost" out of it as needed, I think? There are various db proxies off the shelf that probably could be used for this. I know we've been thinking for a while about something like dbproxy, Vitess or HAProxy even.
[03:41:29] so far it's been pushed back due to lower prioritization etc
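
A rough sketch of what that per-replica pooling could look like, with the application still doing its own weighted replica selection on top. The replica names, weights, FakeConnection class, and pool size below are invented for illustration; this is not MediaWiki's LoadBalancer, nor any particular off-the-shelf proxy.

# Sketch of per-replica connection pooling underneath client-side, weighted
# replica selection. All names and numbers here are hypothetical.
import random
from collections import defaultdict
from contextlib import contextmanager

# Statically configured weights per replica (in the spirit of dbconfig/eqiad.json).
REPLICA_WEIGHTS = {"db1101": 300, "db1102": 200, "db1103": 50}


def pick_replica(weights=REPLICA_WEIGHTS):
    """Weighted random choice, standing in for the client-side load balancing."""
    hosts, w = zip(*weights.items())
    return random.choices(hosts, weights=w, k=1)[0]


class FakeConnection:
    """Stand-in for a MySQL connection handle."""

    def __init__(self, host):
        self.host = host


class PerReplicaPool:
    """Keeps up to max_idle open connections per replica host."""

    def __init__(self, max_idle=5):
        self.max_idle = max_idle
        self._idle = defaultdict(list)

    @contextmanager
    def connection(self, host):
        # Reuse an idle connection for this exact replica if one exists,
        # otherwise "open" a new one.
        conn = self._idle[host].pop() if self._idle[host] else FakeConnection(host)
        try:
            yield conn
        finally:
            if len(self._idle[host]) < self.max_idle:
                self._idle[host].append(conn)  # hand it back for the next request


if __name__ == "__main__":
    pool = PerReplicaPool()
    for _ in range(3):
        host = pick_replica()          # the app still chooses the replica
        with pool.connection(host) as conn:
            print(f"query would run on {conn.host}")

The point of keying the idle list by exact replica host is the constraint from 03:37:55: the caller keeps deciding which replica it talks to each time, and the pool only saves the connect cost for that same host.
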
[03:42:01] it came up again this year because we're blocked on MW using TLS for mysql between DCs before we can do active-active reads (e.g. for post-send master queries, and jobs/maintenance scripts)
[03:42:20] and we're thinking such a proxy with a pool could help mask the added latency
[03:42:49] it's unresolved as of yet :)
[03:43:20] https://people.wikimedia.org/~ori/wall_time_apr_2021.txt
[03:43:44] s/dbproxy/ProxySQL
[03:43:51] https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Active-active_MediaWiki#MariaDB_cross-datacenter_secure_writes
[03:45:11] I have more changes for cutting down Lua time, and some vague intention of looking at FST at some point
[03:45:20] http 10%
[03:45:39] I assume some of that is Elastic and session store and other "meaningful" read queries over http
[03:45:58] but probably also some of that is eventbus, which should be mostly pre/post-send
[03:46:06] might be possible to filter out the post-send category
[03:46:45] anything that descends from doPostOutputShutdown
[03:46:49] https://performance.wikimedia.org/arclamp/svgs/daily/2021-04-04.excimer-wall.all.fn-PostSend.svgz
[03:46:50] oh yeah, that's a good idea
[03:47:04] it has a dedicated graph now as well thx to aaron
[03:47:38] since it mostly had its details cropped off in the general ones due to minwidth to keep the SVGs reasonably small
[03:47:49] same for save timing, has a dedicated EditAction graph now
[03:48:49] might also filter out JobRunner perhaps, but depends on what we care about of course
[03:49:12] RunSingleJob.php
[03:49:47] * ori re-runs report
[03:52:22] https://people.wikimedia.org/~ori/pre_send_apr_1.txt < this excludes stacks w/ JobRunner or doPostOutputShutdown
[03:53:31] FST::* is 1.67%, that seems low-hanging
[03:59:38] Google matches hours spent volunteering for a nonprofit with a monetary donation, so the WMF actually gets paid when I stare at this stuff :)
[04:00:37] nice
[04:01:08] MultiHttpClient::runMultiCurl went down from 10% to 3%
[04:01:21] mysqlConnect up by 0.01%
[04:01:47] makes sense I suppose since it's generally paid for pre-send
[04:01:57] was expecting a bigger difference perhaps
[04:02:12] can it be paid pre-request
[04:02:17] that would be nice
[04:02:19] :)
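
The exclusion step described around 03:46-03:52 (dropping any stack that passes through doPostOutputShutdown or RunSingleJob.php before computing per-function shares) could be sketched roughly as below. The frame names come from the chat, but the folded-stack input format and everything else is an assumption for illustration, not the actual script behind those reports.

# Sketch: filter folded flame-graph stacks ("frameA;frameB;frameC count" per line),
# drop post-send and job-runner stacks, and report inclusive percentages.
import sys
from collections import Counter

EXCLUDE = ("doPostOutputShutdown", "RunSingleJob.php")


def load_stacks(lines):
    """Yield (frames, count) from folded-stack lines, skipping excluded stacks."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stack, count = line.rsplit(" ", 1)
        # Substring match, so e.g. "MediaWiki::doPostOutputShutdown" also matches.
        if any(name in stack for name in EXCLUDE):
            continue
        yield stack.split(";"), int(count)


def frame_shares(stacks):
    """Inclusive share of kept samples per frame, as a percentage."""
    per_frame = Counter()
    total = 0
    for frames, count in stacks:
        total += count
        for frame in set(frames):  # count each frame at most once per stack
            per_frame[frame] += count
    if not total:
        return {}, 0
    return {f: 100.0 * c / total for f, c in per_frame.items()}, total


if __name__ == "__main__":
    shares, kept = frame_shares(load_stacks(sys.stdin))
    print(f"kept {kept} samples after filtering")
    for frame, pct in sorted(shares.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{pct:5.2f}%  {frame}")

Feed it a folded-stack file on stdin and it prints the top inclusive percentages among the stacks that remain after filtering, which is the same shape of number as the FST::* 1.67% figure quoted above.
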