[07:13:05] Krinkle: yeah, I haven't started yet but I'm thinking about writing about the device lab.
[11:31:46] Krinkle: yeah, for index.php
[14:37:54] https://techblog.wikimedia.org/2020/11/23/web-performance-case-study-wikipedia-page-previews/
[15:12:38] quick question: have you ever worked with https://github.com/Netflix/flamescope? It seems interesting and I've been thinking of how Excimer could be used to produce output that this tool understands. Looks like we'd have to produce nflxprofile formatted messages, but the nflxprofile package seems like a classic Netflix OSS repo without any docs, so I'd first need to reverse engineer what goes into each field :/
[15:13:39] The primary usefulness of this tool seems to be that it could produce visualizations for smaller outages or blips that'd otherwise be smoothed out in the current flamegraphs
[17:03:14] mszabo: hm i wonder how that plays with sample data and significance, and monitoring
[17:03:37] Presumably you can't really query or alert on such a thing
[17:04:49] We would (and do) use specific time spans for that, stored in Prometheus as a histogram with buckets. We alert on that. And then investigate based on Grafana dashes and hourly/daily flame graphs
[17:15:34] yeah, I guess the hourly interval is probably small enough for disruptions to stand out in the flamegraph for that interval
[17:16:29] since it's easy to find/alert on response time fluctuations (using metrics in the way you described), but finding out the reason is the hard part :D
[18:18:26] mszabo: to be clear, I think that tool looks awesome, but yeah looks like we might have enough of it covered in other ways for now
[18:19:03] which reminds me, we need to finish the prometheus migration for mw core metrics
[18:19:23] (away from statsd, to fully embrace the flat structure, tags, and histogram buckets)
[18:49:38] it's one of dave's goals this quarter
[18:56:42] for MW?
[18:57:07] I mean https://phabricator.wikimedia.org/T240685 and subtasks
[18:57:29] ah, no, navtiming
[20:17:33] AaronSchulz: in terms of rdbms read-only for several minutes, where are we on that? did you find the root cause of that one?
[20:26:45] Krinkle: I think more testing/info is needed
[22:31:42] AaronSchulz: made this in reply to some questions from serviceops - https://commons.wikimedia.org/wiki/File:Wikipedia_Memcached_flow_2020.png#/media/File:Wikipedia_Memcached_flow_2020.png
[22:31:57] if that looks alright, I'll add it to the wikitech doc pages
[22:38:51] * AaronSchulz looking
[22:43:59] Krinkle: some of the arrows seem to make things seem layered (SqlBlobStore -> Title -> MessageCache). Also, rdbms does not use localclustercache afaik (it does use localservercache).