[00:01:26] gilles: anyway, I'm @timotijhof:matrix.org if you want to make me moderator. [00:47:50] Krinkle: is there a window for the coalesce keys thing? [00:49:49] AaronSchulz: there is not [01:00:32] AaronSchulz: do you agree a single master for all would work fine? https://phabricator.wikimedia.org/T269324#6794656 [01:00:42] AaronSchulz: if there is no other deployment slot, fine to do whenever [01:00:45] going offline in an hour though [01:08:48] Krinkle: you mean a single shard/dataset? Sure. If you mean one master out of 6 servers, that would be codfw would be read-only, which is not what I want. [01:11:09] AaronSchulz: indeed, 1/3 per dc. [01:11:37] so we don't need to thnkm about hashing or global vs some set of wikis [01:11:58] it does mean the other two aren't really used normally, but I guess that's just the cost of doing business. [01:12:12] and warm-ish standby is of course neat [01:12:19] and simplifies failover dramatically [01:23:09] right [07:42:12] XML Parsing Error: not well-formed [07:42:12] Location: https://performance.wikimedia.org/arclamp/svgs/hourly/2021-02-02_06.excimer-buster.api.svgz [07:42:12] Line Number 2012, Column 23: [07:42:12] class@anonymous [07:42:38] <legoktm> is that known? Firefox won't load the SVG [08:35:17] <godog> FYI webperf_arclamp is reported by prometheus as down since 8h, meaning prometheus can't scrape the metrics [08:35:21] <godog> https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1 [09:46:02] <gilles> https://twitter.com/simonhearne/status/1356210683646521344 [14:41:16] <gilles> the TechSoup/AWS grant has already been processed, I applied the $2000 AWS credit to our account [16:05:41] <dpifke> godog: Prometheus error is T273565. Fix is coming later today. [16:05:41] <stashbot> T273565: Invalid /metrics output after adding new pipeline - https://phabricator.wikimedia.org/T273565 [16:06:57] <dpifke> legoktm: Interesting, Chromium complains that it's an errant null: `error on line 2012 at column 23: Char 0x0 out of allowed range` [16:07:19] <godog> dpifke: ah! sweet, thank you for following up [16:08:12] <dpifke> I had thought it would fix itself after a few SVG runs. The real fix is pretty easy. [16:10:35] <dpifke> Unparseable SVG frame is `class@anonymous\0/srv/mediawiki/php-1.36.0-wmf.27/extensions/CirrusSearch/includes/CirrusSearch.php0x7fbdf64f0551::parse` [16:11:37] <dpifke> Since that's a C extension, I think it's probably an off-by-one bug, which has my buffer overflow spidey sense tingling. [16:12:20] <dpifke> We can add escaping to `flamegraph.pl` but I'm also going to take a look at why that's showing up. [16:16:22] <dpifke> Nm, that's pure PHP. So the embedded NUL is probably an API quirk we're supposed to work around and/or a bug in Excimer. [16:34:02] <dpifke> Filed https://phabricator.wikimedia.org/T273640 to track. [17:31:40] <Krinkle> legoktm: dpifke afaik the null comes from some wikidata code meaning with unnamed classes. It broke logstash as well. Not excimer related, and unfortunately not due to trimming or off by one, if it's the same issue as last year, I think? [17:32:16] <Krinkle> I'll try to find it. I guess whatever we did broke again [17:43:42] <Krinkle> Depending on where the cause is, that'll inform which layer should defend against it for the future [18:21:25] <dpifke> We have some special handling already for closures: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/php/excimer/+/refs/heads/master/excimer_log.c#319 [18:25:45] <dpifke> For anonymous classes, my thinking is we do something similar. And then also add a check when writing arclamp-log to escape any non-printable characters which slip through in the future. [19:19:32] <Krinkle> dpifke: https://phabricator.wikimedia.org/T245295 [19:20:43] <Krinkle> I don't see the null mentioned there [19:20:47] <Krinkle> might be a different task I'm thinking of [19:20:51] * Krinkle keeps looking [19:38:52] <legoktm> I would think that excimer shouldn't be outputting the NUL [19:45:02] <Krinkle> yeah, that's fair, but I'm unsure where it's coming from. [19:45:12] <Krinkle> it's possible that we might be injecting it from PHP by doing something strange [19:45:17] <Krinkle> I vaguely recall that happening beefore [19:45:24] <Krinkle> which caused various logstash entries to be cut off after that point [19:45:30] <Krinkle> but I can't find it anymore. [19:45:56] <Krinkle> it too involved class@anonymous, which is the name PHP internally uses for classes created using inline 'new class {}' constructs. [19:50:04] <Krinkle> addshore: do you remember this? ^ [19:57:10] <addshore> *reads up a bit* [19:57:58] <addshore> yes, we had some class defined in a class (twas an exception), and I think that showed up something like that? [19:58:11] <Krinkle> addshore: a while ago we were getting logstash entries where the stacktrace was just "class@anonymous" and nothing else, a null byte afaik caused things to get cut off. where did the null come from? [19:58:49] <Krinkle> I see Wikibase/SimpleBag was fixed to not concatenate the class name into cache keys anymore, and to not use 'new class', but did we ever find where the null came from and/or fixed that? [19:59:30] <addshore> https://phabricator.wikimedia.org/T245295 [20:00:38] * addshore is half distracted at the moment but will try to follow along [20:11:43] <addshore> thats the only task i remember though [20:12:29] <Krinkle> yeah, but I don't see any mention there of cut off traces or null bytes. [20:14:37] <Krinkle> https://bugs.php.net/bug.php?id=74999 [20:14:41] <Krinkle> OK, so it is upstream php [20:14:54] <Krinkle> for reasons unknown to me, there is actually a null byte by default in the class@anonymous string [20:15:08] <Krinkle> I don't know where we recorded that finding [20:15:29] <Krinkle> but I guess we can't really fix that, other than to have avoided it in Wikibase by not using 'new class {}' [20:17:42] <legoktm> but now CirrusSearch is using it [20:18:07] <Krinkle> right [20:18:19] <Krinkle> but I guess it wouldn't be that weird to filter it out in excimer then, since PHP's api is exposing it there [20:18:26] <Krinkle> and at least for now that is known and perhaps even intentional [20:19:31] <Krinkle> or if we want excimer to remain true to PHP's weird traces (seems fair), then arclamp client in wmf-config would be a good barrier since beyond that it is meant to be PHP/MW/WMF-agnostic [20:20:19] <Krinkle> but it's kinda asking for trouble to expose it in excimer, and I can't imagine it being useful to anyone [20:20:26] <Krinkle> maybe we can find out why upstream did that :D [20:20:35] <Krinkle> since it seems to breka their own trace formatting as well [20:20:39] <Krinkle> e.g. exceptino object toString [20:20:46] <Krinkle> which is why logstash records break as well [20:22:46] <legoktm> I remember Tim saying that he wanted excimer to be actually easy to use, and I think stripping null bytes qualifies >.< [20:23:02] <legoktm> but fixing it in arclamp would be faster than deploying new excimer [20:23:59] <legoktm> for reference: https://codesearch.wmcloud.org/deployed/?q=new%20class%5C(&i=nope&files=%5C.php%24&excludeFiles=&repos= [20:35:42] <dpifke> Yeah, if third parties are using Excimer, they're likely going to get bit by this as well. [20:57:55] <dpifke> I tried rendering the same log using Speedscope and it handles the NUL without issue: https://www.speedscope.app/#profileURL=https%3A%2F%2Fperformance.wikimedia.org%2Farclamp%2Flogs%2Fhourly%2F2021-02-02_06.excimer-buster.api.log [20:58:54] <dpifke> Which kinda now makes me think the place to fix it is in flamegraph.pl. [21:06:52] <Krinkle> dpifke: yeah, it certainly would be imho justifiable defense in depth to strip it there as it won't have any purpose in an SVG. Replacing with "?" would ensure visibility and reduce any theoretical wrong merges [21:07:24] <Krinkle> Upstream FlameGraphs could argue for garbage in-out. Up to them I suppose [21:07:41] <Krinkle> So yeah flame pl and excimer seem like the right places for now [21:10:00] <legoktm> I just learned about http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html which seems exactly what we're looking for [21:10:33] <legoktm> I got it working locally, I'll generate some once the daily logs are available [21:11:40] <dpifke> Cool! [21:13:54] <Krinkle> legoktm: https://phabricator.wikimedia.org/T120686#4463096 [21:14:02] <Krinkle> I may or may not agree with my past self [21:14:08] <Krinkle> I'd forgotten all about that [21:15:44] <Krinkle> I didn't find it more useful than just opening to regular ones side by side and going bottom up looking for notable things that differ [21:15:57] <Krinkle> Note that we also have reverse flame graphs deployed [21:16:56] <Krinkle> Which can help identify lower level issues that are fragmented all over. Eg apcu, memc, or db becoming slower would be easier to see in reverse where all leafs are added up [21:18:26] <Krinkle> That's actually how we found that php7 apcu access is much much slower than hhvm (because it's doing a recursive copy instead of direct reuse, made possible since hhvm didn't do garbage collection and did some additional ref counting for ongoing reqs) [21:35:26] <legoktm> ooh gotcha