[08:52:57] Amir1: :D
[08:53:12] I think it has something to do with our prioritization process ;)
[08:53:30] shouldn't we both enjoy our holiday?
[08:53:35] it's raining
[08:53:46] no enjoyment allowed
[08:53:47] addshore: have you seen this? https://phabricator.wikimedia.org/T234002#7084538
[08:54:22] oh interesting, no
[08:54:27] I wonder what makes them slow
[08:54:39] trying to understand now, nothing obvious so far
[08:55:01] it doesn't look like that ticket was tied to my ticket, where I seem to pin it down to VM memory stuff
[08:55:39] I don't think it has anything to do with any of the tests really
[08:56:36] https://phabricator.wikimedia.org/T281122#7035294
[08:58:33] not 100% sure tbh
[08:59:07] I'm sure I could look at more logs today that would show this same correlation, to convince people ;)
[09:00:55] Running the same set of tests on less stressed hardware is much faster, even using the exact same Quibble container etc.
[09:01:04] the only difference is the hardware / virtualization
[09:01:26] then eventually I found that whenever these bridge tests would time out, this memory compaction would be happening
[09:01:44] it seems to just cause everything to grind to a halt for a period of time
[09:04:42] this is not the case for TCC AFAICS
[09:04:50] it's just slow all the time
[09:04:57] got a link to a run for TCC in CI?
[09:05:05] https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/94431/consoleFull
[09:06:03] aah, for TCC it's probably just the setup of the tests, looking at how many api.php calls there are
[09:07:00] but even in that test I see this
[09:07:03] 09:08:32 INFO:backend.PhpWebserver:[Thu May 13 08:08:32 2021] 127.0.0.1:56722 [200]: //api.php?format=json
[09:07:03] 09:08:44 INFO:backend.PhpWebserver:[Thu May 13 08:08:32 2021] 127.0.0.1:56724 [200]: //api.php?format=json
[09:07:13] 12-second gap between API logs
[09:07:25] *goes to look at memory compaction events*
[09:08:12] The Grafana dashboard is gone D:
[09:17:08] I'm still not fully convinced about why TCC takes more time than core and Wikibase (including data-bridge, tainted-refs, client, repo, etc.) combined in every test run
[09:17:59] I mean, it does have 40 tests
[09:18:16] and each test does something like 17 web requests
[09:18:45] I imagine if you compare the 680 total calls to the calls done in the Wikibase tests, that might have the answer?
[09:19:18] some of them seem to do even more calls
[09:19:29] yeah, that's too many API requests
[09:19:42] well, is it? or is our CI just slow? :P
[09:20:12] if it is testing something that needs to be tested, then it should be tested, and we ideally shouldn't cut corners
[09:21:15] I think this shouldn't be in the gate
[09:21:39] I think we should just run all our browser tests in GitHub Actions, make everything go much faster :P
[09:21:40] not that tests should be removed. I also think there is some work that can be done to reduce API calls in the tests
[09:22:04] but is the work worth the time? or should we just spend a little extra $ on the hardware running them? :P
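The memory-compaction correlation can be checked directly on an agent. A minimal sketch, assuming a Linux kernel built with CONFIG_COMPACTION so that /proc/vmstat exposes the compact_* counters; the host name is simply the one quoted below in the log.

    # Sample the compaction counters twice, a minute apart, while a browser-test
    # job is running; a large compact_stall/compact_fail delta would line up with
    # the multi-second gaps between api.php log entries.
    ssh integration-agent-docker-1007 '
      grep -E "^compact_(stall|fail|success)" /proc/vmstat
      sleep 60
      grep -E "^compact_(stall|fail|success)" /proc/vmstat
    '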
[09:29:41] I really want to know if 1007 has 8GB or 16GB or 24GB of memory
[09:29:52] this seems to be an interesting issue
[09:31:57] addshore@integration-agent-docker-1007:~$ free -mh
[09:31:57] total used free shared buff/cache available
[09:31:57] Mem: 23G 1.0G 18G 251M 4.4G 21G
[09:33:50] I really can't find this compact_fail vmstat metric now :(
[09:39:07] oh, maybe it doesn't even come from Graphite any more, maybe it is in Prometheus
[14:03:10] addshore: remember that fresnel uses the same CI workers and Quibble/Docker, and is doing hundreds of page loads quickly. And parser tests in under 2 min, given MySQL and tmpfs, both of which are much slower if I do them locally on e.g. docker dev
[14:03:26] I can reproduce slow core browser tests locally with docker dev and fresh
[14:03:42] Usually CI-specific issues would not happen locally like that
[14:04:02] Caveat: I last reproduced this locally in early 2020
[14:04:08] Lots of things have changed since
[14:05:08] hmm, okay, there are probably multiple things at play, but I certainly see the same pattern re memory issues in the CI logs for two col conflict
[14:06:23] Ack, I've not heard of that one before.
[14:07:12] I see it consistently happening at "random" times in any browser-related tests. Everything that has been attributed to "flakey" browser tests I'm 90% sure has a root cause of this issue
[20:02:02] Krinkle: re: T282761, I assume we've considered and rejected using table partitioning? That would make a purge as simple as dropping the old partition(s).
[20:02:03] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[20:05:28] I had to look it up, but the docs claim it works with replication so long as each replica uses the same partition scheme.
[20:51:41] dpifke: I don't think we've explicitly considered it as such. I do note that in terms of non-trivial quick fixes we haven't resourced anything, so the space hasn't really been opened up yet to think about what we could/should/would do.
[20:51:56] If we have a few days/weeks of engineering time, I think there are various options indeed.
[20:52:08] I do note that parsercache is not strictly append-only
[20:52:32] the keys are canonical in nature (with runtime-verified staleness in terms of propagated changes)
[20:52:40] so rows do get replaced/updated in-place under the same key.
[20:53:39] if a table were to represent a day, we'd have to know which table a key is in when selecting it, as well as add/remove upon update
[20:54:16] I know native partitions would remove the need for select complexity, but I guess in that case we'd need to add a new "last modified" field for the native partition to fragment by.
[20:54:44] My thought was to partition on expiry.
[20:55:01] ah, that makes sense
[20:55:09] Then we just drop the oldest partition to expire it.
[20:55:29] In terms of the mid-to-long term, we'll likely turn off replication in multi-DC and do mostly unthrottled purges as a local-DC script. No replicas to worry about, just master DB congestion.
[20:55:36] I think there would still be index updates involved, but a lot cheaper than deleting individual rows.
[20:56:55] also keeping in mind that the same service abstraction we use for this is also used by third parties and stock MW for a self-cleaning key-value cache table (as fallback when there is no Memc), where any key can have any expiry, and GC happens ad hoc every N key writes for up to N expired rows.
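A sketch of the bounded expired-row delete that this ad-hoc GC amounts to (the WMF cron then runs the same logic with the limit effectively removed). The database, table, and column names here (parsercache, pc_store, exptime) are placeholders, not the real schema.

    # Remove up to N expired rows; with an index on exptime this is a cheap,
    # bounded operation that can piggyback on every Nth key write.
    N=100
    mysql parsercache -e "
      DELETE FROM pc_store
      WHERE exptime < NOW()
      ORDER BY exptime
      LIMIT ${N};"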
[20:57:16] it's just that for WMF prod we then invoke the same logic from a daily cron with N=INF.
[20:57:54] and it's also rdbms-agnostic
[20:58:38] In that case we'd probably have to maintain both partitioned and non-partitioned setup/purge logic. Which isn't ideal, but the benefits might be worth it.
[20:59:41] I think once this doesn't need to support any form of replicas, we can mostly just run large `delete < exptime` batches without much throttling, and potentially more than once a day as well.
[20:59:50] Agree that turning off replication (or doing it lazily at a different level than the DB) might also have benefits.
[21:00:25] I believe the column is indexed already
[21:00:26] Yeah, if we still wanted replication, we could do something like have a script that copies the X most-used keys over and doesn't worry about replicating deletes.
[21:03:18] historically, one thing that's perhaps counter-intuitive is that parsercache does seem to be one of (or our only) caches where we found long-tail retention to be surprisingly useful. E.g. for most anything else (Memc, APC, Varnish, etc.) we're pretty aggressive on LRU and popularity, as well as in scaling vertically and shortening max retention. On parser cache, not so much. We've always raised it back up to 30 days and found measurable perf impact when dropping older stuff.
[21:03:50] Do we ever update the expiration, or does a row always live for 30 days after insertion?
[21:04:34] Because if the latter, that's an ideal case for a table partitioned by day, and then we can just do `ALTER TABLE foo DROP PARTITION foo_20210413;`
[21:04:44] we don't renew blobs by application requirement. A blob will either expire after 30 days and be deleted no sooner than that, or it will be replaced after an edit (or cascading template update) when a newer rendering is available.
[21:06:18] so a page not modified for 30 days should (must?) generally be re-parsed after 30 days, to ensure that any untracked dependencies are pruned after at most 30 days. E.g. things dependent on site configuration that are mostly static but not forever.
[21:06:54] (and between expiry and purging, we already ignore it per select, using a `where < expired` clause.)
[21:07:36] yeah, partitions could be a neat MySQL/large-farm optimization
[21:07:46] I assume on update, it'll move it as needed?
[21:08:17] Yes, although that makes updates (that change the column(s) used for partitioning) expensive.
[21:08:29] right
[21:08:36] pc is fairly write-heavy
[21:08:42] But if we're replacing the row anyways, not that much more expensive than currently.
[21:09:22] In Postgres, an update re-writes the entire row anyways, so in practice it doesn't matter. Not sure about MariaDB.
[21:10:10] right, but maintaining the partitions is presumably itself some form of cost/overhead that would need to happen on every insert or update.
[21:11:08] I don't think much (the database has to look up where to write the row anyways), but I don't know the MariaDB internals.
[21:11:47] right, I don't know if the "normal" storage logic is significantly cheaper than for a partition.
[21:12:00] I believe indexes do generally add notable overhead, e.g. having to maintain one vs. not.
[21:12:21] It's basically just saying "rows matching this expression belong in this physical storage area", which we do some form of anyways.
[21:12:40] right, it's not in addition to the normal storage.
[21:12:46] Where it might be more expensive is if other indexes (e.g. cache key) have to store extra data about which partition an entry is in.
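To make the partition-by-expiry idea concrete, a hypothetical sketch follows. The table and column names are again placeholders, and one real caveat applies: MySQL/MariaDB require every unique key, including the primary key, to contain the partitioning column, so the cache-key primary key would likely need to grow an exptime component.

    # Range-partition the cache table by expiry day (illustrative names only).
    mysql parsercache -e "
      ALTER TABLE pc_store
        PARTITION BY RANGE (TO_DAYS(exptime)) (
          PARTITION p20210412 VALUES LESS THAN (TO_DAYS('2021-04-13')),
          PARTITION p20210413 VALUES LESS THAN (TO_DAYS('2021-04-14')),
          PARTITION pmax      VALUES LESS THAN MAXVALUE
        );"
    # The daily purge then becomes a metadata operation rather than row-by-row deletes:
    mysql parsercache -e "ALTER TABLE pc_store DROP PARTITION p20210412;"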
[21:14:08] For the short-term mitigation (next week) I'm leaning towards bringing forward the musical-chairs sequence (for the compaction cycle, during which each shard will be briefly depooled and served by a mostly empty placeholder), and then, during the depool, running a mostly unbounded delete.
[21:14:17] I think I'm going to suggest this on the ticket and let the DBAs shoot it down. :) We're quickly exceeding my knowledge of MariaDB internals.
[21:14:50] That would probably work as a one-off to clear the backlog.
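In that depooled window, the backlog delete could in principle be a single unthrottled statement rather than a throttled batch loop; again a sketch with placeholder names, pending DBA review.

    # With the shard depooled and traffic on the placeholder, there is no read
    # traffic to protect, so the expired backlog can be cleared in one pass
    # (replication lag permitting).
    mysql parsercache -e "DELETE FROM pc_store WHERE exptime < NOW();"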