[09:41:20] errand+lunch
[12:56:27] \o
[13:00:10] noticed friday...lucene does expose the regex AST as public final members of the RegExp class. Decided to skip all this extra complexity and went back to the original algorithm. The DFA stuff has too many heuristics trying to de-compile the DFA. It's all written, just going over tests now
[13:01:12] (original == blog post)
[13:01:36] o/
[13:03:05] feel kinda silly to spend all that time trying to make the DFA work, when lucene can do the easy version
[13:03:31] what blog post?
[13:03:47] https://swtch.com/~rsc/regexp/regexp4.html
[13:04:59] nik basically implemented the same concept, but starting from the DFA which doesn't have the mathematical soundness of the original approach since the DFA loses details while being determinized
[13:05:30] maybe lose is the wrong term, but it gets spread out through the graph so much
[13:05:47] sure
[13:07:26] one benefit, this should now be ~ O(1) on the number of input tokens in the AST, instead of based on the number of transitions it walks. I don't think it was really slow enough to matter, but nice that this now only fails if lucene can't determinize, we don't have a second layer failure anymore
[13:07:33] for the expression generation
[13:13:54] so you work at the RegExp level, inspecting all different Kind?
[13:14:07] .o/
[13:15:17] dcausse: it's surprisingly simple, this is the core of the algo in java: https://phabricator.wikimedia.org/P92459
[13:16:09] nice!
[13:43:17] The top level in the paste looks very straightforward, but I bet some devils are hiding in those details!
[13:46:20] the extractor is still ~600 lines, so indeed it's not the simplest thing ever :) But it's at least all strict rules based. And it's a port of a known-working implementation
[14:00:23] hm switching to the dp-sre opensearch images changed path.data from /usr/share/opensearch/data to /var/lib/opensearch...
unsure what to do, perhaps force to /usr/share/opensearch/data to avoid confusion on cirrus devs moving to opensearch 2
[14:01:12] dcausse: sounds reasonable, i suppose i constantly delete and re-create my local opensearch data but if it resolves an annoyance, seems easy enough and minimal maintenance overhead
[14:01:14] We can update the image if it's easier
[14:01:52] sure, inflatador the cirrus image has a custom opensearch.yml that I can change
[14:02:25] yeah, or we can copy stuff into the expected path if you prefer. Since it uses OS packages there are a few of those rough edges ;(
[14:03:49] hmm, i have a valid kerberos ticket till may 13, but org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 41610414 for ebernhardson) can't be found in cache
[14:03:55] always fun :P
[14:09:50] meh...i wasn't thinking about how the query strings in the logs are full cirrus things, pulling the regex queries out will take a moment of work (not that much, but i had thought this would be trivial for some reason)
[14:11:07] you mean filtering insource queries from backend logs?
[14:11:18] yea, i wanted to pull like 10k random queries and make sure nothing blows up
[14:11:29] feeding the regexes through
[14:11:51] sure, I would not expect that to be too difficult but with the cirrus syntax you never know :)
[14:11:56] i guess i have to make a small test case in cirrus that reads a log file and emits them after using the parser
[14:16:02] also opened https://github.com/opensearch-project/OpenSearch/issues/21604 upstream, will try and get a PR up soon
[14:17:04] i suppose this is running a few hundred random lucene regexes each time, might catch things, but i noticed the lucene random regex generation looks specialized to generating awkward and edge-casey syntax constructs (ex: `[]]` matches `]`)
[14:22:48] yes... possibly interesting to catch weird bugs but not useful for real-world use-cases...
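The AST walk discussed above (working at the RegExp level, switching on each node's Kind, per the regexp4 blog post) can be sketched roughly as below. This is a toy stand-in AST, not Lucene's RegExp class and not the actual ~600-line extractor; all names are invented for illustration, and it takes deliberate shortcuts the real algorithm does not (e.g. it ignores trigrams that cross concatenation boundaries, and gives up on literals shorter than three characters instead of tracking prefixes/suffixes):

```java
import java.util.*;

// Illustrative sketch only: a tiny stand-in AST loosely modeled on the
// public kind/exp1/exp2/s members that Lucene's RegExp exposes.
class TrigramSketch {
    enum Kind { STRING, CONCAT, UNION, REPEAT }

    static final class Node {
        final Kind kind; final Node left, right; final String s;
        Node(Kind kind, Node left, Node right, String s) {
            this.kind = kind; this.left = left; this.right = right; this.s = s;
        }
        static Node str(String s) { return new Node(Kind.STRING, null, null, s); }
        static Node concat(Node a, Node b) { return new Node(Kind.CONCAT, a, b, null); }
        static Node union(Node a, Node b) { return new Node(Kind.UNION, a, b, null); }
        static Node repeat(Node a) { return new Node(Kind.REPEAT, a, null, null); }
    }

    // "TRUE" means: this subexpression cannot narrow the candidate set.
    static final String ANY = "TRUE";

    static String extract(Node n) {
        switch (n.kind) {
            case STRING: {
                if (n.s.length() < 3) return ANY; // real impl tracks prefixes/suffixes
                List<String> grams = new ArrayList<>();
                for (int i = 0; i + 3 <= n.s.length(); i++) grams.add(n.s.substring(i, i + 3));
                return String.join(" AND ", grams);
            }
            case CONCAT: { // both sides must match; boundary-crossing trigrams ignored here
                String a = extract(n.left), b = extract(n.right);
                if (a.equals(ANY)) return b;
                if (b.equals(ANY)) return a;
                return "(" + a + ") AND (" + b + ")";
            }
            case UNION: { // either side may match; an unfilterable branch absorbs the whole union
                String a = extract(n.left), b = extract(n.right);
                if (a.equals(ANY) || b.equals(ANY)) return ANY;
                return "(" + a + ") OR (" + b + ")";
            }
            default: // REPEAT etc. may match the empty string: no filtering possible
                return ANY;
        }
    }
}
```

Because TRUE is the identity for AND and absorbing for OR, the result is always a safe over-approximation: any document matching the regex also satisfies the trigram expression, which is the soundness property the chat contrasts against the DFA de-compilation heuristics.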
[14:35:51] huh, interesting idea (that i'm not going to do :P). This routine has a thing where it notices that it's going to generate way too many trigrams and simplifies things down. In theory if the expansion ran per-query-node it could access an IndexReader for termStats and make better decisions about which trigrams to keep/simplify. But sounds complex
[15:00:52] I’ll be 5’ late for triage
[15:58:07] dcausse: the general idea is the runner just needs to know what image to use, we could have some form of env file in CirrusSearch that provides the image name. But now i'm realizing after saying that that it only helps if we branch
[16:03:01] could we have a top-level loop and run create-env varying the env file to alter the image? cindy runtime might double but perhaps still OK?
[16:03:47] I see that we docker destroy on every run so it's not like re-creating the env is adding much overhead
[16:05:36] for cirrus that means we start accepting opensearch 2 now without branching
[16:07:13] yea it should be ok
[16:07:26] it takes longer, but cindy is idle most of the time
[16:09:59] hm.. just looking at first-run.sh wondering if that's going to be annoying
[16:11:26] or we just set up a separate cindy instance...
[16:14:08] dcausse: first-run shouldn't matter, cindy basically never runs it
[16:14:28] iirc that was about cloning repos, and putting LocalSettings.d pieces in place
[16:14:58] but it has some "$MW docker env set ELASTICSEARCH_IMAGE" so wondering if not running that will have side-effects if switching ELASTICSEARCH_IMAGE after the fact
[16:15:44] dcausse: oh, that's probably unnecessary. We also set it in create-env.sh
[16:15:53] so it can apply env overrides
[16:15:57] ok
[16:16:19] locally i run ELASTICSEARCH_IMAGE=... ./create-env.sh for testing regularly
[16:16:35] mwcli does pull opensearch simply because it has CirrusSearch as extension?
[16:17:15] it's from the `$MW docker elasticsearch create` line in create-env.sh
[16:17:37] ok I missed that
[16:26:28] going to try this first https://gitlab.wikimedia.org/repos/search-platform/cirrus-integration-test-runner/-/merge_requests/25, couple months ago I added the opensearch service in mwcli
[16:27:00] was very confusing to re-use "elasticsearch" for pulling opensearch
[16:27:01] dcausse: lgtm!
[16:27:06] yes, i agree
[16:27:27] i'm not great at cleaning up behind me...trying to skip the yak shaving but some of that is necessary
[16:28:04] :)
[16:31:21] sigh... cindy is not passing currently, I'll need to investigate why before tweaking anything...
[16:34:44] super handy to have screenshots linked from gerrit https://cirrus-integration.wmcloud.org/20260511-153543-1285791/screenshots/When-you-search-for-text-that-is-in-a-file%252C-you-can-find-it%21.png
[16:34:51] esp. when they have a stack trace :)
[16:42:52] looks like there is a way to run arbitrary services with docker in our Puppet repo, ref https://gerrit.wikimedia.org/g/operations/puppet/+/46e1f2c53e61fed9ba7c85b77acce26add866d22/modules/service/manifests/docker.pp . Might be a good thing to try in Relforge so y'all could switch OS versions more easily
[16:42:58] Let me know what you think
[16:43:22] nice, that does help out
[16:44:49] lunch, back in ~1h
[16:47:18] dinner
[17:59:11] i wonder who is doing what, from a random sample of 15k insource/intitle queries in april, 2500 of them are of the form `XХ век` but with ~60 unique combinations of roman numerals and suffixes
[17:59:16] not important, just curious :)
[18:02:52] Maybe the artificial is not so intelligent?
[18:06:55] hard to say, maybe. Certainly looks algorithmic: `XIХ\u00A0в\.` `XIХ\u00A0вв\.` `XIХ\u00A0век` `XIХ в\.` `XIХ вв\.` `XIХ век`
[18:13:34] surprising to me...no failures running 15k random user queries through the new system.
Hopefully good enough [18:14:19] it feels like an analysis of expression differences between the old and new system would be valid...but seems unnecessary and time consuming [19:14:45] i suppose on the upside, i see lots of usage of the new syntax in query logs. Especially the shorthands [20:20:16] * ebernhardson wants prometheus counters in search-extra, would be nice to increment things like when it tries to use a degraded disjunction because the expression is too large [20:20:28] don't know what i would do with it, but i'm curious how often this hits
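The counter idea floated at the end could look roughly like the sketch below. Everything here is hypothetical: the class, method, and metric names are invented, the clause limit is made up, and `LongAdder` merely stands in for a real Prometheus counter (e.g. the `io.prometheus.client.Counter` from the Java simpleclient) so the sketch stays self-contained:

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: count how often the expression builder would fall
// back to a degraded disjunction because the trigram expression grew too
// large. A real search-extra change would register a Prometheus counter
// instead of this LongAdder stand-in.
class DegradedDisjunctionMetrics {
    static final int MAX_CLAUSES = 1024; // illustrative limit, not the real one
    static final LongAdder degradedDisjunctions = new LongAdder();

    // Returns true when the full expression is usable; when it is too large
    // we record the degraded case so its frequency can be graphed later.
    static boolean fullExpressionUsable(int clauseCount) {
        if (clauseCount > MAX_CLAUSES) {
            degradedDisjunctions.increment();
            return false;
        }
        return true;
    }
}
```

Even without acting on the number, a monotonically increasing counter like this answers exactly the "i'm curious how often this hits" question from a dashboard.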