[14:21:37] ebernhardson, dcausse, is T410681 ready for fkaelin to start writing to it?
[14:21:38] T410681: Setup opensearch 3 on relforge servers - https://phabricator.wikimedia.org/T410681
[14:49:38] \o
[14:49:41] pfischer: yea
[14:50:26] if he wants to use the built-in ml inference it might still require some config
[14:50:33] but otherwise, should be ready
[15:21:52] sheesh, somehow the spark git repo is only 900M, but the spark-nlp repo was a 4GB clone
[16:22:52] :S a google search for `"spark.nlp.remote.model.dir"` returns only three results, and they are all job postings :P
[16:30:43] ebernhardson: so now you know where you're working next!
[16:30:58] lol
[16:54:04] taking dog out
[16:54:26] uninspired...but the only solution i can find to spark-nlp shipping the model from driver->executor (instead of hdfs->executor) is to slow down executor creation so it has time to ship the model...
[16:54:56] well, maybe, it hasn't finished without failures yet :P
[18:18:12] * ebernhardson wonders how terribly inefficient the _source field in opensearch is with vectors...it's going to store the entire thing as a json string, and i'm going to guess those float strings don't compress well
[18:19:45] noticing that my thing to generate vectors for each ~500 byte window in simplewiki results in 550GB of parquet...and that should be encoded as floats
[18:20:04] probably need smaller vectors, i haven't even checked whether it's generating the defaults
[18:26:50] a full encoding of simplewiki only takes ~20min with 512 cores though, which suggests models of similar complexity could generate larger datasets in a plausible time frame
[18:28:06] (content only, not the whole thing)
[22:08:48] ebernhardson: what was the dimensionality of the vectors you used? AFAIK research had already figured out a feasible size, just can’t find the source right now.
[22:09:18] pfischer: it turns out there was a problem with it emitting 768*128 vectors, rather than just a single 768-dim vector. So probably not such a big problem :)
[22:10:03] also that's just the default this thing had, not trying to make it work great, just to show vectors getting in and being queried out with some semblance of them matching
[22:23:49] Oh, two orders of magnitude… that could explain it. And which model do you use in spark? Is that still the OpenSearch default, just used externally?
[22:24:52] pfischer: in sparknlp i was initially using BertSentenceEncoder, in part because B is near the top of the list. But it really wanted to generate un-pooled tokens (odd for a sentence encoder). I've switched to RoBertaSentenceEmbeddings, which generates the pooled vectors as expected
[22:25:20] err, it was probably BertSentenceEmbeddings, but same idea.
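(For reference, a minimal sketch of the embedding job described above, using PySpark with the Spark NLP Python API. The table name, column names, output path, and the pinned spark-nlp version are hypothetical; the dynamic-allocation settings correspond to the "slow down executor creation" workaround from 16:54 and would need tuning.)

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, EmbeddingsFinisher
from sparknlp.annotator import RoBertaSentenceEmbeddings

spark = (
    SparkSession.builder
    .appName("simplewiki-embeddings")
    # Hypothetical version pin; any recent spark-nlp release should work.
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4")
    # Workaround: slow the dynamic-allocation ramp-up so the driver has
    # time to ship the model to executors as they come online.
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "60s")
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "60s")
    .getOrCreate()
)

# One row per ~500-byte content window (hypothetical table/column names).
windows = spark.read.table("discovery.simplewiki_content_windows")

document = DocumentAssembler() \
    .setInputCol("window_text") \
    .setOutputCol("document")

# Emits one pooled vector per document (768-dim for the default model),
# unlike the un-pooled per-token output seen with BertSentenceEmbeddings.
embeddings = RoBertaSentenceEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Strip the annotation metadata down to plain float arrays for parquet.
finisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCols(["embedding"])

pipeline = Pipeline(stages=[document, embeddings, finisher])
result = pipeline.fit(windows).transform(windows)
result.select("page_id", "embedding") \
    .write.parquet("hdfs:///tmp/simplewiki_embeddings")
```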
[22:26:07] I more want to see how opensearch behaves as we add more vectors, in-memory data structure usage, how on-disk (and optionally quantized) behaves, etc. Could load it with random numbers, but that seemed silly :P
[22:28:30] one unfortunate thing i kinda saw before but hadn't thought about...having multiple vectors means using nested documents. Will have to see how that behaves, but it's the recommended way.
[22:50:02] Oh, so representing article content as an array of sections for lexical search (T407521) would only save us from nested documents as long as we do not want vectors for each embedded section. Should we explore changing the document structure to nested documents, or is that a can of worms we want to avoid?
[22:50:03] T407521: Represent text in cirrus as an array of sections, rather than a flat string - https://phabricator.wikimedia.org/T407521
[22:51:37] pfischer: i imagine we would want to keep normal full text on the existing fields, it would mean we would have some field like `text_embedding` that's a nested type, and that would have some `vector` sub-field that has the data
[22:54:29] plenty of open questions there, haven't really thought about how vector results get highlighting.
[23:34:35] Okay, thanks for looking into this! Let's discuss the options at the Wednesday meeting. Heading out.
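(Similarly, a minimal sketch of the nested `text_embedding` mapping discussed at 22:51, using opensearch-py. The host, index name, dimension, and HNSW parameters are assumptions. Excluding the vector from `_source` is one way to sidestep the float-strings-as-json compression concern from 18:18, since k-NN search reads vectors from its own data structures rather than from `_source`, though it complicates reindexing.)

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "relforge.example", "port": 9200}])  # hypothetical host

client.indices.create(
    index="simplewiki_vectors",  # hypothetical index name
    body={
        "settings": {"index.knn": True},
        "mappings": {
            # Keep the raw float strings out of _source; search still works,
            # but reindex/update-by-query would lose the excluded field.
            "_source": {"excludes": ["text_embedding.vector"]},
            "properties": {
                # One nested doc per ~500-byte window of the article.
                "text_embedding": {
                    "type": "nested",
                    "properties": {
                        "vector": {
                            "type": "knn_vector",
                            "dimension": 768,
                            "method": {
                                "name": "hnsw",
                                "engine": "faiss",
                                "space_type": "l2",
                            },
                        }
                    },
                }
            },
        },
    },
)

# Queries wrap the knn clause in a nested query; each article is scored
# by its best-matching window.
query_vector = [0.0] * 768  # placeholder for a real query embedding
hits = client.search(
    index="simplewiki_vectors",
    body={
        "query": {
            "nested": {
                "path": "text_embedding",
                "query": {
                    "knn": {
                        "text_embedding.vector": {"vector": query_vector, "k": 10}
                    }
                },
            }
        }
    },
)
```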