[10:54:40] dcausse: Is T414066 a blocker, or could we try to request it from DPE? (see https://phabricator.wikimedia.org/T403298#11499023)
[10:54:40] T414066: Download enterprise structured content snapshots in hdfs - https://phabricator.wikimedia.org/T414066
[10:56:15] pfischer: T414066 is not a blocker; we'll be using our credentials for this. Some code has already been written: https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/blob/main/discolytics/cli/import_enterprise_dumps.py
[10:56:24] we need to schedule it from Airflow
[10:56:45] the broader question is perhaps where to land these in HDFS in case others are willing to use them
[10:57:11] but my understanding is that folks are mainly interested in the HTML dumps, not the structured ones
[11:00:47] Okay, but we have a legit use case here too, so once DPE has a pipeline in place, it could/should download both. Since we do not have a timeline for our own parsing/tokenizing, we'll rely on those Enterprise dumps for as long as necessary.
[11:02:02] We could hand over import_enterprise_dumps so it is no longer our responsibility to maintain that pipeline.
[11:04:05] pfischer: definitely
[11:40:52] lunch
[15:00:45] \o
[15:02:40] .o/
[15:08:21] o/
[15:20:51] sigh... mjolnir feature collection has been failing for a while
[15:22:39] Exception: Did not collect equal number of rows per feature
[15:22:53] not sure how important it really is to get that going right now, though
[15:23:50] yes... I've been ignoring glent failures as well... :(
[15:27:32] we can file a task and ponder that for later...
I filed T412673 for glent
[15:27:33] T412673: Glent generate_query_similarity_candidates fails with NPE - https://phabricator.wikimedia.org/T412673
[15:27:44] yeah, I suppose that's the right course
[15:28:52] moving T414066 & T414070 to the workboard
[15:28:53] T414066: Download enterprise structured content snapshots in hdfs - https://phabricator.wikimedia.org/T414066
[15:28:53] T414070: Chunk, trim and generate passage embeddings from enterprise structured content snapshots - https://phabricator.wikimedia.org/T414070
[15:32:09] filed T414095 & T414091 as well
[15:32:09] T414095: Configure opensearch ML connectors/models - https://phabricator.wikimedia.org/T414095
[15:32:10] T414091: Import passage vectors into opensearch - https://phabricator.wikimedia.org/T414091
[17:17:36] lunch, back in ~60m
[18:15:05] * ebernhardson is writing docs for all our debug APIs... it's way longer than I was expecting :P
[18:31:35] back
[18:34:15] I can imagine... thanks for taking care of this!
[18:34:51] I'm cheating: Claude wrote the first draft. But it's still going to take hours to get it into a good place
[18:35:49] maybe hours is underselling :P
[18:41:03] :)
[18:48:25] dinner
[20:01:57] appointment, back in ~90
[21:04:02] hmm, after writing "All these debug features are publicly accessible. The wiki is a public resource, and these APIs don't expose any private data" I wonder... maybe we should have some config flag that turns them off? Our use case is that everything is public, but that might not be true of external Cirrus users.
[21:04:22] but I dunno, it's been fine. It's just becoming better documented now
[22:05:09] getting close: https://phabricator.wikimedia.org/P86859
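[Editor's note] The scheduling step discussed at 10:56 (running the existing `discolytics/cli/import_enterprise_dumps.py` from Airflow) could be sketched roughly as the minimal DAG below. This is an illustration only: the DAG id, monthly cadence, retry settings, and CLI flags are all assumptions, not the team's actual pipeline, and the HDFS landing path is deliberately left open since the discussion above says it is still undecided.

```python
# Hypothetical Airflow DAG sketch for scheduling the enterprise-dump import.
# Assumes Airflow 2.4+ (the `schedule` parameter); all names here are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="import_enterprise_dumps",      # hypothetical DAG id
    start_date=datetime(2025, 1, 1),
    schedule="@monthly",                   # assumed cadence for snapshot releases
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=30)},
) as dag:
    # Invoke the existing discolytics CLI entry point; the flag and the
    # HDFS landing path are placeholders, not the real interface.
    import_dumps = BashOperator(
        task_id="import_enterprise_dumps",
        bash_command=(
            "python -m discolytics.cli.import_enterprise_dumps "
            "--output-path hdfs://..."
        ),
    )
```

If the pipeline is later handed over to DPE (as proposed at 11:02), a DAG definition like this is a small, self-contained unit to transfer along with the script.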