[08:07:55] y0 piippöl and thanks for the great semantic database and UI
[08:08:40] http://develop.consumerium.org/wiki/Consumerium:Executive_summary needs a semantic database
[08:08:59] I've been compiling some nice lists with some skill in list compiling
[08:09:20] Pay-day loan hell? Go p2p: http://develop.consumerium.org/wiki/Lists_of_alternative_financial_services
[08:09:47] Need to get somewhere ecologically and economically? http://develop.consumerium.org/wiki/Lists_of_shared_and_alternative_transport_and_traveler_routers
[08:10:33] p2p information to base consumption decisions on? http://develop.consumerium.org/wiki/Lists_of_review_aggregators_and_review_sites
[08:10:59] The Invisible Green Hand at it again and again and again: http://develop.consumerium.org/wiki/Lists_of_price_comparison_services
[08:12:46] but I am trained in computers, so I know this meddling around with lists ain't gonna cut it and is only a headache when one knows one should build a database-driven service.. preferably a SPARQL-queryable database.. possibly on top of SQL or from scratch: https://en.wikipedia.org/wiki/Triplestore
[08:13:49] And Wikivoyage needs to get with the program
[08:13:53] program: wikidata
[08:17:52] Instead of one article redundantly stating "There is a transport link thingy to place X" and the other article stating "There is a transport link thingy to place Y", we could just have a "connection" item that also includes the "get to and get from" instructions for both terminals; then == Get in == and == Go next == could be generated, reducing 4x redundant manual editing
[08:18:57] Here is an article that arranges its content clockwise starting from west
[08:18:59] https://en.wikivoyage.org/wiki/Boating_on_the_Baltic_Sea
[08:19:43] We should generate the "To the west is Place X", "To the north is Place Y" etc. etc. from wikidatadatadata
[08:20:02] for each Wikivoyage article
[12:44:50] morning :)
[12:48:29] jzerebecki: got an idea what's wrong with https://gerrit.wikimedia.org/r/#/c/287657/ ? Wrong version of Scribunto?
[12:48:35] hey aude
[12:53:06] DanielK_WMDE__: https://en.wiktionary.org/wiki/%E7%8C%AB if you want to see an example of how kanji works :)
[12:56:00] also sometimes see https://en.wiktionary.org/wiki/%E3%83%8D%E3%82%B3
[12:57:37] * nikki wonders why you're talking about kanji
[12:59:22] * aude learning japanese :)
[13:00:33] cool :) japanese is fun
[13:00:57] it is
[13:02:16] is english wiktionary useful for these things? I just use google translate (for chinese characters) as a dictionary
[13:03:02] rom1504: often things are explained in more detail in wiktionary
[13:05:26] I tend to use jisho.org most of the time, but wiktionary does have more detail about some things, like for cat it has the pitch accent and the counter
[13:14:43] aude: yea, we have transliterations there, and "readings". We could treat readings as forms, but different readings actually carry different meanings. So another obvious option is having a separate lexeme for each reading. Needs more thought...
[13:16:31] oh, are you actually talking about how japanese would work with the proposal for fitting wiktionary data into wikidata?
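To make the 08:12–08:20 idea concrete: the kind of query a SPARQL-queryable backend would allow, here against the Wikidata Query Service. P47 (shares border with) and P625 (coordinate location) are real Wikidata properties; Q64 (Berlin) is just an example anchor, and the "To the west is Place X" phrasing would be computed afterwards from the two coordinate pairs. A rough sketch, not a finished data model:

    SELECT ?neighbour ?neighbourLabel ?coord WHERE {
      wd:Q64 wdt:P47 ?neighbour .   # places sharing a border with Berlin (Q64)
      ?neighbour wdt:P625 ?coord .  # their coordinates, to derive the compass direction
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }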
[13:20:07] nikki: ultimately yes
[13:20:18] and chinese
[13:20:49] DanielK_WMDE__: tests fail for me with https://gerrit.wikimedia.org/r/#/c/287657/
[13:23:38] sounds interesting
[13:24:14] on a related note, I was wondering what would happen with the translingual bit
[13:24:20] DanielK_WMDE__: the latest run uses f4501ccd224f1d8dbc764e12f94db132e03ffb96 of Scribunto, what do you mean wrong version?
[13:25:50] since that's information about the character itself, not about a word in a language... which then made me wonder about unicode characters in general
[13:27:26] nikki: that's what i was asking DanielK_WMDE__ :)
[13:27:49] oh, right :D
[13:30:54] information about the character itself would go into the item about the character, I guess.
[13:31:54] just like we have an Item about "A" and "Z" and "Д" and "Π".
[13:32:36] aude, nikki --^
[13:32:40] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1666 bytes in 0.626 second response time
[13:33:12] But modelling "readings" is still a bit open. Johl also talked to me about that yesterday.
[13:37:08] heh, figuring out which items we're missing will be fun
[13:38:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1678 bytes in 0.425 second response time
[13:45:40] there's other things which use translingual sections which aren't single characters, like https://en.wiktionary.org/wiki/40 where we have an item for the number, but that's only one of the meanings listed in the translingual section
[14:00:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1667 bytes in 0.591 second response time
[14:03:55] nikki: it seems to me that translingual things would go either into items, or stay on wiktionary.
[15:00:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1667 bytes in 0.192 second response time
[15:14:34] Hi speciesATtm and zorglub27!
[15:14:35] Hello, does anybody know how to generate a JSON dump of a wikibase instance that can be used by PropertySuggester-Python?
[15:20:18] speciesATtm: the dumpJson.php maintenance script in the wikibase repo
[15:27:07] Thanks a lot, found it.
[17:49:42] in this example: http://pastebin.com/eubuy8fL does anyone know what the v prefix is supposed to be? I'm guessing vcard, but that doesn't return any results.
[17:51:36] perhaps psv: ?
[17:52:03] aude: Thanks again for helping. speciesATtm: signing off here
[17:52:56] yeah, I think v is the old prefix that was used for psv
[17:53:13] presumably stands for "value"
[17:54:11] let me try that.
[17:55:34] hmm. https://www.mediawiki.org/w/index.php?title=Wikibase/Indexing/SPARQL_Query_Examples&oldid=2054945 this random old version of the examples page has v as http://www.wikidata.org/prop/statement/ which seems to match ps rather than psv
[17:57:35] and I just tried with the query you pasted and it seems using ps returns results and psv doesn't
[18:24:34] aude: are you around?
[19:33:06] SMalyshev: ?
[19:33:31] aude: hi!
[19:33:41] aude: wanted to talk about the search mapping thing
[19:33:47] sure
[19:34:12] esp. about ParserOutput. So is it an assumption now that every page has some parser output? even if it's not wikitext?
[19:35:36] SMalyshev: i think it is an assumption
[19:35:54] so what would that output contain?
[19:39:09] sorry, had to switch to tethering...
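Picking up the v:/ps:/psv: thread from 17:49–17:57 — the distinction behind the answer: ps: (what the old v: turned out to map to) points at the simple statement value, while psv: points at the structured value node. A minimal sketch against today's WDQS with its standard prefixes, using Mount Everest (Q513) and elevation above sea level (P2044) as the example:

    SELECT ?simple ?amount WHERE {
      wd:Q513 p:P2044 ?st .                    # the statement node
      ?st ps:P2044 ?simple .                   # simple value: a plain literal
      ?st psv:P2044 ?node .                    # full value node
      ?node wikibase:quantityAmount ?amount .  # unwrap the quantity
    }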
[19:39:29] i think it's expected to be able to get html from parser output
[19:39:47] also extra data, like category links, external links
[19:40:42] it can be up to the content type to take the contents of parser output and handle view / edit / etc. actions differently
[19:42:22] i could imagine someday there being an interface for ParserOutput with certain things always expected
[19:44:08] but if it's not wikitext then category links etc. may be meaningless
[19:45:31] and are all content handlers required to have an html representation? what if it depends on user, config, etc.?
[19:45:33] yeah
[19:46:18] there is now an option in ContentHandler to check if a content type supports categories and some other things (sections?)
[19:47:04] parser output is sometimes split by options, like language (on wikidata and commons), that are user specific
[19:47:31] other times, maybe a marker is added and then the user-specific stuff is assembled into the parser output on view
[19:47:42] aude: the thing is, to index wikidata we probably don't even need parser output. why waste time on generating it?
[19:47:57] for wikidata, we don't, at least not the html
[19:48:02] right
[19:48:09] we still need parser output for geodata
[19:48:19] and other meta information that we index
[19:48:22] e.g. pagelinks
[19:49:59] that sounds kind of weird, that we need an HTML representation that we never use
[19:50:09] we don't need the html.
[19:50:22] it's an option in Content::getParserOutput
[19:50:23] can we generate parser output without html?
[19:50:27] we can
[19:50:33] ah, ok
[19:50:36] we have a ticket for this
[19:51:05] not highest priority, but could possibly be done
[19:51:52] https://phabricator.wikimedia.org/T134547 :)
[19:52:15] then the question is - suppose we have each handler decide to generate or not generate output. But the indexer does not know it. So should we have some way to extract the generated output from the handler/content?
[19:52:24] is it cached automatically?
[19:53:47] we have getParserOutput on Content, but that one doesn't allow for caching as far as I can see
[19:53:50] then have getParserOutputForIndexing?
[19:54:02] or something where each content type can decide?
[19:54:23] if requested directly from content then i think it is not cached
[19:54:36] if requested via Article or some other places, i think it is
[19:55:08] Cirrus accesses it via the ParserCache class
[19:55:40] right, but I want to remove that from Cirrus. Cirrus should not do any content handling at all beyond calling getDataForIndex...
[19:55:56] but then it has that hook that wants ParserOutput...
[19:55:57] * aude nods
[19:56:25] as mentioned in gerrit, there are cases where we need it uncached
[19:56:28] but normally cached is good
[19:56:48] right, but after we got that uncached one we need to keep it to pass it to the handler
[19:57:03] i think so
[19:58:00] so maybe I'll just add getParserOutputForIndexing. But then the problem is it'd have to talk to ParserCache, and IIRC Daniel was not happy with Content dealing with this stuff
[19:58:09] maybe on ContentHandler then
[19:58:41] ContentHandler is where these things (anything that deals with services) should go
[19:58:42] ah, but ContentHandler is common for all pages, so I can't store page-specific data there
[19:58:51] some of the things in Content were a mistake to put there
[19:59:08] or is the ContentHandler object page-specific?
[19:59:18] i don't think it is page specific
[19:59:41] it just connects Content with other services and stuff
[20:00:53] right. And ParserOutput is page specific, so we have to keep it inside Content...
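The "generate parser output without html" option mentioned above, in code — a minimal sketch assuming the Content::getParserOutput() signature as of the 1.27-era core, where the fourth parameter is the $generateHtml flag:

    $content = $page->getContent();
    $output = $content->getParserOutput(
        $page->getTitle(),
        null,   // revision id (null = current)
        null,   // ParserOptions (null = canonical)
        false   // $generateHtml: skip the HTML, keep the metadata
    );
    $links = $output->getLinks(); // metadata such as pagelinks is still populated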
[20:01:27] or maybe WikiPage?
[20:01:34] ugh, wiki page?
[20:01:58] Content might be ok, though maybe Daniel or someone has a better idea
[20:03:37] aude: ok, another question I am still missing - is the WikiPage object a representation of a specific revision or of the general page?
[20:04:11] i think general page
[20:04:38] there are methods for obtaining the latest revision from WikiPage
[20:05:06] then what does $page->getRevision() mean?
[20:05:22] is it always the latest revision?
[20:05:23] i think it gets the latest, afaik, unless a revision id is specified
[20:05:44] it's the latest
[20:06:33] hmm... then this part: https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/Updater.php#L383 is not clear
[20:06:49] why do we need to get a revision - wouldn't it default to the latest anyway?
[20:08:38] i suppose to be clear which version of the content is being indexed
[20:09:01] as said in the comments, small details like REVISIONUSER should be accurate
[20:09:08] but it's always the latest?
[20:09:18] not always
[20:09:48] it says $page->getRevision()->getId() - isn't that always the latest?
[20:09:50] if there are a bunch of edits in close sequence, cirrus might be indexing something that is not the latest
[20:10:16] oh i see
[20:14:08] not sure i totally understand the necessity of specifying the revision here
[20:14:43] right, me neither... if WikiPage is not a specific revision, then I don't see the point of specifying the version at that stage
[20:15:17] i think $page->getContent() gets the latest
[20:15:41] maybe chad remembers
[20:16:00] https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/commit/18820d426187908b1aaee3bb2dbdf4c4d2363519
[20:17:04] ok, I think there's a good point about the pool counter...
[20:18:01] also I see that WikiPage's parse does much more work than Content's parse...
[20:19:53] will try to ping Chad
[20:19:57] ok
[20:20:25] or maybe DanielK_WMDE__ would know how it works?
[20:20:31] think it would be useful if he also reviews these changes, if he's interested
[20:20:37] he might
[20:21:14] DanielK_WMDE__: do you know what this patch does: https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/commit/18820d426187908b1aaee3bb2dbdf4c4d2363519
[20:43:48] SMalyshev, aude: added some thoughts. but i came in late, what did you discuss before i joined 20 minutes ago?
[20:44:45] DanielK_WMDE__: Basically how to handle ParserOutput. The problem is as follows: we need to let handlers decide how to produce ParserOutput (and whether to do it at all).
[20:45:12] OTOH, we need ParserOutput in the CirrusSearchBuildDocumentParse hook
[20:46:08] and also, we do not want to deal with ParserCache in Content, but we can not store per-page data (e.g. a generated ParserOutput) in ContentHandler
[20:46:52] so we need to 1) generate ParserOutput inside ContentHandler, somehow using ParserCache also and 2) somehow make it available to the CirrusSearchBuildDocumentParse hook
[20:47:05] without running the parser twice, of course
[20:47:24] but you could cache the generated ParserOutput in the Content object. probably just the last one, and the options that were used to generate it.
[20:47:51] DanielK_WMDE__: right, I can keep ParserOutput in Content. The problem is I can't generate it there without using ParserCache
[20:47:57] SMalyshev: for the hook, it's annoying that you have to generate ParserOutput before you know whether it is actually needed.
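What "cache the generated ParserOutput in the Content object" could look like — a hypothetical sketch, not existing core code: remember the last result keyed on the parameters that produced it, so a repeated call with the same params skips the re-parse. optionsHash() and legacyOptions() are real ParserOptions methods; the memoization wrapper itself is the invented part:

    // hypothetical: in a Content subclass, wrapping the parent implementation
    private $lastOutput = null;
    private $lastKey = null;

    public function getParserOutput( Title $title, $revId = null,
        ParserOptions $options = null, $generateHtml = true
    ) {
        $key = implode( '!', [
            $title->getPrefixedDBkey(),
            (string)$revId,
            $options ? $options->optionsHash( ParserOptions::legacyOptions() ) : 'canonical',
            (int)$generateHtml,
        ] );
        if ( $this->lastKey !== $key ) {
            // only re-parse when something actually changed
            $this->lastOutput = parent::getParserOutput( $title, $revId, $options, $generateHtml );
            $this->lastKey = $key;
        }
        return $this->lastOutput;
    }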
[20:48:00] and you didn't want ParserCache in Content, IIRC
[20:48:16] How about a lazy proxy for ParserOutput?
[20:48:19] DanielK_WMDE__: that's exactly what I want to avoid
[20:48:31] I mean generating it when I don't need it
[20:48:39] DanielK_WMDE__: what kind of lazy proxy?
[20:48:54] SMalyshev: i don't want ParserCache in Content, but a local cache of what this Content object had done before would be ok
[20:49:30] DanielK_WMDE__: right, but how would ParserOutput get into Content then? make a setParserOutput there?
[20:49:42] SMalyshev: LazyParserOutput would know a callback to construct an actual ParserOutput. It would do that on demand, and then forward all calls to the actual thing.
[20:49:58] That way, you can pass a LazyParserOutput as a parameter without doing too much work up front
[20:50:47] SMalyshev: as to caching: Content will just remember the last ParserOutput it returned from getParserOutput. So if the next call has the same params, it will just re-use the previous result.
[20:51:25] DanielK_WMDE__: but we can't call Content::getParserOutput directly, since it doesn't cache
[20:51:25] too bad we can't easily make ParserOutput an interface :/
[20:51:56] SMalyshev: we could make it cache at least the last result. that's what i'm saying
[20:52:09] well, there are two caches. I mean ParserCache
[20:52:21] so when we call it the first time, it should be able to use ParserCache
[20:52:29] Content?
[20:52:32] no, i don't think so.
[20:52:37] well, something...
[20:52:37] WikiPage, yes
[20:52:57] the problem is that WikiPage also does a lot more than parsing, including pool counters
[20:53:26] so we avoided that by having this code: https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/Updater.php#L382
[20:53:38] which checks the cache directly and then calls $content->getParserOutput
[20:54:02] yea, i just looked at that code. it seems to be an ok solution
[20:54:31] DanielK_WMDE__: so the problem there is that we need ParserCache, which can't be in Content, but we do want to keep the result of it in Content :)
[20:54:57] I can just make Content::setParserOutput, but that would be a bit dirty I feel...
[20:55:45] once we have multi-content revisions and virtual slots, we'd have something like RevisionHandle::getSlot( 'main-html' ) which would know how to get or create the html...
[20:55:48] also means we'd have to put it on the interface...
[20:56:09] SMalyshev: why do you want to put the output of ParserCache into Content?
[20:56:21] DanielK_WMDE__: I have to keep it somewhere...
[20:56:22] I'd prefer not to have any setters on Content.
[20:56:41] There is a notion of immutable Content objects that allows some optimizations.
[20:56:50] Me too, but I have to keep the ParserOutput object somewhere so that later the hook can access it
[20:56:52] if you add setters, we no longer have immutable Content...
[20:57:14] right. So the question is - where do I keep the ParserOutput then?
[20:57:20] SMalyshev: just put it into the ParserCache?...
[20:57:38] DanielK_WMDE__: that's an option...
[20:57:50] and when you call the hook, you fetch it from there. (if it's no longer there, you'd have to re-generate it, i guess)
[20:58:11] it would be kind of a bummer to have to parse a page twice
[20:58:27] yes, but it would only happen if the parser cache suddenly vanishes
[20:58:41] not likely. but since it's a cache, it *can* happen, and needs to be handled nicely
[20:59:16] right... and there's another problem - some handlers just may have no useful parser output
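A sketch of the LazyParserOutput idea floated above — entirely hypothetical, nothing like it exists in core. It holds a callback and only runs the expensive parse when a caller first asks for something; since ParserOutput is a class rather than an interface, the proxy has to extend it and forward every accessor, which is exactly why "can't easily make ParserOutput an interface" hurts:

    // hypothetical lazy proxy: defer parsing until first access
    class LazyParserOutput extends ParserOutput {
        /** @var callable produces the real ParserOutput on demand */
        private $callback;
        /** @var ParserOutput|null */
        private $actual = null;

        public function __construct( callable $callback ) {
            parent::__construct();
            $this->callback = $callback;
        }

        private function resolve() {
            if ( $this->actual === null ) {
                $this->actual = call_user_func( $this->callback ); // the real parse happens here
            }
            return $this->actual;
        }

        public function getText() {
            return $this->resolve()->getText();
        }

        // ...and so on, forwarding every other ParserOutput accessor the same way
    }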
[20:59:44] so what do they do now? what should they do in general?
[21:00:10] Also, some - like wikibase - may have different parser output for indexing. Namely, we don't need HTML for indexing
[21:00:22] but we may need outgoing links and geodata
[21:00:42] SMalyshev: that's why getParserOutput has a flag to turn off html generation
[21:01:02] set it to false, and you still get all the meta-data, but generating html is optional.
[21:01:07] DanielK_WMDE__: right... but what about the caching? how would we know which flags to use to fetch the cached object?
[21:01:27] that would have to be in the cache key somehow
[21:01:55] right, it probably is. But how would the hook know what the right cache key is?
[21:02:14] i don't think it is, actually, because that flag isn't in ParserOptions. it probably should be.
[21:02:37] I guess we need something like Content::getParserOutputForIndexing, which is cached
[21:02:42] yea, good question
[21:03:16] i'd still like to remove access to services from Content, not add more :/
[21:03:34] or maybe even both ContentHandler::getParserOutputForIndexing and Content::getParserOutputForIndexing
[21:03:36] but getParserOutputForIndexing sounds sensible.
[21:03:39] hm... actually...
[21:03:58] you only need the PO for calling the hook, right?
[21:04:18] otherwise, whether PO is used, and which one, is up to the ContentHandler
[21:04:27] DanielK_WMDE__: right. Well, except for when I need it for actual indexing :)
[21:04:35] right
[21:04:55] so maybe the lazy proxy solves this problem
[21:05:03] so ContentHandler builds the PO (or doesn't, if it doesn't need it), but then I need to somehow keep it around
[21:06:30] SMalyshev: maybe getFieldsForSearchIndex could somehow also return it. The caller should know how to pass it on, right?
[21:07:07] DanielK_WMDE__: hmm... not sure how. In general, two-return functions kind of suck
[21:07:07] not quite sure how to do this nicely. maybe we need a SearchFields object instead of a simple array. We could put the PO (or a promise) there.
[21:07:38] there are things in parser output that we want for indexing
[21:07:42] like pagelinks
[21:07:58] aude: of course
[21:07:59] i think these would be fields, but then maybe some caller of the hook wants to modify them, etc.
[21:08:03] I am thinking about having ContentHandler::getParserOutputForIndexing that would deal with all the cache things etc.
[21:08:06] and there is geodata....
[21:08:09] but what those are and how to get them depends on the ContentHandler
[21:08:16] which appends things to parser output
[21:08:29] though of course we can change how that works in the case of geodata
[21:09:07] aude: attaching things to ParserOutput should continue to work, if we manage to use the same ParserOutput object consistently
[21:09:09] it would be nice if ParserOutput had an interface, and there are some common things for all content types
[21:09:20] and others like categories only apply to some
[21:09:32] SMalyshev: sounds ok to me
[21:09:40] DanielK_WMDE__: that's one of the reasons we still need parser output at this point
[21:09:48] but I don't see where to keep the result... Maybe in WikiPage?
[21:10:00] is WikiPage immutable too?
[21:10:04] ick
[21:11:25] SMalyshev: if we don't have a good way to directly pass the PO around, let's just put it into ParserCache, if it wasn't already there.
[21:11:29] Doesn't hurt, right?
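The "just put it into ParserCache" fallback, sketched with the 1.27-era core API — ParserCache::singleton(), get(), save() and WikiPage::makeParserOptions() all exist; $page is assumed to be the WikiPage being indexed, and the re-parse branch covers the "cache suddenly vanishes" case:

    $parserCache = ParserCache::singleton();
    $parserOptions = $page->makeParserOptions( 'canonical' );
    $output = $parserCache->get( $page, $parserOptions ); // false on miss or eviction
    if ( !$output ) {
        // regenerate and stash it, so the hook later sees the same ParserOutput object
        $output = $page->getContent()->getParserOutput( $page->getTitle(), null, $parserOptions );
        $parserCache->save( $output, $page, $parserOptions );
    }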
[21:11:30] the problem here seems to be that we have an immutable object and we have to cache something that depends on it
[21:11:50] DanielK_WMDE__: I'm kind of concerned about the no-HTML option though
[21:11:51] indeed
[21:12:00] it doesn't seem to be part of the cache key
[21:12:18] though I guess I could fix it
[21:12:28] DanielK_WMDE__: there are also cases where we want uncached ParserOutput when indexing
[21:12:32] SMalyshev: yes, true. But thinking about the hook.... we can't use it. We don't know whether the hook handlers need the html, so we have to generate it.
[21:12:34] just to keep in mind
[21:13:18] hmm... that kind of sucks, since we'd have to generate tons of useless HTML in fear that some hook somewhere may want it
[21:13:28] aude: like, when?
[21:13:28] even though that HTML makes no sense for indexing
[21:13:38] DanielK_WMDE__: like enabling geodata for wikidata :)
[21:13:52] SMalyshev: that brings us back to the LazyParserOutput thing. it would generate the html on the fly.
[21:14:11] and suddenly there are coordinates in parser output (but not in ParserCache)
[21:14:17] but you can't really cache the lazy thing. can't serialize a callback.
[21:14:18] DanielK_WMDE__: hmm... not sure how that would work.
[21:15:04] making it skip the cache is not that hard IMHO, I can add an option
[21:15:19] SMalyshev: the idea is that ParserOutput doesn't need to know the HTML in advance if it knows how to make the HTML (via a callback).
[21:15:21] storing it is a bit of a hassle...
[21:15:32] SMalyshev: we have the option, but maybe it needs to be moved around
[21:15:35] basically, it would be a future/promise
[21:15:39] maybe I need to think about it differently
[21:15:39] it would be used from a maintenance script
[21:15:55] yea, let's think differently.
[21:16:01] can we ditch the hook?
[21:16:03] have ContentHandler::getParserOutputForIndexing and then just pass that output into getDataForIndexing?
[21:16:17] DanielK_WMDE__: probably not suddenly
[21:16:21] I would be happy to ditch the hook, but we can't do it right now
[21:16:29] aude: what uses it?
[21:16:36] DanielK_WMDE__: wikibase :)
[21:16:41] bah
[21:16:46] for one example, and geodata
[21:16:53] and who knows what else
[21:16:55] well, wikibase won't be using it for long when we're done :)
[21:17:00] yeah
[21:17:04] but wikidata and who knows what else is a problem :)
[21:17:08] I mean geodata
[21:17:56] can we modify it to not expect a ParserOutput object directly? existing handlers could be made agnostic
[21:17:56] so if I split it into 2 methods, I introduce an explicit ParserOutput parameter, which kind of sucks, but OTOH I don't have to worry that much about where to keep it
[21:18:20] the hook already provides the content, so if some case really needs it
[21:18:28] then they can already get the ParserOutput (uncached)
[21:18:44] right, but I don't want to break them if I can avoid it
[21:18:48] not that nice though
[21:18:52] right
[21:19:02] does the hook already get a PO as a param, or are we adding that?
[21:19:11] already
[21:19:33] right
[21:19:35] so maybe just have something like $output = $contentHandler->getParserOutputForIndexing(); $data = $contentHandler->getDataForIndexing($output) and then use $output for the hook
[21:19:59] DanielK_WMDE__: the hook already has it, so maybe the other handler should have it too
[21:20:11] yea
[21:20:26] but that means we *always* have to generate it
[21:20:30] the downside is that we're exposing stuff that is really an implementation detail, but the upside is we solve the "who keeps it" problem
[21:20:40] DanielK_WMDE__: new ParserOutput() is a valid output :)
[21:20:56] but yeah, it's a new function in the interface, but we add a bunch anyway
[21:21:39] that would clear up the information flow, yes!
[21:21:40] so if you don't care about parser output you can just add a dummy one
[21:21:51] if the only problem is the hook, we could also do this:
[21:22:14] introduce a new hook that doesn't take a PO directly (but a POPromise or a WikiPage or something)
[21:22:34] SMalyshev: i think some usages of the hook do care
[21:22:38] sadly
[21:22:59] maybe we can try to change most of them
[21:23:01] we detect if something is using the old hook, and only then pre-generate a ParserOutput with full html. Then we migrate the users of the old hook to the new hook, and kill the old hook
[21:25:03] aude: well, if some handler doesn't supply data for some extension, they'd have to deal with it
[21:25:45] DanielK_WMDE__: we already have a new hook to add data to search fields, SearchDataForIndex
[21:25:50] Hooks::run( 'SearchDataForIndex', [ &$fields, $this, $page, $engine ] );
[21:26:09] so that should replace that Cirrus hook eventually
[21:27:27] but I see it doesn't have ParserOutput... so I wonder what happens if an extension does need it
[21:28:02] it does take WikiPage, but if it asks for the PO it may re-generate it. Then again, if we cache it it's ok I guess
[21:29:20] SMalyshev: what is $this?
[21:29:22] Content?
[21:29:28] ContentHandler
[21:29:42] Content is inside $page (which is a WikiPage)
[21:29:43] but can still get Content, if needed
[21:30:13] so can get ParserOutput if really needed (but that should probably be discouraged)
[21:30:44] right... but I wonder if we can replace it
[21:30:49] e.g. what we do for GeoData
[21:32:15] * aude needs to think about it
[21:32:34] right, me too :)
[21:33:16] just keep in mind that wikitext + wikibase have geodata content
[21:33:37] but we can add ParserOutput to that handler too btw, since now it's a parameter to getDataForSearchIndex
[21:33:46] once they get to cirrus, they should be handled the same as geodata
[21:41:26] DanielK_WMDE__: another question I just remembered - should every content have a text field?
[21:41:41] right now only TextContentHandler and above have it
[21:41:49] but I can move that code to the base handler...
[21:42:09] though for some handlers that field may be useless
[21:42:44] maybe it should only be in TextContentHandler
[21:42:49] that's where it is now?
[21:43:11] oh https://github.com/wikimedia/mediawiki/blob/master/includes/content/MessageContent.php
[21:44:01] a non-TextContent that has text for search
[21:44:24] aude: why is it non-text then?
[21:44:34] maybe I don't understand what Text is...
[21:44:34] Message is weird
[21:44:44] i think almost everything else is TextContent
[21:45:49] aude: which content handler does it use?
[21:46:23] not sure it uses a content handler
[21:46:51] aude: hmm... then how would one index it? Indexing uses ContentHandler...
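What an extension-side handler for the SearchDataForIndex hook quoted above might look like, with GeoData as the motivating example. The parameter list mirrors the Hooks::run() call as quoted ($fields by reference, then ContentHandler, WikiPage, SearchEngine); the handler name, the 'coordinates' field and the getCoordinatesFor() helper are all made up for illustration:

    // hypothetical handler in a GeoData-style extension
    public static function onSearchDataForIndex(
        array &$fields, ContentHandler $handler, WikiPage $page, SearchEngine $engine
    ) {
        // expose extension data directly as an index field,
        // instead of fishing it out of ParserOutput in a Cirrus-specific hook
        $fields['coordinates'] = self::getCoordinatesFor( $page->getTitle() ); // hypothetical helper
        return true;
    }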
[21:47:06] not sure it is indexed in the same way
[21:47:20] well, if so then maybe we don't care :)
[21:47:22] messages are usually part of other content
[21:47:46] but it has getTextForSearchIndex
[21:47:49] so not sure
[21:48:04] well, some other handler could still use that function
[21:48:12] we don't remove it
[21:58:09] SMalyshev: i think for backwards compat, every content model should have a text field. That allows us to keep using getTextForSearchIndex
[21:59:01] aude: EntityContent is non-text, and provides text for search
[21:59:22] SMalyshev: the text field is also the one thing all search engines support.
[21:59:26] DanielK_WMDE__: yeah :)
[21:59:32] DanielK_WMDE__: well, should it really be so? EntityContent does produce text, but should it really?
[21:59:52] SMalyshev: it should, for search engines that don't support anything else
[21:59:52] SMalyshev: it predates JsonContent
[22:00:04] for cirrus, we could skip it, if we have that knowledge about cirrus.
[22:00:16] DanielK_WMDE__: other SEs can still take data from fields and merge it...
[22:00:35] like mysql search
[22:00:52] i'm not sure how many of the data fields we can support in dumb search
[22:00:53] SMalyshev: true. and if we didn't already have getTextForSearchIndex, that might be the way to go. though they wouldn't know how to merge it best.
[22:01:35] I'd leave it to the Content(Handler) to decide what to feed to which search engine
[22:01:54] the baseline default would be the text field, populated from getTextForSearchIndex
[22:01:58] DanielK_WMDE__: I guess I can move the text part to ContentHandler, but it feels a bit weird, and then there's no real way to not generate the text field even if you don't need it
[22:02:15] then also source_text and text_bytes
[22:02:47] not sure about source_text...
[22:03:08] content models that know e.g. that cirrus doesn't need it can skip it
[22:03:31] you have to know about the SE *and* the content model to be able to make that decision
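The "baseline default" the discussion converges on, sketched as a default implementation in the abstract ContentHandler — a hypothetical shape, not necessarily what eventually landed: the 'text' field comes from getTextForSearchIndex() (both real Content methods), with source_text and text_bytes alongside, and content models that know their engine better would override it to skip what isn't needed:

    // hypothetical baseline in ContentHandler: fields every engine understands
    public function getDataForSearchIndex( WikiPage $page, ParserOutput $output, SearchEngine $engine ) {
        $fields = [];
        $content = $page->getContent();
        if ( $content ) {
            $text = $content->getTextForSearchIndex();
            $fields['text'] = $text;              // the one field all engines support
            $fields['source_text'] = $text;       // raw source; skippable per model/engine
            $fields['text_bytes'] = $content->getSize();
        }
        return $fields;
    }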