[00:00:54] not easy
[00:00:58] I can do it in a few hours probably
[00:01:14] so no phab ticket needed then? do you need the user login?
[00:01:15] no worries, I can wait.
[00:01:28] User:chris.deless
[00:01:45] no, there's a script that does this, but it's slow
[00:02:45] So is it a glitch in the account creation process then that causes this?
[07:55:36] [[Tech]]; ArchiverBot; Bot: Archiving 1 thread (older than 30 days) to [[Tech/Archives/2016]].; https://meta.wikimedia.org/w/index.php?diff=15749963&oldid=15739278&rcid=7996534
[14:06:59] whee
[14:18:06] DanielK_WMDE: sorry for the lateness, we just got the cat back from vet last night and i'm madly catching up on everything
[14:18:12] good news is he's feeling better :D
[14:18:31] looking over a few things from last mail; blob storage batch loads are an interesting case
[14:18:55] i'm thinking we should have a multi-lookup interface similar to file lookups on file repos
[14:19:39] that should allow both sql-based and rest-based blob backends to parallelize, using an 'in (...)' in sql or MultiHttpClient/VirtualRESTClient for http/rest
[14:19:56] in most cases, an individual revision's blobs will probably be on the same service but not always
[14:20:35] in the case where they're not all in the same place, a multiplexer can use the same BlobLoad interface to split them up to backends based on the type in the address
[14:20:44] this may mean we need fancier blob addresses :)
[14:20:59] lemme double-check what's in there now
[14:21:37] the case where they wouldn't all be on the same service for a single revision might be where an edit was made updating one slot but not all, and the previous rev was stored on a different service
[14:24:54] just thinking of how to transition things wrt multiplexers or switching service backends safely
[14:41:05] ok good we already have a multiplexer in there :D
[14:41:08] * brion adjusts
[14:41:49] brion: we have a multiplexer for blob storage.
at least for virtual slots, we will also need one on the slot level.
[14:42:05] yep
[14:42:16] brion: there are a few todos in the RevisionBlobInfo class that you may want to look at, too.
[14:43:11] so.. my goal for today would be to a) have a rough idea for all points in my mail and b) get confirmation from you that the refactoring of WikiPage and Revision is on the right track.
[14:44:03] great :)
[14:44:04] The introduction of PageUpdateController makes this step more complex than it would be if we just added multi-content support to wikipage as it is. but it should make further refactoring a lot easier.
[14:44:38] *nod*
[14:45:17] ok good, the prefix-routing blob lookup stuff looks about right. should adapt well to batchable lookups similar to file repos
[14:45:35] and stores too if one wishes :D
[14:46:47] heh i was gonna say something about 'is blobstore a clear enough name or should we call it contentblobstore' but we have namespaces now :DDDDD yay php 5+ modernity
[14:46:49] yea. should be easy to group all blobs that come from the same store, and then do a batch lookup if the store supports that.
[14:47:50] brion: there's no Blob class so far. I don't think we need one. a blob is really just a plain string. if you want to look into it, parse it into a Content object. If you need meta info, look into the Slot (aka RevisionBlobInfo)
[14:48:17] yeah, strings and Content wrappers serve for the actual blobs :)
[14:48:27] i have some more experimental code for this sitting at home. i'll push it later
[14:48:31] nice
[14:49:07] DanielK_WMDE: did we get a prefix-multiplexing store in there as well or just the loader?
[14:49:23] if i'm reading correctly we'll need a matching store to add the prefixes
[14:53:41] brion: there is no prefix multiplexing blob store, because the blob store does not take an address to write to. the blob store generates an address.
[14:54:02] right, but the store won't know how to prefix itself will it?
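The batch-lookup idea discussed above (group addresses by the store they came from, then do one batch call per store) could look roughly like the following. This is only a Python sketch, not MediaWiki code: the class names, the `load_many` method and the `prefix:local` address format are assumptions made for illustration.

```python
from collections import defaultdict


class DictBlobLookup:
    """Hypothetical in-memory backend that can load several blobs in one call."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)

    def load_many(self, addresses):
        # A SQL backend might issue a single "IN (...)" query here,
        # a REST backend might fire parallel HTTP requests instead.
        return {address: self._blobs[address] for address in addresses}


class PrefixRoutingLookup:
    """Splits 'prefix:local' addresses by prefix and batches per backend."""

    def __init__(self, backends):
        self._backends = backends  # maps prefix -> backend

    def load_many(self, addresses):
        grouped = defaultdict(list)
        for address in addresses:
            prefix, local = address.split(":", 1)
            grouped[prefix].append(local)

        result = {}
        for prefix, local_addresses in grouped.items():
            # One batch call per backend, however many blobs it holds.
            loaded = self._backends[prefix].load_many(local_addresses)
            for local, blob in loaded.items():
                result[f"{prefix}:{local}"] = blob
        return result


router = PrefixRoutingLookup({
    "sql": DictBlobLookup({"17": "wikitext of the main slot"}),
    "rest": DictBlobLookup({"ab12": "tabular json blob"}),
})
blobs = router.load_many(["sql:17", "rest:ab12"])
```

A revision whose slots live on two services is then loaded with one call per service instead of one call per blob.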
[14:54:22] it'll return its own address only, with no prefix
[14:54:24] brion: for writing, we multiplex based on the slot. so you would say "use blob store x for slot y".
[14:54:56] * brion looks
[14:55:13] yes, that's correct. the thing that does the multiplex on writing needs to agree with the multiplexing blob lookup on the address format
[14:55:24] that bit may actually still be missing
[14:56:27] ahhh i see
[14:56:42] so SlotMappingContentStore does the prefix
[14:56:48] that feels weirdly asymmetric to me
[14:58:05] it looks like BlobStore and BlobLookup should match up :)
[14:58:12] but here, they don't
[14:58:41] ok we can fiddle with those a bit
[14:58:43] brion: yes. i have thought about this quite a bit. I see no better way to do this
[14:58:50] not if BlobStore is free to assign addresses
[14:59:40] DanielK_WMDE: what's envisioned for the slot-based store assignment? are we thinking separate stores for text, js, css, categories, etc? or is that meant more for alternate storage for derived slots?
[14:59:46] brion: one solution would be to tell the BlobStore what prefix to use. It would prepend it when generating an address, and expect it when resolving an address.
[14:59:53] in which case i understand why it's at that level
[15:00:34] for text, css, js etc, i would probably use the same (default) store. For derived, we may use something that has a different performance/robustness tradeoff.
[15:00:40] *nod*
[15:00:48] but think of uploaded media or tabular data
[15:00:58] we would want specialized storage for these too, right?
[15:01:05] an alternative is for the high-level prefixed lookup to be done with a different interface from the low-level lookup; that way nobody goes straight to a low-level blob store with a prefixed address by mistake
[15:01:24] we might, or we might not :)
[15:01:31] for media heck yeah ;)
[15:01:32] brion: so PrefixRoutingBlobLookup would not implement BlobLookup?
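One of the options floated above is to tell the blob store which prefix to use, so that it prepends the prefix when generating an address and expects it when resolving one. A minimal Python sketch of that option follows; `PrefixedBlobStore` and its methods are hypothetical names, and the in-memory dict stands in for whatever real backend would be used.

```python
import itertools


class PrefixedBlobStore:
    """Hypothetical store that owns its prefix for both writing and reading."""

    def __init__(self, prefix):
        self._prefix = prefix
        self._blobs = {}
        self._ids = itertools.count(1)

    def store(self, blob):
        # The store still generates the address itself, but prepends its prefix.
        address = f"{self._prefix}:{next(self._ids)}"
        self._blobs[address] = blob
        return address

    def load(self, address):
        # Reject addresses that belong to some other store.
        if not address.startswith(self._prefix + ":"):
            raise ValueError(f"address {address!r} lacks prefix {self._prefix!r}")
        return self._blobs[address]


text_store = PrefixedBlobStore("text")
address = text_store.store("hello world")
```

With this shape the writer and the prefix-routing reader agree on the address format by construction, at the cost of every store needing to know its own prefix.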
[15:01:38] that would also work for me
[15:01:38] for data, ideally we'd store as non-blobs ;)
[15:01:49] but i know for now we're looking at stuff like yuri's json blobs
[15:01:49] yes, absolutely
[15:01:57] which are more like uploaded files than they are like text
[15:02:28] with the current design, we'd still serialize/deserialize, but the storage backend could be aware of the json structure, index parts, etc
[15:02:49] DanielK_WMDE: yeah i'm thinking we might want the prefixed loader to sit at a level more similar to SlotMappingRevisionContentStore so it's a bit clearer
[15:02:57] yeah that's actually very possible!
[15:03:10] as long as the content type is well-defined the store could indeed be "smart" :D
[15:03:35] ok let me adjust my comments
[15:04:19] i just commented on PrefixRoutingBlobLookup.php
[15:23:42] saved first set of notes; will dive into WikiPage and the update controller in a few minutes
[15:23:47] brb
[15:25:45] brion: cool, thanks
[15:26:21] i'll be in the office for another 2 to 3 hours. will be online a couple more hours later tonight, but no longer in work mode
[15:31:18] ok
[15:31:31] gotta grab some breakfast then i'll be back in a few :D
[15:49:13] ok, writing up some notes from the email
[15:49:50] DanielK_WMDE: do we have any notes on the sub-slots/sub-revisions idea you mentioned in the mail?
[15:51:37] brion: my conversation with gabriel on the rfc
[15:51:41] ok
[16:02:11] hello! What team works on the notifications panel?
It appears the icons have been swapped for notifications / messages
[16:03:34] musikanimal: the collaboration team in editing
[16:03:54] they hang out in #wikimedia-collaboration
[16:04:07] thanks
[16:18:37] DanielK_WMDE: ok sent you an email reply with some notes on the questions to address you listed
[16:19:33] we'll probably want to formulate a few 'need some feedback' proposals for the mailing list on the api, dumps, and ui issues
[16:19:36] but i think we're coming along well :)
[16:19:53] still have to wrap my head around sub-stuff though, that sounds like it could get very complex
[16:26:52] brion: sub-slots are fairly simple. let's say we have an html slot for parser output. we'll want to be able to have that in multiple languages (or more generally, split on parser cache key)
[16:28:15] brion: my solution to this is structured slot names: we'd store the parser output in a slot called html.en, html.de, etc. These "sub-slots" would use the configuration (content model, blob store) for "html", but would be named/addressed as "html.en", "html.de", etc
[16:28:31] ah nice
[16:28:37] yeah that's quite workable
[16:28:44] . or : or whatev as common separator
[16:28:59] 32-char slot name should be plenty for most cases we envision
[16:29:44] brion: yes... except for the next one:
[16:29:49] I'm proposing to handle "sub-revisions" in the same way, though i'm not sure yet whether that is correct. Let's say we want to store all "renders" of a page. A new "render" is generated when the page is re-rendered due to a template change or some such.
[16:30:29] in restbase, each such render is identified by a timestamp (page_touched) plus a uuid.
[16:30:42] heh that gets big ;)
[16:30:45] i'm thinking we can map this to the sub-slot mechanism:
[16:30:57] html.en..
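The structured slot name idea above (a sub-slot such as "html.en" borrows the configuration of its base slot "html") can be sketched as a simple lookup. This is an illustrative Python sketch only; the config keys, the store names and the choice of "." as separator are assumptions, not settled design.

```python
# Hypothetical per-slot configuration, keyed by base slot name.
SLOT_CONFIG = {
    "main": {"content_model": "wikitext", "blob_store": "default"},
    "html": {"content_model": "html", "blob_store": "derived"},
}

SEPARATOR = "."  # the chat leaves the separator open ("." or ":")


def base_slot(slot_name):
    """Return the part of the slot name before the first separator."""
    return slot_name.split(SEPARATOR, 1)[0]


def slot_config(slot_name):
    """Sub-slots are addressed by full name but configured by base name."""
    return SLOT_CONFIG[base_slot(slot_name)]


config_en = slot_config("html.en")
config_de = slot_config("html.de")
```

Both sub-slots resolve to the same "html" configuration while remaining separately addressable by their full names.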
[16:31:15] yes, doable with a longer field
[16:31:19] yea, it gets big, but these slots don't need to go to the text table, or be pre-generated at all
[16:31:28] think of this as an interface for RestBase
[16:31:56] yep
[16:31:57] basically, we already have a storage facility for this. we just need to integrate it.
[16:32:16] i'm still not 100% convinced about derived slots for renders (rather than a separate long-term cache interface), though doing them outside of the content/slot system might mean replicating common infrastructure which sounds un-ideal too
[16:32:22] we'll need to model the "slot storage" interface a bit more nicely, i think
[16:32:39] i'm also not 100% convinced about storing derived slots.
[16:32:45] i'm very convinced about virtual slots
[16:33:04] yes, virtual slots are also a forward/backward-compat opportunity
[16:33:16] which is good :D
[16:33:16] virtual slots are not recorded in the slots table at all. they cannot be enumerated.
[16:33:19] indeed :)
[16:34:17] for instance a 'categories' virtual slot could read from a json list, or from the parsed text data, and consistently return data no matter whether an old or new rev
[16:34:22] so many happinesses possible :D
[16:36:49] brion: i could even read from the categorylinks table
[16:37:03] well, only for the current revision. but still
[16:39:04] i'm not quite happy with the RevisionContentStore/RevisionContentLookup interfaces yet. They need a little more thought, I suppose.
[16:39:41] The main issue I see is the fact that you can't get slot meta-data for a virtual slot right now. it would not be listed by getRevisionSlots. You could get a Content object for it, though.
[16:39:49] later today & this weekend i'll write up some sample update code, api usage use case, dump usage, etc on a wiki page and just make sure they look right, then solicit some more feedback from folks who would be affected
[16:40:04] hmmmm, yeah
[16:40:04] that would be excellent!
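The 'categories' virtual slot example above (read a stored json list when the revision has one, otherwise fall back to the parsed text) might be sketched like this in Python. The slot names, the dict-of-strings revision shape and the regex-based "parsing" are all simplifying assumptions; real category extraction would go through the parser.

```python
import json
import re

# Very rough stand-in for parsing: matches [[Category:Name]] and [[Category:Name|sortkey]].
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)")


def virtual_categories(stored_slots):
    """Return categories for a revision, whichever way it was stored.

    stored_slots is a hypothetical mapping of slot name -> blob string.
    """
    if "categories" in stored_slots:
        # Newer revision: categories live in their own slot as a json list.
        return json.loads(stored_slots["categories"])
    # Older revision: derive the list from the main text.
    return CATEGORY_LINK.findall(stored_slots.get("main", ""))


old_rev = {"main": "Some text [[Category:Cats]] [[Category:Vets|sortkey]]"}
new_rev = {"main": "Some text", "categories": '["Cats", "Vets"]'}

old_categories = virtual_categories(old_rev)
new_categories = virtual_categories(new_rev)
```

Callers get the same answer for old and new revisions, which is the forward/backward-compat property mentioned in the discussion.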
[16:40:21] so no enumeration for virtuals, but you should still be able to ask about it without invoking the loader right?
[16:40:27] (ideally)
[16:40:38] Also, for derived or virtual data, the Content interface is a bit heavy. In particular, in an HTML slot, you really want to have a ParserOutput object
[16:40:58] yeah, and things like the sha1 or length aren't amenable for virtuals really
[16:41:06] could be
[16:41:09] but not required
[16:41:20] so here we really want a more general 'give me an object for this info about this revision'? hmm
[16:41:44] i wonder if virtual slots are actually making our life harder and should be slightly more separate
[16:41:55] one thing i have been thinking about is to just use an assoc array for blob meta-data. some stores could provide extra data efficiently. would be a nice optimization vector.
[16:42:00] allow providing a Content but don't require it...
[16:42:22] perhaps we could have a SlotData interface
[16:42:30] Content would extend that
[16:42:44] SlotData would really only have very few methods. getType(), copy(), equals().
[16:43:02] Not generic access to any data - you have to know the concrete type for that
[16:43:16] (i would love to deprecate getNativeData on Content anyway)
[16:43:35] or we could support arbitrary objects for virtual slots
[16:43:49] hm...
[16:44:02] hmmmm, i kinda like the SlotData idea
[16:44:07] do we want to treat virtual slots like content slots at all? is there any context in which they should be treated the same?
[16:44:26] i hate losing what little type-safety php has in terms of type annotations on params :D
[16:44:46] i think unless the virtual slots can actually take the place of a non-virtual slot, it may make more sense to separate them
[16:44:56] Maybe we should treat them separately. RevisionContentLookup could have a getMetadata( $name ) method that returns an arbitrary object.
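The SlotData idea above (a tiny interface with getType(), copy() and equals(), which Content would extend) could be sketched as follows. This is a Python illustration rather than the PHP it would actually be written in; `CategoryList` and the constructor arguments are hypothetical, and only the three method names come from the discussion.

```python
from abc import ABC, abstractmethod


class SlotData(ABC):
    """Minimal interface: no generic data access, callers must know the concrete type."""

    @abstractmethod
    def get_type(self): ...

    @abstractmethod
    def copy(self): ...

    @abstractmethod
    def equals(self, other): ...


class Content(SlotData):
    """Stand-in for a full content object; it adds serialization on top of SlotData."""

    def __init__(self, model, text):
        self.model = model
        self.text = text

    def get_type(self):
        return self.model

    def copy(self):
        return Content(self.model, self.text)

    def equals(self, other):
        return (isinstance(other, Content)
                and (other.model, other.text) == (self.model, self.text))

    def serialize(self):
        return self.text


class CategoryList(SlotData):
    """Hypothetical lightweight virtual-slot value that is not a Content."""

    def __init__(self, names):
        self.names = tuple(names)

    def get_type(self):
        return "categories"

    def copy(self):
        return CategoryList(self.names)

    def equals(self, other):
        return isinstance(other, CategoryList) and other.names == self.names


page = Content("wikitext", "hello")
cats = CategoryList(["Cats"])
```

Parameters can then be typed as SlotData where either kind is acceptable and as Content where serialization is required, which keeps some of the type-safety brought up in the discussion.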
[16:45:13] if we always use virtual slots as a front-end which might have a slot backend, or might pull from somewhere else, then there's no real need for it to be the same interface i think
[16:45:16] yeah
[16:45:21] i'm leaning towards that too :D
[16:45:39] basically, virtual slots can take the place of derived slots. and for derived slots, it would be nice if they could use the same storage mechanism as primary slots.
[16:46:13] well reusing the BlobStore/BlobLookup & friends should work for that
[16:46:21] yes
[16:46:48] one concern is whether batch-recompression or similar stuff has to look in multiple places for things to compress, but that's not super tricky
[16:46:54] i'd still want the access to the virtual thingies to be defined by the RevisionContentLookup interface.
[16:47:01] and actually we'd probably use very different storage options on the derviced stuff
[16:47:08] it should be clear that these are associated with revisions in the same way as slots are
[16:47:14] *derived
[16:47:14] yeah
[16:47:14] they go together, tied to the rev
[16:48:19] yeah
[16:48:40] so, for a baseline impl, i'd go for including virtual "extra info" (not slots), but not derived slots.
[16:48:54] i'm not sold on storing (and updating!) derived info on revisions
[16:49:04] *nod*
[16:49:11] well, at least not in the same way as "primary" info.
[16:49:16] lots of potential for confusion there
[16:49:22] i see the use for keeping revision-related data, but it does feel it should be distinct
[16:50:17] i'll try to catch up with gwicke later on thoughts & prefs & requirements
[16:50:23] brion: well... one argument for handling them like slots is to allow extensions that don't want to think about storage to associate arbitrary derived data with a revision.
[16:51:34] *nod* say, graphs that should store data not in the page props ;)
[16:51:41] still a bit torn on that.
in order to be future proof, we should probably have some kind of flags field in the db schema for slots, so we can filter primary vs secondary or whatever... or would the slot name be sufficient?
[16:51:45] lots of questions :)
[16:52:18] the big question about names/registration/marking is what happens if you disable an extension that stored data
[16:52:33] brion: do you have a good name for the associated, virtual, non-slot data? i want to add a getter to RevisionContentLookup. it would be read-only for now i guess.
[16:52:48] i want to say 'metadata', though that might be overloaded :)
[16:53:02] oh, if there is no handler for a given slot, it would just be ignored, i guess
[16:53:26] we might end up having meta-metadata, such as the size of the html :)
[16:53:43] haha
[16:53:50] "secondary"? "associated"? "auxiliary"?
[16:54:01] hyperdata? :D
[16:54:18] could also go 'attachment'
[16:54:20] or 'attached data'
[16:54:24] let's not get into fractal data storage
[16:54:33] though that could confuse with things like per-page images :D
[16:55:04] "attachment" is actually a term i have been using to describe extra primary slots. attachments are user-supplied and editable.
[16:55:17] fair enough
[16:55:45] maybe just getDerivedData / setDerivedData
[16:56:09] use 'derived' but avoid 'slot'?
[16:56:11] * brion hmms
[16:58:50] it's not necessarily derived
[17:05:59] have to go poke the cat briefly, back in a bit
[17:10:02] brion: i'll go through your comments and your mail. then i'll go home and eat, will be back online around 22h CEST.
[17:14:31] excellent
[19:39:58] * robla reads the backlog discussion and is happy that it happened, even though he doesn't entirely understand it yet.
[20:22:17] brion: here's the alternative approach to the slot loading interface that i was experimenting with a year ago:
[20:22:18] https://gerrit.wikimedia.org/r/#/c/246433
[20:22:34] there's a few more patches that depend on this one
[20:22:40] I'm curious what you think of them
[20:22:55] i'm thinking of folding that code into the current refactoring effort
[20:23:33] brion: btw - for code experiments, I like to do One Big Patch, but for detailed review, I'd suggest splitting it into several parts
[20:27:20] *nod*
[20:41:20] replying to mails
[21:08:25] replied :D
[21:08:55] lemme update on the comments on the rev
[21:26:18] DanielK_WMDE: ok made a couple brief replies
[21:55:19] ok i gotta take the cat in for one more quick test, then i'll do some more passes on documentation & mailing list checkins over the weekend
[22:00:53] brion: best wishes to the cat
[22:04:35] James_F: here's the wikitext spec conversation I was talking about: [CommonMark thread on "Talk:Requests_for_comment/Markdown"]()
[22:04:51] robla: Thanks, will read.