[07:41:26] 10serviceops, 10Operations, 10Thumbor, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10MoritzMuehlenhoff) There should also be a number of additional test cases in Phab: https://phabricator.wikimedia.org/tag/wikimedia-svg-rendering/ IMHO it m...
[08:26:19] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) > the request to index.php is conditionally routed directly to the SSR service. In our world, the SSR service is there, so we configure...
[10:31:32] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) Let me state it again: the SSR service should not need to call the mediawiki api. It should accept all the information needed to render the term...
[10:36:42] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) >>! In T212189#4838090, @Addshore wrote: > > The "termbox" is more of an application than a template. > Only it knows which data it needs - act...
[10:40:33] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) Also, if we're going to build microservices, I'd like to **not** see applications that "grow", at least in terms of what they can do. A microser...
[10:48:31] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) >>! In T212189#4839848, @WMDE-leszek wrote: > The intention of introducing the service is not to have a service that call Mediawiki. As discuss...
[11:09:03] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) To avoid misunderstandings: I was not questioning MediaWiki's action API being performant. By "lightweight" I was referring to "PHP has...
[11:13:08] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) >>! In T212189#4840039, @WMDE-leszek wrote: > To avoid misunderstandings: I was not questioning MediaWiki's action API being performant. By "lig...
[11:16:39] <_joe_> ok now that I'm done spamming that task
[11:17:00] <_joe_> I was thinking of tied services like zotero and citoid, or apertium and cxserver
[11:17:24] <_joe_> in the context of kubernetes, might it make sense to bind them in a single pod?
[11:17:54] <_joe_> so have a single deployment which includes both containers, speaking to each other only locally?
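(Editor's note: a minimal sketch, not from the discussion itself, of what the single-deployment idea could look like; the image names, ports and probe path below are assumptions, not the real citoid/zotero settings. Both containers share the pod's network namespace, so citoid reaches zotero over localhost only, and since zotero defines no probe of its own, citoid's readiness probe is what gates readiness for the whole pod.)

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: citoid-zotero              # hypothetical combined deployment
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: citoid-zotero
      template:
        metadata:
          labels:
            app: citoid-zotero
        spec:
          containers:
          - name: citoid
            image: registry.example/citoid:latest    # placeholder image
            env:
            - name: ZOTERO_URL
              value: "http://localhost:1969"         # zotero reached over localhost only
            ports:
            - containerPort: 1970                    # assumed citoid port
            readinessProbe:
              httpGet:
                path: /_info                         # assumed health endpoint
                port: 1970
          - name: zotero
            image: registry.example/zotero:latest    # placeholder image
            ports:
            - containerPort: 1969                    # assumed zotero port, never exposed via a Service

The catch raised next in the discussion is scaling: a deployment like this scales both containers together, so their replica counts can never diverge.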
[11:18:08] <_joe_> I can see arguments both ways
[11:25:21] Let's discuss the arguments; in my case I see clearly that each service should be a different pod
[11:25:34] Specifically if you take into account sidecars
[11:25:59] It's also relevant for scaling policies
[11:26:09] Services may scale differently
[11:31:22] <_joe_> so, monitoring sidecars would ofc become more complex, and that's manageable but it was one of the arguments against it
[11:31:37] <_joe_> as far as "services may scale differently"
[11:32:04] <_joe_> apertium and zotero only exist to be the backend of cxserver and citoid respectively
[11:32:21] <_joe_> so you could balance the number of workers in each to withstand the same load
[11:33:22] <_joe_> advantages include having fewer moving parts between the two components, lower latency, less complexity in the configuration of e.g. the load balancers
[11:35:03] <_joe_> it would solve the issue of zotero having no readiness probe available - if zotero doesn't work, the citoid instance in the same pod won't either, and so the whole pod will be marked not ready as citoid (which can be health-checked) will be marked not ready
[11:36:21] <_joe_> so, there are some advantages; as disadvantages i see mostly the added complexity of the deployment and the fact it would be an oddball compared to all other deployments
[11:37:16] <_joe_> let me give you an example - suppose you had a pool of X servers, and a service that needs to connect to those
[11:37:23] <_joe_> to render some graphics
[11:37:58] <_joe_> would you create an X server deployment, and an application deployment, separated from each other?
[11:38:32] <_joe_> (before you ask: yes, we have a service that uses an x server)
[11:46:31] No, in that case it makes sense that the X server lives in the same pod as the render application
[11:47:10] However I would argue that we should aim for simplicity and that means not deploying a whole stack of applications in the same pod
[11:47:48] <_joe_> I agree in general, I'm just trying to find arguments against the idea that convince me
[11:47:50] <_joe_> :P
[11:47:53] I can concede that maybe zotero/citoid makes sense because they are in the end the same feature, but still I don't see many disadvantages to having them split
[11:48:05] Since their deployment lifecycles are different
[11:48:08] i'd ask if the apertium/zotero backends can do more per instance than cxserver/citoid? if there's a mismatch in req/sec between both (so it wouldn't be a 1:1 relation to provide the needed performance), from the point of view of starting only what's necessary to fulfill the workload... also, would memory usage be a concern for each of those components?
[11:48:09] <_joe_> yeah my suggestion was limited to those two cases
[11:48:24] as vo.lans said, /me got nerd-sniped
[11:48:45] <_joe_> gtirloni: so in the case of citoid/zotero, that's almost 1:1
[11:49:04] <_joe_> and zotero is much more expensive than citoid in terms of resource consumption
[11:49:09] <_joe_> so one counterpoint I see
[11:49:14] interesting, strengthens the single pod argument
[11:49:21] <_joe_> not necessarily
[11:49:48] <_joe_> actually I think in the case of zotero/citoid, you can consider citoid as a sidecar container of zotero
[11:49:52] <_joe_> that proxies requests
[11:49:54] <_joe_> more or less
[11:50:02] ah, got it
[11:50:20] <_joe_> so adding citoid to the deployment vs adding a sidecar that provides a readiness probe for zotero
[11:50:26] <_joe_> ... looks pretty similar to me
[11:50:45] <_joe_> not sure that holds at all for cxserver/apertium
[11:51:24] <_joe_> so I would concur with you gtirloni that if we're not in the same kind of situation, using a single deployment won't work
[11:52:01] <_joe_> but I don't know the details of apertium / cxserver well enough, akosiaris might have an answer though
[11:54:06] I think zotero/citoid might be an exception since they are too coupled
[11:54:20] I know next to nothing about these services, so the scenario that I was entertaining was a need to increase capacity quickly for a service that wasn't doing enough req/sec but also dragging along a secondary service that could be using too much memory and just making things worse.. but that's all hypotheticals and extremes.. sometimes the less complex deployment of a single pod would justify itself
[11:54:35] * akosiaris reading backlog
[11:55:03] Having different deployments makes it easier to debug these services and reason about them
[11:55:16] <_joe_> I want to be clear, I'm just trying to explore an idea, not advocating for it
[11:55:41] same here :) i find the discussion very interesting
[11:57:57] both citoid and cxserver can and do work independently of zotero/apertium
[11:58:08] the latter are just one of the possible backends for the former
[11:58:31] and a failure is considered fine in the former as citoid is able to fill in basic data on its own
[11:58:49] so, no, the shared health sidecar approach would not work
[12:00:15] in the case of cxserver vs apertium, cxserver also talks to yandex, youdao, google translate and so on
[12:00:24] so it's not a 1:1 request ratio there either
[12:00:25] and if citoid acts as a health probe for zotero, you will kill citoid even when the only faulty part is zotero
[12:00:34] fsero: exactly!
[12:01:01] google translate is not yet deployed btw, it's under security review AFAICT
[12:02:32] <_joe_> akosiaris: isn't citoid a very thin layer on top of zotero, basically?
[12:02:55] <_joe_> or does that happen only for wiki citations?
[12:08:22] it can function without zotero
[12:08:34] with greatly reduced functionality, but it still can
[12:08:41] btw https://phabricator.wikimedia.org/T207200#4840158
[12:08:58] _joe_: fsero ^
[12:09:17] functionally, the 2 approaches have pretty much the same capabilities for what we currently care about
[12:10:23] fluent-bit has some extra niceties like being able to send cpu usage, memory usage and disk usage to logstash
[12:10:31] but we already send those to grafana
[12:10:56] <_joe_> fluent-bit is the C daemon
[12:11:19] yeah, fluentd is the ruby+c one and it essentially offers aggregation on top of fluent-bit
[12:11:39] <_joe_> ok one thing that's not clear to me is
[12:11:50] <_joe_> how are we going to deal with existing applications?
[12:11:55] https://fluentbit.io/documentation/current/about/fluentd_and_fluentbit.html
[12:12:10] existing applications?
[12:12:16] <_joe_> yes
[12:12:24] care to elaborate?
[12:12:49] <_joe_> the idea was to collect the logs that our applications spit out (now posting them to logstash directly)
[12:13:01] <_joe_> and have a more robust way to aggregate those
[12:13:03] if current applications log to stdout
[12:13:07] <_joe_> no they don't
[12:13:13] they can be configured to
[12:13:17] they do support it
[12:13:20] fluentbit/fluentd will collect the log output and send it to several backends
[12:13:22] including ELK
[12:13:23] <_joe_> oh service-runner does?
[12:13:27] yes ofc
[12:13:33] <_joe_> fsero: yes, ofc
[12:13:35] in fact it can both send to logstash AND log to stdout
[12:13:36] <_joe_> ok wait a sec
[12:13:47] <_joe_> let's recap a bit
[12:14:00] <_joe_> rsyslog case
[12:14:07] <_joe_> what happens to the logs of mathoid
[12:14:18] the same thing happens in both cases
[12:14:20] <_joe_> we log to stdout, that goes to disk?
[12:14:27] yes, /var/log/containers/*.log
[12:14:28] <_joe_> then rsyslog picks it up?
[12:14:39] the exact same files, and both solutions do the exact same thing
[12:14:40] <_joe_> ok, my idea was to avoid touching the disk
[12:14:49] no, that breaks kubectl logs
[12:15:03] which is useful to have
[12:15:09] also kubelet does pretty sane log gc
[12:15:19] <_joe_> so the app logs to fluent-bit, which relays the data to {rsyslog,fluentd}, which manages it
[12:15:21] it's docker if i'm not wrong akosiaris
[12:15:30] (could be wrong)
[12:15:40] fsero: yeah it's docker
[12:15:43] <_joe_> writing them to disk if needed, and posting them to kafka
[12:16:04] _joe_:
[12:16:05] <_joe_> having rsyslog tail files on disk is a bit backwards
[12:16:06] _joe_: no wait
[12:16:08] not exactly
[12:16:15] the app logs to stdout
[12:16:29] kubelet (via the docker json log driver) saves them to disk
[12:16:39] <_joe_> ok yes that I know
[12:16:46] fluent-bit or rsyslog reads them using inotify
[12:16:59] enriches them with kubernetes API metadata
[12:17:02] sends them to kafka
[12:17:29] it's the exact same pattern in both cases, no need to mix and match them
[12:17:29] <_joe_> no that wasn't my plan, fluent-bit can accept logs via tcp and enrich them, which was what I wanted to do
[12:17:44] <_joe_> specifically to avoid any too-chatty application from creating i/o pressure
[12:17:55] that breaks kubectl logs however
[12:18:03] <_joe_> I'm ok with the change, for this reason ^^
[12:18:33] <_joe_> but ideally I'd like kubelet to forward the logs to rsyslog
[12:18:39] <_joe_> and have rsyslog write to disk :)
[12:18:45] that's not supported
[12:18:52] <_joe_> yeah I just saw the docs
[12:18:54] <_joe_> sigh
[12:19:05] and it's not kubelet's fault btw
[12:19:11] i don't think we have such chatty applications
[12:19:15] regarding logs
[12:19:23] it's the container runtime engine that's responsible for that
[12:19:26] namely docker in our case
[12:19:38] <_joe_> fsero: the only really chatty one is parsoid
[12:19:59] <_joe_> and well, mediawiki
[12:20:12] note btw that kubelet forcefully starts docker containers with log-driver=json-file
[12:20:23] although docker does support log-driver=syslog
[12:20:30] <_joe_> yeah I was about to say
[12:20:53] <_joe_> that's why I assumed it was possible
[12:21:04] you can change that for a given kubelet
[12:21:15] but that's added complexity
[12:21:22] IMO
[12:21:26] AND breaks kubectl logs
[12:21:29] yes
[12:22:06] also I think you can't change it
[12:22:12] I don't see any setting for it
[12:22:45] <_joe_> akosiaris: so what you're saying is we don't need the logging sidecar in the pods
[12:22:50] <_joe_> that's great
[12:22:57] ah yes, I was about to point that out
[12:23:06] <_joe_> and in that case, I'd probably use rsyslog?
[12:23:07] but you need to maintain the daemonset
[12:23:17] <_joe_> just not to add another thing
[12:23:26] <_joe_> fsero: only if we use fluentbit AIUI?
[12:23:27] fsero: that's why I am leaning like _joe_ towards just using rsyslog
[12:23:33] to keep complexity low
[12:23:39] at least for starters
[12:24:00] if we find out at some point that fluent-bit/d offers something we can't do in rsyslog, the architecture is the same
[12:24:04] we'll only change that component
[12:24:24] <_joe_> and for very chatty apps we don't want to log on disk
[12:24:30] <_joe_> we can think of the sidecar
[12:24:31] the caveat is that all of the kubernetes community is more or less using fluent-bit and similar approaches
[12:24:52] _joe_: very chatty apps can be configured to log directly to logstash
[12:24:58] i don't really see deploying fluentbit/fluentd as too much hassle
[12:24:59] <_joe_> akosiaris: via kafka?
[12:25:03] but i get your point
[12:25:13] however i do appreciate the flexibility it could offer
[12:25:28] I did like the fluent-bit syntax and docs way better btw
[12:25:40] <_joe_> fluent-bit is a nice small tool
[12:25:54] rsyslog docs are a weird mess with a lot of technical debt
[12:26:07] thing is i would use rsyslog for cluster components, the apiserver, the scheduler and the kube-system namespace
[12:26:10] it did take me a considerable amount of time to understand the 3 config formats
[12:26:13] and i would use fluentbit for applications
[12:26:22] that way if we have a chatty application it doesn't break cluster logging
[12:26:24] but at least between me and godog now we do have some experience
[12:26:52] fsero: rsyslog does support queues
[12:27:01] and prioritizing one over another
[12:28:33] main_queue(
[12:28:34] queue.workerThreads="4"
[12:28:34] queue.dequeueBatchSize="1000"
[12:28:34] queue.size="10000"
[12:28:34] )
[12:28:42] This would be a small in-memory queue of 10K messages, which works well if Elasticsearch goes down, because the data is still in the file and rsyslog can stop tailing when the queue becomes full, and then resume tailing. 4 worker threads will pick batches of up to 1000 messages from the queue, parse them (see below) and send the resulting JSONs to Elasticsearch.
[12:28:48] from https://www.rsyslog.com/tag/parsing/ btw
[12:29:06] and there are even disk-assisted memory queues per those docs
[12:29:23] and that's before we even send to kafka, which also supports spooling stuff
[12:30:28] i think the selling point for fluentd/fluentbit is enriching
[12:30:46] enriching with what beyond kubernetes metadata?
[12:31:21] well you can always add more metadata :P e.g. on the rsyslog output i don't see the cluster name
[12:31:32] we can infer that because the server that sends the log is kubernetes1001
[12:31:36] so eqiad cluster
[12:31:42] but you might want to add that to the logs
[12:32:08] that's doable with rsyslog too however.
[12:32:38] really, I think functionality-wise they are more or less equivalent. fluent-bit's selling point for me is its better-structured and clearer documentation
[12:32:48] and the fact that the kubernetes community seems to use it way more
[12:33:10] fluentd is the de facto standard, yes
[12:34:10] but that is somewhat offset by the work infra foundations is doing. If they can support it well, it is to our benefit to just re-use the solution
[12:34:51] 6 months ago btw I would not have even thought about this comparison. I would have gone fluent-bit/d hands down
[12:35:17] <_joe_> yeah the only reason to do this is to standardize
[12:35:30] <_joe_> else I'd have gone with fluent*
[12:35:57] it's also like 2-3 hours of work now to fully deploy this to all the clusters. They've done all the puppet work as well
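(Editor's note: a minimal sketch of the node-level collection pattern being compared here, shown as a fluent-bit DaemonSet; the host-level rsyslog approach follows the same tail-and-ship shape without the extra DaemonSet to maintain. The name, image and mounts are assumptions, not an actual WMF manifest.)

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: log-collector                        # hypothetical name
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: log-collector
      template:
        metadata:
          labels:
            app: log-collector
        spec:
          containers:
          - name: fluent-bit
            image: registry.example/fluent-bit:1.0   # placeholder image
            volumeMounts:
            # /var/log/containers/*.log are written via the docker json-file driver;
            # the collector tails them with inotify, enriches the records with
            # kubernetes API metadata and ships them to kafka
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: dockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          - name: dockercontainers
            hostPath:
              path: /var/lib/docker/containers

Either way the json-file logs stay on disk untouched, which is what keeps kubectl logs working.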
[12:37:11] i'm a bit unaware of that work, but if we ship logs to kafka from fluentd, that will use the same logging pipeline, right?
[12:38:35] <_joe_> more or less, yes
[12:38:47] <_joe_> just not the added filtering/tagging they might do in rsyslog
[12:40:09] i do also see value in going for fluentd, community support and a more "standard" solution, and i think that filtering/tagging is probably also doable in fluentd
[12:40:16] in any case
[12:40:36] as long as we have logs we can revisit it later, i'm more inclined to fluentd since i'm more comfortable with it
[12:40:46] but let's not get stuck on this
[12:42:55] I don't feel very strongly either. I am leaning towards rsyslog just to reuse the work and standardize
[12:43:50] as long as we all agree that we want kubectl logs :P
[12:44:06] so the logging contract is: you log to stdout, something will take care of it
[12:45:19] yeah agreed on that
[12:51:47] oooh
[12:52:00] well my 2 cents is
[12:52:28] to use rsyslog for a while
[12:53:06] see how this is working out, lay down our requirements
[12:53:15] and revisit the fluentd option
[12:53:22] _joe_: this is probably also the correct place to ask about a baseline nginx image for use with blubber then ;)
[12:53:23] basically what akosiaris said
[12:53:46] <_joe_> addshore: yeah
[12:54:27] <_joe_> addshore: guilty as charged, I didn't even bring it to the team :(
[12:55:32] That's okay, there are always lots of other things for us to be getting on with, but it would be great to decide on a clear path to get it unblocked, even if that is me making a patch for the image and getting review, but i have a feeling the creation of said image might go faster if done by someone who knows the wmf requirements
[13:07:45] <_joe_> well, we don't intend to use a docker image to host static files :)
[13:08:30] <_joe_> but yes, I can make the image do a few smart things
[13:24:07] _joe_: which other services currently follow the antipattern of MW -> service -> MW ?
[13:24:30] <_joe_> lemme see
[13:24:46] <_joe_> all that get called from mediawiki?
[13:24:49] <_joe_> :P
[13:25:24] <_joe_> but you have the alternative antipattern: restbase -> service -> restbase -> other service -> mw_api
[13:26:33] <_joe_> I think cxserver does, for example
[13:27:02] <_joe_> but admittedly that's a big fat service doing a specific job that's not part of serving the user-facing data on each page
[13:28:00] <_joe_> sorry, I'm called to lunch
[13:31:24] (reading scrollback) if fluentbit is easier to manage on the k8s side, a middle ground would be to have fluent be another syslog client and write its (structured?) logs to /dev/log and ship them through the standard pipeline
[13:53:19] godog: that's a possibility but at least for now it looks unnecessarily complex, we should start with rsyslog and reevaluate later :)
[14:28:50] _joe_: that's what I was thinking ;)
[14:29:15] fsero: ack, sounds good to me!
[14:32:10] <_joe_> addshore: but in the case of SSR, it's in the critical path
[14:32:39] <_joe_> and I didn't say that debugging problems with translate is not the stuff of nightmares :P
[14:33:02] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Milimetric) @WMDE-leszek ok, we're on the same page, except the crazy part of my proposal. I was saying **directly** routed to SSR service as in, wi...
[14:33:56] Thinking of it as critical path is interesting, as if the SSR fails, everything will continue working for 99.9% of users, it just won't work for no-js users.
[14:35:08] we will definitely discuss making MW push the data to the service, but as the SSR side of things grows, the data needing to be sent will become more complex, and rather than keeping that complexity in the ssr service, it will then have to be split between the MW php code and the ssr service
[14:52:48] _joe_: I think we are worried as this seems like premature optimization from our side causing more complexity, given the service should be increasing the number of calls to the mw api by no more than 1 a second; it's one of the reasons that we have started just with mobile, to try out SSR in wikibase without scaring everyone
[14:53:14] <_joe_> Yeah, no.
[14:53:28] <_joe_> a premature optimization is adding compression to the http calls
[14:53:55] <_joe_> getting the architecture to be simple, reliable and not having logical loops is definitely not an optimization
[14:54:14] <_joe_> I tried to explain how it's all much more efficient
[14:54:54] <_joe_> but yes, that's only tangential to the basic problem
[14:56:09] <_joe_> which is that you're thinking of SSR as a standalone application, while AIUI it's just a rendering service for data stored in mediawiki
[14:56:24] <_joe_> so the logical way to do it would normally be having the client call your service directly
[14:56:33] <_joe_> and your service do the mw api calls
[14:56:46] <_joe_> if it was a "standalone" service
[14:57:14] <_joe_> if it's not, it needs to be a lambda-like service, or something that needs not to call its caller
[14:58:10] so, that is the longer term goal, but not needed for our current MVP, which is one of the reasons the SSR is currently structured the way it is now, so it looks like we might just have to end up maintaining both cases for the SSR: one route or option for mediawiki to send data to it, but also the other route of allowing the service to call the mw api when the render call comes from a client that is not mediawiki.
[14:59:03] <_joe_> so let me give you an example. Say I wanted you to check the lotto numbers for me. It's more convenient if I tell you "can you check if my lotto numbers are correct? they are 3, 22 and 86"
[14:59:12] <_joe_> or if I tell you
[14:59:48] <_joe_> can you check my lotto numbers? and you answer: "sure, can you read the numbers that came out in the newspaper to me? And then again, which numbers did you pick?"
[15:00:29] <_joe_> anyways, clearly, more context is needed, and I think that should stay on phabricator
[15:01:00] yup, I think the next comments on phab from our side will come on the 3rd, as everyone essentially runs away for vacation from today.
[15:01:18] <_joe_> the 2nd
[15:01:22] thanks for the chat, it's definitely been useful :)
[15:01:27] <_joe_> but yes, that's probable :)
[15:01:36] <_joe_> I'm planning to take some days off, finally
[15:01:46] woo! happy holidays :)
[15:16:13] 10serviceops, 10Operations, 10TechCom-RFC, 10Wikidata, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) @Joe said: > the SSR service should not need to call the mediawiki api. It should accept all the information needed to render the termbox in...
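(Editor's note: a purely illustrative sketch of the two request shapes being argued about, in the spirit of the lotto-numbers example above; the field names are invented and this is not the actual Termbox SSR API.)

    # "pull" shape: the caller sends only identifiers, and the SSR service
    # calls back into the MediaWiki API for everything else it needs
    render_request_pull:
      entity_id: Q42
      language: de
      preferred_languages: [de, en]

    # "push" / lambda-like shape argued for above: the caller supplies all the
    # data needed to render, so the service never has to call its caller back
    render_request_push:
      entity_id: Q42
      language: de
      preferred_languages: [de, en]
      labels:       { de: "Douglas Adams", en: "Douglas Adams" }
      descriptions: { de: "britischer Schriftsteller", en: "English writer" }
      aliases:      { en: ["Douglas Noël Adams"] }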
[16:13:35] 10serviceops, 10Continuous-Integration-Infrastructure, 10Developer-Wishlist (2017), 10Patch-For-Review, and 3 others: Relocate CI generated docs and coverage reports - https://phabricator.wikimedia.org/T137890 (10hashar) All publish jobs now also trigger [[ https://integration.wikimedia.org/ci/job/publish...
[18:01:18] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Clarakosi, 10User-Eevans: Plan/design a session storage service - https://phabricator.wikimedia.org/T206015 (10Eevans)