[06:39:30] serviceops, Prod-Kubernetes, Release-Engineering-Team, Kubernetes: CI pipeline/job to build and release helm chart artifacts - https://phabricator.wikimedia.org/T257333 (JMeybohm) I'm writing a (python) tool that is able to perform the three steps described above in the first place.
[07:40:59] <_joe_> James_F: around? I am not finding the dockerfile we used for the mediawiki/core images
[07:41:26] <_joe_> context is I'm trying to collect some ideas about how to run mediawiki on k8s
[07:53:57] <_joe_> jayme, akosiaris do you know if it's possible to ask kubernetes to run pods in a numa-aware way?
[07:54:06] <_joe_> so every container and its memory are tied to one numa zone
[07:55:57] There is this topology manager thing which might be able to do that, _joe_
[07:56:23] https://kubernetes.io/blog/2020/04/01/kubernetes-1-18-feature-topoloy-manager-beta/#so-what-is-numa-and-why-do-i-care
[07:56:29] <_joe_> context is I want php-fpm to run numa-aware given how much mmapped memory it uses
[07:56:36] <_joe_> oh great
[07:56:55] so the answer is probably: Yes, but not in our case :-/
[07:57:46] I have not tried it, I must say. Just remembered that I had read about it
[07:57:59] <_joe_> so ok we need k8s 1.18
[07:58:03] <_joe_> we can get there :)
[08:03:05] The dockerfile for core is in core.
[08:03:16] <_joe_> ack thanks :)
[08:04:12] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/.pipeline/blubber.yaml
[08:04:35] <_joe_> James_F: as soon as I've written down a strawman I'll ask for your feedback and probably we should start planning work inter-team
[08:04:53] Excellent. (Though I’ve rolled off the EngProd team.)
[08:12:57] James_F: it's official?
[08:12:59] where to?
[08:13:08] congrats btw!
[08:13:30] * akosiaris wasn't aware of NUMA in kubernetes at all
[08:13:46] Yeah, Abstract Wikipedia.
[08:14:32] oh, lovely! So this moves forward faster than I anticipated.
[08:25:16] Well, we'll see.
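[Editor's note] The Topology Manager linked above (beta in k8s 1.18) only NUMA-aligns pods in the Guaranteed QoS class, and needs the static CPU manager policy enabled on the kubelet. A rough sketch of what that could look like for the php-fpm case — the pod name, image, and resource figures are invented for illustration:

```yaml
# Kubelet configuration fragment: the static CPU manager policy is a
# prerequisite for single-NUMA-node alignment.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
---
# Only pods in the Guaranteed QoS class get aligned placement:
# requests must equal limits, with integer CPU counts.
apiVersion: v1
kind: Pod
metadata:
  name: php-fpm-numa-demo        # hypothetical name
spec:
  containers:
  - name: php-fpm
    image: example/php-fpm:latest  # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi
```

With `single-numa-node`, the kubelet rejects (Topology Affinity Error) pods whose CPU and memory hints cannot be satisfied from one NUMA zone, which matches the "container and its memory tied to one numa zone" requirement above.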
[08:25:32] Lots of fundamental "what are we building exactly?" questions.
[09:04:12] oh my
[09:04:20] who's taking your spot? and congrats on the move
[09:28:08] apergos: No back-fill, because I was only seconded to RelEng, supernumerary staff rather than an EngProd-specific req.
[09:28:19] oh my
[09:28:23] But I'll stick around to help out with whatever explodes.
[09:28:30] that is good to hear!
[09:28:39] The first job is to keep production up, after all.
[09:28:44] :-)
[09:41:01] Pchelolo: I love it when a plan comes together!
[12:23:13] o/ old team
[12:23:23] i was reading this https://about.gitlab.com/blog/2019/07/23/anomaly-detection-using-prometheus/ and i thought this might be interesting for you
[13:49:30] _joe_ and elukey when you have time
[13:50:04] rzl and I were wondering what the urgency of on-host memcached is
[13:50:20] <_joe_> what do you mean?
[13:50:46] as in, does it sound ok if we have this implemented on mwdebug* this Q
[13:50:52] <_joe_> fsero hey, yes I am dubious about such stuff but it's interesting
[13:51:16] while aaron and timo are working on the mediawiki side of this
[13:51:32] we can work on the config and puppet side of this
[13:51:49] and aim to have this on mwdebug* by the end of Q1
[13:52:13] and aim for a rollout in Q2, if the mediawiki part is done
[13:53:19] i'm wondering why dubious, _joe_? i suspect you don't believe in the statistical background? :)
[13:53:38] i've heard from some people who have applied it and they are quite happy with the results
[14:00:41] <_joe_> no I mean I do believe in the background
[14:00:50] <_joe_> I think it's complicated to get a good S/N
[14:00:54] <_joe_> but I'll read that post
[14:33:07] what happened to the SRE/CPT sync today?
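[Editor's note] The GitLab post linked above is built on a simple z-score: compare a series' current value against its rolling mean and standard deviation. A hedged PromQL sketch — the recording-rule name and labels here are made up for illustration, not real metrics:

```promql
# z-score of the current request rate against a one-week baseline;
# job:http_requests:rate5m is a hypothetical recording rule.
(
  job:http_requests:rate5m{job="appserver"}
  - avg_over_time(job:http_requests:rate5m{job="appserver"}[1w])
)
/ stddev_over_time(job:http_requests:rate5m{job="appserver"}[1w])
```

An alert would then fire when the absolute z-score stays above roughly 3 for several minutes — which speaks to _joe_'s S/N concern below: the threshold and window are exactly where the signal-to-noise tuning happens.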
[14:35:52] I am wondering the same thing
[14:36:52] it clashes with the tuning thing
[14:41:01] yeah I think that's it
[15:26:38] <_joe_> yes
[15:26:55] <_joe_> and will told me you just had to point us to your envoy work hnowlan
[15:27:01] <_joe_> which I was already aware of :P
[16:36:10] effie: sorry for the lag - so my 2c: I think that a rollout in Q2 is totally ok. The urgency really depends on the SRE goals in my opinion, namely what level of reliability etc. we want to achieve and when. Currently the gutter pool seems an ok solution even under severe scenarios (even if not completely ideal, of course)
[16:37:16] and as you mentioned there is a mediawiki part that will probably be long, so even Q2 seems a very good/optimistic timeline
[16:38:20] I would, personally, try to configure the "proxy-gutter" this Q
[16:41:08] rzl: your thoughts?
[16:41:23] regarding proxy-gutter
[16:47:11] re proxy-gutter I don't think I know enough to have strong feelings, certainly defer to the two of you
[16:52:31] maybe we can ping timo and aaron about it
[16:52:50] ok, let me try and find the appropriate task
[18:05:25] rzl: do you have time to further discuss the on-host memcached goal?
[18:06:32] sure
[18:07:51] rzl: effie and I were chatting about how much of on-host memcached we can implement without mediawiki changes.
[18:08:34] rzl: https://docs.google.com/document/d/1wY4qcryJKTSg2TxoEqXoTo0hWwVT-5Gq_kJwn3SIPls/edit says there are 2 ways for implementation.
[18:08:44] rzl: mcrouter and mediawiki.
[18:09:00] rzl: have we decided on which way?
[18:09:01] right -- we discussed that with-- ha, speak of the devil :)
[18:09:16] we discussed that with Krinkle and Aaron, and we decided the least risky approach is the one that involves MW changes
[18:09:56] got it.
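[Editor's note] The "mcrouter" option mentioned above would live entirely in routing config; mcrouter also expresses gutter-style failover declaratively, which is the mechanism behind the gutter pool discussed here. A minimal illustrative config — the server addresses and pool names are invented, and this is not the exact route type production uses:

```json
{
  "pools": {
    "main":   { "servers": ["10.2.0.1:11211", "10.2.0.2:11211"] },
    "gutter": { "servers": ["10.2.1.1:11211"] }
  },
  "route": {
    "type": "FailoverRoute",
    "children": [
      "PoolRoute|main",
      "PoolRoute|gutter"
    ]
  }
}
```

FailoverRoute tries children in order, so requests fall through to the gutter pool only when the main shard is unreachable; an on-host tier via mcrouter alone would instead prepend a localhost pool to the route, which is exactly what makes the MW-side opt-in approach below less risky for keys that must never be cached twice.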
[18:10:03] * Krinkle rewinds the clock for a few minutes and reads backscroll
[18:10:22] the reason being there are a lot of memcache keys that should always bypass the on-host tier, bookkeeping and suchlike
[18:10:30] do we have an idea of when mw will put these changes in?
[18:11:12] the best way to make sure they don't wind up there is to make it opt-in from MW's perspective, but then do the opt-in at an infra level of the code so that we don't have to worry about missing individual codepaths
[18:11:56] I'll let Timo speak to that, since I think it's his team we'll be relying on
[18:12:12] our OKR roughly says we want to do this this qtr, so can we prep the infra for that?
[18:12:15] For things like coordination locks and rate limits, we can't allow blind caching of the cache itself.
[18:12:30] so it has to be opt-in at the infra level
[18:12:48] But it's important that we don't require opt-in from individual developers using cache inside regular MW code
[18:13:30] since it's our purpose to reduce failure risk on the memc infra, by making it impossible/unlikely that a very hot key (possibly not one we predict to become hot) is going to saturate the network
[18:13:56] hmh, if we are confident that this will work, can we still roll out across platform, rather than wait for confirmation on mwdebug?
[18:13:58] it is also equally important that it is not opt-in for individual developers, and that ideally there also be no opt-out.
[18:14:04] (brb)
[18:14:32] the solution we came up with is to manage the opt-in at the abstraction layer in MediaWiki (called WANObjectCache).
[18:14:41] even if we're confident about correctness platform-wide, I think we'll want to do some performance evaluations at mwdebug
[18:14:54] There it will use the local memc tier for all "normal" value keys, unconditionally.
[18:15:20] but for small values about critical metadata, counters and bookkeeping etc. it will work the same as today.
[18:15:49] ok, I think I understand.
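[Editor's note] The scheme Krinkle describes — ordinary value keys flow through a small on-host tier, while coordination data (locks, counters, rate limits) always goes straight to the main cluster — can be modeled in a few lines. This is a toy Python illustration of the routing decision, not WANObjectCache's real API; all names here are invented:

```python
import time

class TieredCache:
    """Toy model: normal keys are served via an on-host tier with a short
    TTL; keys flagged as coordination data bypass it entirely, since we
    cannot allow blind caching of the cache itself for locks/counters."""

    def __init__(self, local, main, local_ttl=10):
        self.local = local          # dict standing in for on-host memcached
        self.main = main            # dict standing in for the main cluster
        self.local_ttl = local_ttl  # seconds a local copy may be served

    def get(self, key, bypass_local=False):
        if not bypass_local and key in self.local:
            value, expires = self.local[key]
            if expires > time.time():
                return value        # possibly slightly stale, by design
        value = self.main.get(key)
        if value is not None and not bypass_local:
            # Populate the on-host tier: safe only for plain value keys.
            self.local[key] = (value, time.time() + self.local_ttl)
        return value

    def set(self, key, value, bypass_local=False):
        self.main[key] = value
        if not bypass_local:
            self.local[key] = (value, time.time() + self.local_ttl)
```

The key property is the one discussed above: a remote update to a normal key is invisible until the short local TTL lapses (bounded staleness), while a `bypass_local` key is always read from and written to the main cluster, so it can never be served from a stale on-host copy.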
[18:16:35] in terms of rollout, we need to get the setup on mwdebug, then test - is it reasonable to assume that this will be done in the next 2 months?
[18:16:53] i.e. summer vacations, covid impacts, etc
[18:18:28] wkandek: you're right, the upper section is outdated. We didn't review that in the last meeting.
[18:22:31] wkandek: the upper section was the agenda for our second meeting, so it's obsolete now.
[18:22:42] the short answer is: neither, but it was useful to explore those two options.
[18:23:22] I'll update the doc to reflect this
[18:23:50] I think the setup on mwdebug and some testing can be done in the next 2 months
[18:24:20] what elukey mentioned earlier that we should consider
[18:24:30] the SRE part certainly can be, I'm interested in a time estimate for the MW part also
[18:24:40] is whether rolling out proxy-gutter to production is more urgent or not
[18:25:01] yeah agree with that, as a question that needs answering
[18:25:17] I wasn't really aware of that work when I was thinking about the on-host OKR
[18:25:24] Krinkle: what is your opinion?
[18:25:42] regarding proxy-gutter: is there a description of the issue?
[18:26:26] wkandek: https://phabricator.wikimedia.org/T244852
[18:27:29] the action plan needs an update to reflect the proxy-gutter rollout
[18:27:38] got it - that looks like pure gutter pool though.
[18:27:51] so we have identified something that was missing?
[18:30:14] in what sense?
[18:31:48] when we planned the gutter pool rollout, we did not include or think of the need to address this proxy issue?
[18:32:14] yes
[18:32:17] we missed that
[18:32:28] luca pointed it out
[18:33:12] how do we capture the work needed?
[18:34:04] I have not checked back on this and it is a bit late here, but I think it is not much work
[18:34:24] but I have not thought this through
[18:36:11] So in terms of the local (on-host) memcached roll-out
[18:37:00] The main things to look out for from our perspective are to understand/measure any added latency or possible increase in error rate.
[18:37:13] I have little reason to suspect either at this point, just being cautious.
[18:37:51] given the control is on the MW side, I believe it would be fine to roll this out at your preferred pace on all app servers in production.
[18:38:08] we could enable it gradually on a per-wiki or per-appserver level from wmf-config
[18:38:41] yeah, that sounds fine to me -- we have a hiera flag too, but we can use wmf-config to control the actual rollout
[18:39:07] yeah, as an emergency switch we will also have a hiera flag that would render the new route inert / identical to the status quo.
[18:42:29] Krinkle: do you have any sense of when the WanCache/MemcBagOStuff changes will be ready?
[18:42:50] couple weeks at most
[18:43:03] oh okay, great
[18:43:04] fairly trivial, just swamped with (even more urgent) incident prevention right now
[18:43:30] yeah understood -- still comfortably this quarter, though, right?
[18:47:17] Yes, I'd expect so.
[18:47:25] 👍