[07:18:39] _joe_: yeah, I know. Unfortunately the debian upstream way of building k8s is not that much better than ours, I guess. Although they work around all that docker build stuff (which is pretty nice), I fear it costs on other fronts (like having golang 1.15 and keeping up with internal changes of the build system).
[07:19:14] <_joe_> jayme: I don't think they are doing that anymore (working around the docker build stuff)
[07:19:38] <_joe_> but it was more of a validation of our approach tbh
[07:20:11] <_joe_> I think debian missed *big* on golang and similar languages by choosing to try to fit them into their model, rather than building tooling to handle statically linked languages
[07:23:27] _joe_: I was looking at it for a bit on Saturday and I think they still do (work around docker)
[07:23:40] <_joe_> meh, anyways
[07:23:54] <_joe_> it was just to add a datapoint to the discussion
[07:24:17] 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard per DC to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) >>! In T252391#6592606, @jijiki wrote: > * `mc2036.codfw.wmnet` has been reimaged to buster without redis-server...
[07:24:27] Yeah, got that. Was about to post something on the ticket, but it seems I missed out on actually clicking the submit button :D
[07:26:22] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Build kubernetes 1.19 - https://phabricator.wikimedia.org/T266766 (10JMeybohm) An additional alternative could be to switch to a build without docker, like the debian upstream does. That would maybe require us to backport golang-1.15 but wou...
[07:26:35] there it is... ;)
[07:26:58] <_joe_> lol
[07:27:17] <_joe_> anyways, it seemed to me we all agree about the road forward, right?
[07:27:49] for k8s, yeah. For calico I'm unsure
[07:28:44] https://phabricator.wikimedia.org/T266893 that is
[07:28:53] <_joe_> you're not sure it's not worth building the debian packages?
[07:30:17] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Build calico 3.16 - https://phabricator.wikimedia.org/T266893 (10Joe) I typically prefer if we rebuild images from dockerfiles, using our base images. That gives us a tad more control over upgrading in case of a disastrous security hole in e.g. alpine linux.
[07:30:29] I'm not sure if we can properly use the "binary" release from calico as they don't even provide hashes
[07:42:35] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Build calico 3.16 - https://phabricator.wikimedia.org/T266893 (10JMeybohm) >>! In T266893#6594832, @Joe wrote: > I typically prefer if we rebuild images from dockerfiles, using our base images. That gives us a tad more control over upgrading in case of a disaste...
[10:37:09] hello hello - while reading T252391 I started thinking about the target version of memcached on Buster, which by default is 1.5.6, but we are running 1.6.6 on the IDP nodes. Maybe we could think about moving directly to 1.6 so we get new goodies like native TLS support?
[10:37:27] The gutter pool is running 1.5.6 atm
[10:53:30] I was thinking about one thing at a time, so first get both hosts on buster
[10:53:38] then enable the 1.5.x features
[10:53:49] and then try 1.6
[10:54:05] how does that sound?
[11:00:30] <_joe_> it sounds like twice the work
[11:01:15] <_joe_> elukey: newer memcached can support TLS natively?
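
(editor's note: a minimal sketch of what "native TLS" means here - speaking the ordinary memcached text protocol over a TLS socket, with no stunnel-style proxy in between. It assumes a memcached 1.6 build started with TLS enabled, e.g. compiled with --enable-tls and run with -Z plus ssl_chain_cert/ssl_key options; the hostname, port, and CA path below are hypothetical, and only stdlib Python is used.)

    import socket
    import ssl

    # Hypothetical endpoint: a memcached 1.6 instance with TLS enabled (-Z).
    HOST, PORT = "mc1036.eqiad.wmnet", 11211
    CA_CERT = "/etc/ssl/certs/wmf-ca.pem"  # hypothetical CA bundle path

    context = ssl.create_default_context(cafile=CA_CERT)

    with socket.create_connection((HOST, PORT)) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
            # Same old memcached text protocol, just carried over TLS:
            tls_sock.sendall(b"version\r\n")
            print(tls_sock.recv(4096).decode())  # e.g. "VERSION 1.6.6"
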
[11:01:23] <_joe_> that would remove the need for the proxies
[11:01:27] <_joe_> in mcrouter
[11:01:32] <_joe_> that would be *great*
[11:03:04] our plates are already full, I think we should take it one step at a time
[11:04:23] and with the upcoming covid situation, it is unknown how this will be in December
[11:06:49] in my opinion our number 1 priority is to have all those hosts upgraded to buster as soon as possible; after this is done, we can upgrade to 1.6, if there is time left for it this quarter
[11:52:26] +1 to that, fully exploring the new 1.6 features at the scale of mc* (like restarts without voiding the cache, or TLS) will take some time by itself; the footprint of memcached on the IDPs is really small
[11:56:34] <_joe_> I wasn't proposing to explore them all on day one, FTR
[11:56:58] <_joe_> anyways, I'm not doing the work myself, so whatever you feel is better
[12:27:25] _joe_ yep, TLS native
[12:28:21] the value added in running 1.6.x is that any issue we encounter on mc* is easier to follow up and fix with upstream, rather than always lagging behind
[12:29:33] effie: about the 1.5.x features - we need to enable them straight away, we cannot keep running 1.4.x settings (we tried in the past and it was a mess)
[12:29:58] one above all: the maximum number of slab classes shrinks to 64 (we use around 160/180 for each shard now)
[12:29:58] oh sure, we will do so this week hopefully
[12:30:05] for the 1.5 features
[12:30:08] ack :)
[12:32:18] we can start with the gutter pool settings anyway, since we have tried those settings
[12:33:12] while we had a dead mc2* host, the gutter pool handled its traffic, and I don't recall any issue surfacing there, so it is a good step 1
[12:34:19] <_joe_> so my biggest doubt is that the gutter pool has a different ram size than the actual memcacheds
[12:34:32] <_joe_> also a very short TTL IIRC
[12:34:40] <_joe_> so you're not really testing slab allocation
[12:34:54] <_joe_> and our experience tells us that every minor release changes everything
[12:35:14] one thing that I didn't mention in the task (going to do it now) is that without redis etc. we can expand the memory allocated to memcached; it is 90G IIRC now
[12:35:21] on the host we have 128G of ram
[12:35:25] <_joe_> in short: I highly doubt that whatever configuration we have on the failover hosts will work as-is
[12:35:48] <_joe_> which means you're planning to re-do that work twice
[12:35:52] yeah, it will need some tuning; I am worried a little bit about shard distribution (growth factor etc.)
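
(editor's note: to make the slab-class worry concrete - memcached carves memory into size classes that grow geometrically by the growth factor (-f), and 1.5.x caps the number of classes at 64, versus the ~160/180 per shard mentioned above. A sketch of the arithmetic, assuming an illustrative 96-byte first class and 1 MiB item size limit - not the production values:)

    def slab_classes(growth_factor, first_class=96, item_max=1024 * 1024):
        """Approximate the chunk sizes memcached creates for a growth factor.

        Sizes grow geometrically, rounded up to 8-byte alignment, until the
        item size limit. Illustrative only: real memcached also accounts for
        item header overhead and build-time class limits.
        """
        sizes = []
        size = first_class
        while size < item_max:
            sizes.append(size)
            # next class: grow by the factor, align up to 8 bytes, always advance
            size = max(size + 8, int(size * growth_factor) // 8 * 8 + 8)
        return sizes

    # The default factor of 1.25 stays well under the 64-class cap of 1.5.x:
    print(len(slab_classes(1.25)))  # ~40 classes
    # A fine-grained factor like 1.05 produces well over 100 classes, which is
    # roughly how a shard ends up far beyond the new 64-class limit:
    print(len(slab_classes(1.05)))
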
[12:35:55] <_joe_> hence my comment "double the work"
[12:36:11] <_joe_> but again, I'm trying to help, not to impose my view
[12:36:39] <_joe_> this is all coming from the experience elukey and I accumulated over the years of memcached releases
[12:42:00] we have to start with something and then start tuning
[12:44:59] <_joe_> that holds valid whether you start with 1.5 or with 1.6
[12:45:08] <_joe_> without all the fancy 1.6 stuff, even
[12:45:16] <_joe_> that's my argument
[12:45:45] <_joe_> anyways, I'm going to lunch, ttyl
[12:46:06] 1.6's new things are mostly opt-ins (TLS, NVMe support, better restart, etc.); the rest is basically 1.5.x
[12:46:14] but I agree that a lot of code has changed
[16:48:10] 10serviceops, 10Operations, 10ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH)
[17:19:25] 10serviceops, 10Push-Notification-Service, 10Product-Infrastructure-Team-Backlog (Kanban), 10User-jijiki: High latency on push notification service initialization - https://phabricator.wikimedia.org/T265258 (10MSantos) a:03MSantos
[17:53:09] 10serviceops, 10DNS, 10Operations, 10Traffic, 10Services (watching): nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818 (10Aklapper) 05Stalled→03Open The previous comments don't...
[19:04:30] 10serviceops, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services): Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for...
[22:32:00] 10serviceops, 10Operations, 10Performance-Team, 10Traffic, 10Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (10Dzahn) .
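
(editor's note: on the 12:46 remark that the 1.6 features are opt-in - one way to see what a given shard actually has enabled is memcached's "stats settings" command. A sketch with stdlib Python against a hypothetical host; the exact key names reported, such as ssl_enabled, vary by memcached version and build options.)

    import socket

    def stats_settings(host="mc1036.eqiad.wmnet", port=11211):
        """Fetch a memcached instance's runtime settings via 'stats settings'."""
        with socket.create_connection((host, port), timeout=3) as s:
            s.sendall(b"stats settings\r\n")
            buf = b""
            # The stats response is a series of "STAT key value" lines ending in END.
            while not buf.endswith(b"END\r\n"):
                chunk = s.recv(4096)
                if not chunk:
                    break
                buf += chunk
        settings = {}
        for line in buf.decode().splitlines():
            if line.startswith("STAT "):
                _, key, value = line.split(" ", 2)
                settings[key] = value
        return settings

    cfg = stats_settings()
    # Which opt-ins and tunables are actually set on this shard?
    # (key names are assumptions here and differ across versions)
    for key in ("ssl_enabled", "memory_file", "growth_factor", "maxbytes"):
        print(key, "=", cfg.get(key, "<not reported>"))
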