[06:53:39] akosiaris: AIUI the static ip thing does work on a per-pod basis, right? So that would probably not work as expected with deployments or the like
[06:53:44] (https://docs.tigera.io/calico/latest/networking/ipam/use-specific-ip)
[08:13:41] jayme: yeah, I want to test, but it does look like that if we go down that route, it will have to be a few hardcoded pods, not a deployment or anything like that.
[08:14:03] hmm... that does not sound super nice either :/
[08:14:31] worse even than pinning them to apiservers IMHO
[08:14:49] there is no solution that is super nice overall (at least among those we already discussed)
[08:14:56] apiservers?
[08:15:09] ah, k8s control plane
[08:15:17] yes, sorry
[08:16:00] well, neither is nice for sure. Which is worse... good question
[08:26:23] hey folks, qq about dragonfly and docker-registry nodes - shall we upgrade to Bookworm? :D
[08:27:04] dragonfly is probably the quickest
[08:29:20] for the docker registry no idea, but we could probably upgrade the standby dc first, then failover etc..
[08:38:39] dragonfly should indeed be easy - registry could be a mess
[08:40:06] yeah but we are running on EOL, not great
[08:43:36] yes, just stating the obvious - sorry
[08:45:31] no need, I got what you meant :)
[08:46:34] for dragonfly, IIUC we just need to rebuild dragonfly-supernode and then create new VMs
[08:46:49] I fear that we (as in serviceops) don't have time in the near future
[08:47:09] I can try to set aside some time for Dragonfly
[08:47:13] yes re: dragonfly. Although I think you already built the packages, right?
[08:47:23] for bookworm k8s workers
[08:47:25] for the workers IIRC, not the supernode
[08:47:33] IIRC it's all from the same source package
[08:47:51] then I may have not uploaded the supernode deb
[08:48:06] I could also be wrong :)
[08:49:12] I think I would rather reimage the existing VMs with bookworm (outside a mw deployment window)
[08:49:18] /var/cache/pbuilder/result/bookworm-amd64/dragonfly-supernode_1.0.6-2_amd64.deb
[08:49:21] nope you are not :)
[08:49:37] ahh right, sometimes I forget that we can reimage VMs :D
[08:49:42] yes yes, easier then
[08:49:50] no need to change config then, and the clients will fall back to fetching directly in case they need to
[08:50:38] dragonfly-supernode | 1.0.6-2 | bookworm-wikimedia | main | amd64
[08:50:47] see, past Luca is always better than the present one
[08:50:51] I always say that
[08:50:58] seems like somebody did the right thing :)
[08:51:04] ok, then we just need to schedule the reimage
[08:53:48] also found https://phabricator.wikimedia.org/T332011
[08:53:52] I'll try to do it next week
[08:55:11] <3 thanks
[12:37:31] A new alert has been added to show workers that have been cordoned for >=24h: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre&q=alertname%3DKubernetesWorkerUnschedulable
[12:48:19] nice!
[12:53:09] jayme: akosiaris: maybe a daemonset of pods limited to just the control plane servers?
[12:53:28] none of this sounds nice, but I think daemonset+nodeport is sounding nicer at this point :/
[12:54:18] I think we have more than 3 coredns pods, but yes, we could make them prefer the control planes I think
[12:55:42] iiuc the need is to have *some* coredns pods on reliably stable IPs, not all?
[12:55:52] yes
[12:56:24] yes
[12:58:01] I'm probably still waking up and need more coffee, but I actually don't understand what Alex told me yesterday re: Calico doing BGP advertisements for service IPs with externalTrafficPolicy: Local
[12:58:33] if Calico is only advertising the /32 for the service IP from nodes that have a running and ready pod, shouldn't that just work?
[12:59:08] and that avoids SNAT
[12:59:27] (https://docs.tigera.io/calico/latest/networking/configuring/advertise-service-ips)
[13:00:14] yes, maybe. But I think we haven't tested that. And it also relies on ECMP, which would, aiui, lead to unbalanced load
[13:00:28] depending on "where" the caller is
[13:00:32] I don't think that's a problem at this scale
[13:00:48] we only have so many PTRs and they're cacheable
[13:01:02] * jayme nods
[13:09:45] the problem with calico doing BGP advertisements for service IPs is that we don't have graceful restart in calico open core
[13:10:12] so 1 node being rebooted still receives traffic until the BGP timers expire
[13:10:26] and that traffic gets blackholed
[13:10:29] it's DNS, it will retry
[13:10:49] do we have BFD?
[13:12:05] no, it's not in calico open core
[13:12:21] lol
[13:12:23] IIRC there's a task asking exactly for that from yours truly in calico's github
[13:12:37] well, I still think it's worth doing
[13:12:38] I even had a patch
[13:12:50] but it didn't align with their vision (or their business)
[13:13:09] they weren't negative or anything btw. Even suggested an approach
[13:13:09] we don't need an ideal solution, we need something that makes debugging possible without jumping through horrible hoops
[13:15:07] like, really, https://phabricator.wikimedia.org/P67401 is the alternative :)
[13:20:45] !log homer lsw1-b6-codfw* commit 'T372878'
[13:20:45] claime: Not expecting to hear !log here
[13:20:45] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[13:20:49] soz
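
Editor's notes on the options discussed above (all sketches, not the actual configs in use):

For the per-pod static IP option raised at [06:53] and [08:13], the Tigera doc linked there describes a Calico IPAM annotation set on individual pods. A minimal sketch is below; the pod name, namespace, image, and address are placeholders, and the IP would have to fall inside an existing Calico IP pool. Because the annotation pins one specific address, it only fits individually managed pods, not a Deployment whose replicas would all request the same IP.

```yaml
# Sketch of the per-pod static IP approach (cni.projectcalico.org/ipAddrs)
# from the linked Tigera doc. Name, namespace, image and 10.67.0.53 are
# placeholders, not real values from this cluster.
apiVersion: v1
kind: Pod
metadata:
  name: coredns-static-a
  namespace: kube-system
  annotations:
    # Calico IPAM assigns exactly this address; it must sit inside a
    # configured IP pool and not already be in use.
    cni.projectcalico.org/ipAddrs: '["10.67.0.53"]'
spec:
  containers:
    - name: coredns
      image: coredns/coredns:1.11.1
      args: ["-conf", "/etc/coredns/Corefile"]
```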
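The KubernetesWorkerUnschedulable alert announced at [12:37] presumably fires on nodes that stay cordoned for a day. A hypothetical Prometheus rule of that shape is sketched below; the metric, labels, and threshold are assumptions based on kube-state-metrics conventions, not the actual rule in the alerts repo.

```yaml
# Hypothetical sketch of a "cordoned for >= 24h" alert; the real
# KubernetesWorkerUnschedulable rule may use different metrics and labels.
groups:
  - name: kubernetes-workers
    rules:
      - alert: KubernetesWorkerUnschedulable
        # kube_node_spec_unschedulable (kube-state-metrics) is 1 for cordoned nodes.
        expr: kube_node_spec_unschedulable == 1
        for: 24h
        labels:
          team: sre
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} has been unschedulable (cordoned) for more than 24h"
```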
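One shape the daemonset idea from [12:53]-[12:55] could take is a DaemonSet restricted to the control-plane nodes via a node selector plus a toleration for the control-plane taint (the "prefer" variant at [12:54] would instead use a preferred node affinity on the existing coredns Deployment). The sketch below assumes the upstream node-role.kubernetes.io/control-plane label and taint, which may not match the labels used in this cluster; names and image are placeholders.

```yaml
# Sketch of "a daemonset of pods limited to just the control plane servers".
# Labels, taints, names and image are assumptions/placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coredns-controlplane
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: coredns-controlplane
  template:
    metadata:
      labels:
        app: coredns-controlplane
    spec:
      # Run only on nodes carrying the control-plane role label...
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      # ...and tolerate the control-plane taint so the pods can land there.
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: coredns
          image: coredns/coredns:1.11.1
          args: ["-conf", "/etc/coredns/Corefile"]
          ports:
            - name: dns
              containerPort: 53
              protocol: UDP
```

The "+nodeport" half of that idea would then be a NodePort Service in front of these pods, so callers can reach them on the stable control-plane node IPs at a fixed port.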
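Finally, for the [12:58] question, the Calico doc linked there configures service IP advertisement through a BGPConfiguration resource; combined with externalTrafficPolicy: Local on the Service, the intent described in the discussion is that the /32 route is only announced from nodes with a ready local endpoint, avoiding SNAT (subject to the graceful-restart/BFD caveats raised at [13:09]). The CIDR and Service below are illustrative placeholders, not the cluster's actual ranges or manifests.

```yaml
# Sketch of the advertise-service-ips setup from the linked Tigera doc.
# 10.96.0.0/16 is a placeholder service CIDR, not the real cluster range.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceClusterIPs:
    - cidr: 10.96.0.0/16
---
# externalTrafficPolicy only applies to NodePort/LoadBalancer Services; with
# "Local", traffic is not SNATed and, per the discussion above, only nodes
# with a running and ready local pod should keep attracting it.
apiVersion: v1
kind: Service
metadata:
  name: coredns-stable
  namespace: kube-system
spec:
  type: NodePort
  externalTrafficPolicy: Local
  selector:
    app: coredns-controlplane
  ports:
    - name: dns
      port: 53
      targetPort: 53
      protocol: UDP
```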