[09:12:24] cdanis: TIL about the pod's labels.. so is it just kubectl edit pod etc.. ?
[09:48:44] it should be. I'm surprised that editing labels is enough to make it work. I would have thought we needed to blank out the metadata.ownerReferences field
[09:52:04] IIRC the labels are the only ones taken into account by the svc in front of the pods
[09:56:09] oh that's right. I somehow misread and assumed that, by removing the labels, not only would the traffic not reach this pod (which is the case, due to what you mentioned), but a new pod would be re-created by the ReplicaSet and the old one would be left dangling
[10:56:34] the old one would indeed be left dangling
[10:56:46] and yes, the ReplicaSet will create a new one
[10:57:22] which means you can isolate a pod from all deployments and traffic by just deleting a few labels, and then run whatever it is you want to run against it
[11:01:20] we should probably be documenting neat tricks like this one somewhere
[11:16:56] https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Administration&diff=prev&oldid=2210195
[11:16:58] there ^
[12:47:41] elukey: yes, there's also the `kubectl label` built-in, which looks nice for scripting
[12:48:34] nice!
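A minimal sketch of the label-removal trick discussed above, assuming a pod managed by a ReplicaSet whose Service selects on app=foo (namespace, pod name and label are illustrative, not taken from the conversation):

    # Remove the selector label; the trailing "-" deletes the label.
    kubectl -n my-namespace label pod foo-7bcc6664bf-abcde app-
    # The Service stops routing traffic to the pod, the ReplicaSet creates a
    # replacement, and the old pod is left dangling for debugging.
    kubectl -n my-namespace get pods --show-labels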
[13:16:17] I need some help with the knative-serving NetworkPolicies from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1054538 plus https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1057859
[13:16:51] AIUI, the current state in deployment-charts should work, but the push fails with (what looks to me like) the activator being unable to talk to the k8s control plane
[13:20:36] E.g. from the logs:
[13:20:38] 2024/07/29 12:49:09 Failed to get k8s version Get "https://10.194.62.1:443/version": dial tcp 10.194.62.1:443: i/o timeout
[13:26:40] I think I just had a brain knot untangle, so don't investigate (unless you're curious)
[14:40:47] ok, I give up for today, I am clearly not understanding how these policies (should) work.
[14:42:54] klausman: the few times I've had to edit calico selectors I used a mix of https://docs.tigera.io/calico/latest/reference/resources/networkpolicy#selectors and also grep in deployment-charts heh
[14:44:59] I have tried selectors (like app == "foo") before, but I've now created the smallest thing that should work, even if it slightly overmatches. But it still doesn't work.
[14:45:12] let me paste the rendered policy
[14:46:02] https://phabricator.wikimedia.org/P67005
[14:46:37] l. 14-19 worked fine in the kserve namespace, so I am pretty confident they should work here.
[14:46:55] the labels section shouldn't matter much (AIUI)
[14:47:18] that leaves the types: and namespaces: stanzas, and I have no idea what is wrong there.
[14:48:42] This policy (again, AIUI) should allow everything in the knative-serving NS to talk to the k8s control plane.
[14:49:22] and yet:
[14:49:43] # kubectl logs -n knative-serving -f activator-7bcc6664bf-jfg75
[14:49:45] 2024/07/29 14:39:35 Failed to get k8s version Get "https://10.194.62.1:443/version": dial tcp 10.194.62.1:443: i/o timeout
[14:51:32] naive q: could the policy work but the traffic be blocked by a missing ferm rule?
[14:54:10] also, I have written a tool that might help you inspect calico network policies and resolve their selectors to pods https://wikitech.wikimedia.org/wiki/User:Brouberol#Link_a_Calico_Network_Policy_to_Pods,_IPs_and_ports
[14:54:19] that might be useful
[14:54:41] as for ferm: probably not? kserve works fine with just the NPs
[14:54:53] it is available under my $HOME on the deploy server
[14:56:56] I will try it, but I have to clean up bad NPs first
[14:57:17] thanks brouberol, nice little page
[14:57:46] <3
[14:59:25] File "/home/brouberol/inspect_calico_networkpolicy", line 93, in main
[14:59:27]     selector = network_policy["spec"]["selector"]
[14:59:29] KeyError: 'selector'
[15:00:29] haaang on. I think my NP cleanup may have actually fixed things...
[15:00:55] ah no, still failed, but at least rollback was quick :-/
[15:01:01] I had assumed that spec.selector was present, which was the case for my rules
[15:01:35] I have made a copy and will hack the source a bit
[15:02:14] Have you tried with `spec.selector: all()`? cf https://docs.tigera.io/calico/latest/network-policy/policy-rules/service-policy
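A rough sketch (not the actual contents of P67005) of a Calico NetworkPolicy using the `all()` selector suggested above, allowing every pod in the knative-serving namespace egress to the API server; the policy name, CIDR and port are illustrative:

    apiVersion: crd.projectcalico.org/v1
    kind: NetworkPolicy
    metadata:
      name: allow-apiserver-egress
      namespace: knative-serving
    spec:
      # all() matches every pod in the policy's namespace
      selector: all()
      types:
        - Egress
      egress:
        - action: Allow
          protocol: TCP
          destination:
            nets:
              - 10.194.62.1/32
            ports:
              - 443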
[15:05:13] not yet. and apparently I can't try live-editing on the server, since helmfile diff/apply just ignores local changes that are not committed to git
[15:06:09] yep, you have to point the chart: at a path instead of a chart name that maps to what we have in https://helm-charts.wikimedia.org/, but if you do this, it will introduce a git diff preventing the timer from pulling
[15:06:28] so you should probably copy both the helmfile and the chart in your home and hack from there?
[15:07:39] I'd suggest kubectl edit for trying these things
[15:07:51] and then come up with the proper chart changes
[15:07:58] faster overall
[15:08:05] that's annoying in this case since I am completely deleting the k8s NP and replacing it with a calico one
[15:12:18] so, what happens is that your deployment fails and the calico NP gets rolled back? or what?
[15:12:47] yep
[15:12:59] you can override that.
[15:13:15] * akosiaris looking
[15:13:27] maybe --skip-cleanup?
[15:13:40] mhno, not by its description
[15:15:08] you want atomic to not be true for that release
[15:15:41] it's under helmDefaults: in admin_ng/helmfile.yaml, but you can override it just for your staging cluster release
[15:16:10] you can also bump the 600 seconds timeout to whatever number you feel comfortable with
[15:16:31] you will end up with an indefinitely broken release, but you will be able to figure out via messing with it manually with kubectl why it's broken
[15:16:49] giving that a go with a copy of the relevant bits in my homedir
[15:16:52] and after that, a new helmfile sync will clean it up
[15:17:19] klausman: re: changes in git, what I often do is, I start with a checkout of the deployment-charts repo in my homedir on the deploy hosts, then I override the helmfile.yaml in the deployment directory under helmfile.d to point to the relative path of the chart in that checkout
[15:17:51] that's helpful when you have to hack on the templates, probably not so much here where kubectl edit would be easier
[15:21:07] I am not sure I understand? helmfile.d/admin_ng/helmfile.yaml does not have any absolute paths, I think?
[15:22:15] yes, because it's referencing the stable release of the chart from the repository
[15:30:36] cdanis: helmfile.d/admin_ng/helmfile.yaml doesn't have chart releases in it
[15:30:45] it's the top-level one that includes all the other ones
[15:31:00] oh, sorry, yes, I assume the knative chart there does though?
[15:31:07] or the instantiation of the knative chart, rather
[15:33:59] klausman: this -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1057907 would stop helmfile from rolling back upon failure
[15:34:11] you can use it as is in your homedir ofc as well
[15:34:57] furthermore, that chart: wmf-stable/knative-serving?
[15:35:07] ack, thanks.
[15:35:10] it can be an absolute path instead of wmf-stable/knative-serving
[15:35:41] and it would stop looking at the helm-charts repository and just deploy whatever you have in that absolute path
[15:35:47] so you can mess all you want then
[15:38:16] that's what I was trying to communicate, but akosiaris said it better :)
[15:39:58] So I changed that line to an absolute one, and edited the version of the knative-serving chart, but I am still getting 0 diffs.
[15:40:30] helm repo list also still shows the wmf one
[15:48:33] ah, I had missed another chart: line
[15:49:43] brouberol: I think the all() bit might have done the trick.
[15:56:29] and of course now I have a change that I *thought* I had done three hours ago, that works fine. But it failed back then for likely orthogonal reasons (like a lack of cleanup)
[15:57:06] So atm, I think I have good state in my homedir, and staging should be working fine. I'll get back to this tomorrow and see about the other six million policies.
[15:57:11] Thanks everyone! <3
[16:02:56] thanks as well!
[16:17:38] I'll probably do a writeup of the "so you want to live-hack charts" approach
[16:31:14] that would be swell
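A hedged sketch of the live-hacking overrides discussed above, as a per-release stanza in a personal copy of the helmfile (path, release name and values are illustrative, not the exact contents of change 1057907):

    releases:
      - name: knative-serving
        # Point at a local checkout instead of the published wmf-stable chart,
        # so helmfile deploys whatever is in that path.
        chart: /home/<you>/deployment-charts/charts/knative-serving
        # Don't roll back on failure, so the broken release can be inspected
        # with kubectl; a later helmfile sync cleans it up.
        atomic: false
        # Bump the default 600s timeout if needed.
        timeout: 1200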