[11:07:44] Would someone be able to be a second pair of eyes on my network policy stuff? I am out of ideas why one pod cannot talk to another.
[11:08:27] klausman: I can check after lunch if you are not in a hurry
[11:08:37] sure, that works
[11:09:17] ack.. if you want to write in here what you tried and what doesn't work etc., I'll read it later (or someone else may do it earlier and beat me to it :)
[11:09:37] yeah, making a paste
[11:12:47] https://phabricator.wikimedia.org/P67223 has the current policy (made with kubectl edit), a log entry from the running activator (it logs a _lot_ of those, so I am pretty sure they all fail), and the describe on the failing probe mentioned in the log. AIUI, destination.selector: app-wmf=="kserve-inference" should work, but apparently it does not. Note that there are more errors in the
[11:12:49] logs, but I have to start somewhere
[11:14:11] I think it may be only the outlink-topic-model probe that is failing, but it's unclear why
[12:08:21] yeah, I am pretty sure it's only the o-t-m probe that is failing, which may be independent of the netpol.
[12:08:38] Still, a second pair of eyes on the current netpol state would be terrific.
[12:09:13] It might sound a little pedantic, but have you tried selector=="app-wmf=='kserve-inference'" ?
[12:09:27] not yet, I'll give it a shot
[12:09:40] In my experience, the quoting around selectors has always felt weird enough to assume that the single quotes had their importance
[12:09:52] this is completely from intuition, not backed by any data
[12:10:03] sorry, typo
[12:10:11] selector: "app-wmf=='kserve-inference'"
[12:11:16] yeah, I figured :) Doesn't seem to make a difference
[12:11:32] and once I re-edit, the outer double quotes are gone
[12:18:41] ack
[12:19:40] let me rework my script to see how these policies resolve
[12:20:01] ty!
[12:26:45] root@deploy1003:/home/brouberol# ./inspect_calico_networkpolicy -n knative-serving
[12:26:45] [+] NetworkPolicy knative-serving-knative-serving-activator-calico-egress
[12:26:45] Pods:
[12:26:45] - activator-549b955cfd-s6tt5
[12:26:45] - activator-549b955cfd-tfsnt
[12:26:46] Service kubernetes -> ips=10.192.16.93, 10.192.48.64, port=TCP/6443
[12:26:46] ...
[12:27:09] so, it seems that the label selector resolves to pods, and that the service selector is correct as well
[12:27:59] The question then remains why the policy for the otm doesn't work --- IF that is the problem
[12:29:07] I have a similar-looking policy for a service I'm currently writing. Let me add this policy to your paste
[12:29:34] ack
[12:33:15] So I think the spec-level selector I have works, since the pod can talk to the k8s masters (AFAICT).
[12:33:53] Using a services-style destination is not really feasible for the activator, since it needs to talk to all inference services
[12:39:35] Given that the probes to e.g. articlequality-predictor-00020-deployment-55d5fd859-shskc in the same NS seem to work fine, I am getting more and more convinced that this is not a netpol problem, but an otm one
[12:43:47] I'll do some digging in the otm pods, see if I can find something
[12:52:43] I can see that the Prom probes to /metrics on the outlink pods work fine, so it's not completely broken. There are also OTM pods outside of experimental that seem to work fine. Best guess: the ones in experimental are broken "somehow" and this is not a netpol problem.
[12:52:48] brouberol: thanks for your help!
[12:52:59] And sorry about it being a bit of a goose chase/red herring.
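For reference, a Calico egress rule of the shape being discussed would look roughly like the minimal sketch below. This is not the actual paste from P67223: the policy name, the activator label and the apiVersion are assumptions; only the app-wmf=='kserve-inference' destination selector comes from the conversation above.

    # Minimal sketch of the kind of object the kubectl edit above would produce.
    kubectl apply -n knative-serving -f - <<'EOF'
    apiVersion: crd.projectcalico.org/v1   # projectcalico.org/v3 when applied via calicoctl
    kind: NetworkPolicy
    metadata:
      name: knative-serving-activator-calico-egress   # assumed name
    spec:
      selector: app == 'activator'                    # assumed label on the activator pods
      types:
        - Egress
      egress:
        - action: Allow
          protocol: TCP
          destination:
            # selector from the chat; note the single quotes around the label value
            selector: app-wmf == 'kserve-inference'
            # no ports listed: Calico then allows all destination ports for this rule
    EOF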
[13:04:37] my pleasure
[13:05:05] klausman: back!
[13:05:16] I am checking https://phabricator.wikimedia.org/P67223, is it still the latest one?
[13:05:41] elukey: I think it's not a networkpolicy problem, but those pods that can't be probed look like leftovers
[13:05:59] They are in the experimental NS, but the current HEAD of deployment-charts does not specify them.
[13:06:40] The only probe failures in the activator logs are for those two pods.
[13:06:50] So it likely can probe everything else.
[13:07:43] Otherwise, the paste is still accurate
[13:07:50] I am not getting one thing though - isn't the policy only needed to allow the activator to contact the kube api?
[13:08:15] pod egress to other pods should be already granted by the defaults that we have
[13:08:33] Well, that may have been my brainfart: I saw the probe failures and assumed the new policy was lacking that bit.
[13:09:13] I probably could have tried looking at older logs when I first deployed the policy, but I didn't think of that
[13:10:15] I think that it tries to contact the queue proxy, IIRC the 8012 port should be for it
[13:10:47] qq - have you tried to kill all activators and check the logs again?
[13:10:54] just to be sure that it still happens
[13:11:00] Not yet, will do in a second
[13:12:13] yeah let's see if the logs keep happening
[13:13:40] restarted both pods, they still log the same probe failures.
[13:14:18] ml-staging right?
[13:14:23] Yep
[13:15:06] I've asked Aiko about the pods (since, as mentioned, they are not in the deployment-charts anymore) and she said that Ilias was experimenting with them and more entrypoint args, so it's unclear what state they are in.
[13:21:47] I checked https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1054538/4/charts/knative-serving/templates/networkpolicy-activator.yaml and there was nothing related to allowing egress to kserve-inference pods
[13:22:07] yes, I was doing kubectl edits
[13:22:12] and now I am wondering if the policy, as it is, blocks more traffic than it should
[13:22:20] also there is no port specified afaics
[13:22:24] not sure what the default is
[13:22:27] I wasn't sure if I had missed something in the old policy
[13:22:42] what do you mean?
[13:23:10] So the old policy that clearly works in prod is a K8s NetworkPolicy, which has a different syntax than the new Calico one
[13:23:46] I wasn't sure if my "conversion of intent" was correct, hence me fiddling with the edits.
[13:24:07] sure sure, but the content should be the same, since they assume a default (pod egress to other pods allowed)
[13:24:18] But if the OTM pods are broken by themselves, then the policy I have (minus the rules you mentioned that may be superfluous) might be just fine
[13:24:52] yes yes, you can try to check via nsenter if from the activator container you can reach any other kserve-inference container
[13:25:02] I.e. just allowing egress to the k8s control plane
[13:26:37] yep
[13:27:02] how does one resolve a container's endpoint to talk to? is it part of the describe output?
[13:28:37] ah yes, IP and port fields
[13:29:29] and a curl to articlequality's pod /metrics works, OTM hangs
[13:29:52] I am pretty sure by now that the pods for OTM are just broken somehow
[13:36:14] So I am somewhat inclined to just delete them, go back to the simpler netpol and continue working on the patchset. wdyt?
[13:38:29] sure
[13:38:39] when you say delete them, do you mean removing the isvc?
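The nsenter check suggested above could look roughly like the sketch below, assuming containerd/crictl on the node; the pod names, container ID and port are placeholders, and the target IP and port come from the describe output mentioned in the chat.

    # 1. Get the target pod's IP (also visible in kubectl describe, as noted above):
    kubectl get pod -n experimental -o wide | grep outlink-topic-model
    # 2. On the node hosting the activator pod, find the activator container's PID:
    crictl ps --name activator
    PID=$(crictl inspect --output go-template --template '{{.info.pid}}' <container-id>)
    # 3. curl from inside the activator's network namespace, e.g. the /metrics
    #    endpoint used above for the articlequality comparison:
    nsenter -t "$PID" -n -- curl -sv "http://<target-pod-ip>:<port>/metrics"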
[13:40:45] Well, there's nothing in the deployment charts for them, so I suspect I'd have to use kubectl delete on the deployments and pods
[13:43:33] I don't know how Ilias created them in the first place.
[13:53:31] okok yes, clean those up
[13:54:01] if needed we'll add them as isvcs; check if the isvc object is still there
[13:54:09] if so, maybe dropping it should suffice
[13:55:28] ack, will do
[13:57:35] kubectl delete -n experimental inferenceservices.serving.kserve.io outlink-topic-model did the trick
[13:58:03] and the logspam about failed probes has stopped \o/
[13:58:18] thanks, Luca!
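The cleanup discussed above, spelled out; the get is only there to confirm the leftover InferenceService object still exists before dropping it, the delete is the command quoted in the chat.

    kubectl get inferenceservices.serving.kserve.io -n experimental
    kubectl delete -n experimental inferenceservices.serving.kserve.io outlink-topic-model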