[10:08:30] does our envoy configuration template support the injection of custom filters? We're deploying airflow instances that are exposed publicly (with OIDC auth), but I'd like the /api/* URL prefix to only be reachable from our internal network. Is that possible?
[10:16:34] the first thing that comes to mind is "filter-local-ratelimit-svc" that ML uses to rate limit inbound traffic, but it is related to the Istio sidecars.. Probably something like that could work if you use the Istio Ingress (so in the dse's istio config you'll have to inject your filter)
[10:17:07] I didn't know though about airflow being exposed externally, is it documented somewhere?
[10:17:41] I guess that local port forwarding on k8s could be more difficult for people that want to self-manage jobs
[10:30:51] > is it documented somewhere?
[10:30:51] Yes, we've written an Airflow HA architecture document in which we mention it https://docs.google.com/document/d/1peCptYVHtjVG825vXnZHOc2cicV3X83WT1u24iWIrHs/edit#heading=h.r7i7313rg2d7 and it's in our Phabricator as well
[10:37:32] to clarify, they are not exposed externally *now*, but they will be once we migrate the webserver services to dse-k8s
[12:05:10] ack ack!
[12:58:09] brouberol: our envoy configuration template is definitely modifiable :)
[13:17:19] ack, thanks! I'll have a look when we get to it
[13:21:22] jayme: do you think you'd have time for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1075311 and related today?
[13:28:55] cdanis: yes. I wanted to make sure alex has seen this because IIRC he had some objections against externalIP usage in the past. But I've a meeting with him in an hour where we can discuss
[13:29:01] thanks
[13:29:22] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1075918/1 def. needs a chart version bump - that I can tell already :)
[13:29:25] ah ofc
[13:30:09] ✅
[13:32:39] ah, maybe it was this https://github.com/projectcalico/calico/issues/3585#issuecomment-635424401
[13:36:32] so docker-report keeps failing to pull the catalog from the registry, even with 300s of TTFB timeout in nginx (and docker-report hitting the discovery endpoint directly)
[13:36:43] at this point I am going to concentrate on the registry's GC
[13:36:56] * elukey cries in a corner
[13:42:05] jayme: I saw that too, but there's also https://github.com/kubernetes/kubernetes/issues/91374#issuecomment-635708873
[13:42:50] and then, https://github.com/kubernetes/kubernetes/issues/94499
[13:42:55] so I think that's long been fixed
[13:49:10] yeah, the feature flag has been removed with 1.22
[13:49:33] long been fixed does in no way mean we have it - unfortunately. But here we're lucky it seems
[13:49:40] yes I checked :)
[13:49:56] and I was pretty sure it'd be okay, because the feature flag was present since 1.18
[14:33:48] cdanis: that's what alex dug up https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/484670
[14:34:50] jayme: hmm ok, this isn't a NodePort though
[14:35:46] yes, but it makes this solution less accessible (or inaccessible) for all other services
[14:36:29] if that is still true
[14:36:31] I wish there was a k8s issue attached :D
[14:39:09] yeah...I think what I'm saying is: We should maybe test this before implementing something :)
[14:39:45] sure, how do you suggest I best do that? is there a reason to not just try out those patches + the values changes in staging?
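For readers following along: the staging experiment being discussed amounts to a Service spec change along these lines. This is only a sketch with illustrative names, ports, and addresses; the real values live in the linked deployment-charts patches.

```yaml
# Sketch of the kind of Service change under discussion (illustrative only).
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: kube-dns
  # externalTrafficPolicy: Local preserves the client source IP and avoids
  # SNAT, but is only accepted for NodePort/LoadBalancer Services.
  type: NodePort
  externalTrafficPolicy: Local
  ports:
    - name: dns-udp
      port: 53
      protocol: UDP
      targetPort: 53
  # externalIPs would be the alternative being debated; note they can be
  # blocked by the DenyServiceExternalIPs admission plugin.
  # externalIPs:
  #   - 192.0.2.53   # placeholder address
```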
[14:42:36] I'd also feel fine just doing some `kubectl edit` in staging tbh
[14:42:52] the other issue we had with announcing service ips is that, without externalTrafficPolicy: Local, we had to enable masquerading because the traffic returning has the source IP of the pod (and won't reach the client)
[14:43:00] ah
[14:43:19] not sure how that is with "externalTrafficPolicy: Local" - I think we did not check
[14:43:33] I think it should work but certainly we can check
[14:52:39] jayme: akosiaris: do you object to me live-experimenting with this on staging?
[14:53:27] cdanis: was about to say, I'm okay with you kubectl editing a prototype in staging-codfw
[14:54:07] great, thanks
[14:54:59] would be nice to have a test case for all those scenarios and document that somewhere...sorry that we're not that much help currently - we're a bit swamped and haven't planned to work on this really
[14:55:20] yes it's okay, I just don't want to step on toes or do anything controversial
[14:56:13] and I don't have a firm sense of what kinds of manual tests / networking changes are considered reasonable
[15:23:03] ah
[15:23:04] # services "kube-dns" was not valid:
[15:23:06] # * spec.externalTrafficPolicy: Invalid value: "Local": may only be set when `type` is 'NodePort' or 'LoadBalancer'
[15:23:45] ahaha and then ofc
[15:23:46] error: services "kube-dns" could not be patched: services "kube-dns" is forbidden: Use of external IPs is denied by admission control
[15:24:01] ah, nict
[15:24:05] *nice
[15:24:47] yeah, there is DenyServiceExternalIPs
[15:25:09] should be somewhere in /etc/kubernetes on the k8s masters
[15:25:50] "This feature is very powerful (allows network traffic interception) and not well controlled by policy. "
[15:26:59] but it should be possible to answer the masquerading question without externalIP (by announcing the serviceIP range instead)
[15:27:27] maybe that's the other reason we never looked back on externalIP
[15:30:57] maybe
[15:31:08] I'm also happy to experiment with announcing the single ClusterIP for coredns
[15:31:50] yeah, I'd fall back to that tbh instead of also enabling externalIP
[15:31:57] 👍
[15:32:29] I might still have to make it NodePort, to enable externalTrafficPolicy: Local, to prevent SNAT
[15:32:31] but we'll see
[16:08:46] sigh, the service ClusterIP is advertised, and that works, and I did the above re: NodePort & extTrafPol, but, the SNAT is still happening
[16:09:28] yeah, that was the assumption...bummer
[16:09:43] https://phabricator.wikimedia.org/P69427
[16:09:45] akosiaris: ^
[16:10:04] cdanis: no objections on my side fwiw, but it's a pretty interesting set of tests that need to be run
[16:10:23] but yes, externalIPs + clusterIP should do the SNAT (otherwise it wouldn't work)
[16:10:28] I also modified the Corefile config map https://phabricator.wikimedia.org/P69428
[16:11:14] what's reflect for?
[16:11:20] can't remember having that one
[16:11:27] it's a fake top-level domain
[16:11:34] TIL
[16:11:38] `whoami` is the plugin
[16:11:47] it serves an A and a SRV like shown in the first place
[16:11:55] with seen client IP + port
[16:12:17] that's neat
[16:12:28] much like reflect.wikimedia.org :D
[16:17:50] but it's SNAT on the way to the pod or am I reading this wrong?
[16:18:13] and the response makes it back to deploy1003, despite coming from the pod ip?
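The reflect/whoami trick described above would look roughly like this in the CoreDNS ConfigMap. This is only a sketch; the actual change is in P69428, and the zone name is simply the fake TLD mentioned in the chat.

```yaml
# Sketch of a Corefile with a fake "reflect" zone served by the whoami
# plugin, which answers with the client IP (A) and source port (SRV) it sees.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    reflect {
        whoami
    }
    .:53 {
        # ... existing kubernetes / forward / cache config stays as-is
    }
```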
[16:20:25] jayme: yep
[16:20:56] ehm, not from the pod IP
[16:21:11] from the service ip then
[16:21:20] that's why it works
[16:21:28] yes
[16:21:41] but with "externalTrafficPolicy: Local" it should stop working then
[16:22:54] 76 3.431388610 10.192.76.3 → 10.64.16.93 DNS 160 Standard query response 0x9b4b A reflect A 10.192.32.101 SRV 0 0 37915 OPT
[16:25:01] cdanis: but that's with 'externalTrafficPolicy: Cluster' still?
[16:25:12] (currently that's what is configured for the service)
[16:25:13] uhh
[16:25:16] oh I switched it back
[16:25:38] wait
[16:26:04] so I'm SNATted by kubestagemaster2004.codfw.wmnet. which makes no sense at all
[16:26:14] coredns-7bd797d7dd-np95m 1/1 Running 0 11d 10.192.75.52 kubestage2002.codfw.wmnet
[16:26:16] coredns-7bd797d7dd-zkf49 1/1 Running 0 25m 10.192.75.199 kubestage2001.codfw.wmnet
[16:26:16] the bgp announcements should also change, as with Local, only the nodes running coredns should announce 10.192.76.3
[16:26:22] yep
[16:26:47] cdanis: if you have the appetite and time, may I shoot towards you an idea I had for solving this without service ips and all this natting?
[16:26:54] akosiaris: sure
[16:26:59] I'm not sure I have the time
[16:27:12] the *other* thing I wanted to try was, to just ... manually patch in a LoadBalancer IP
[16:27:17] essentially this: https://docs.tigera.io/calico/latest/reference/resources/ipreservation and https://docs.tigera.io/calico/latest/reference/configure-cni-plugins#requesting-a-specific-ip-address
[16:27:36] statically allocate 4 pods with IPs
[16:27:40] akosiaris: right but that means we need to do something like manage a bunch of single-replica ReplicaSets or something
[16:27:42] and delegate to those IPs directly
[16:28:00] but...even with Local (like it is now) the response still reaches the client
[16:28:02] there is also another slightly different approach
[16:28:12] which doesn't need single pod managing
[16:28:17] let me find the doc
[16:30:05] btw, there is https://github.com/projectcalico/calico/issues/6787 with a similar need/idea
[16:30:12] still open
[16:31:27] > Indeed, I am directly creating a crd.projectcalico.org/v1 IPPool resource rather than projectcalico.org/v3 via calicoctl, because I'm creating it as part of a larger Helm chart. (Unless there's a more correct way to create Calico resources within charts?)
[16:31:38] love the crickets on that one, I don't know the answer either
[16:31:46] https://docs.tigera.io/calico/latest/reference/configure-cni-plugins#using-kubernetes-annotations
[16:32:01] I'm confused now as to why it still works rn
[16:32:03] so instead of doing it per pod manually
[16:32:17] you can do it for the entire namespace per docs
[16:32:37] the entire namespace
[16:32:40] pick a /29 or a /30, put all coredns in their namespace
[16:32:48] and kinda done
[16:32:55] hopefully!
[16:32:57] there's nothing that says the deployment has to be in the kube-system namespace, right? just the kube-dns service?
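A rough sketch of the per-namespace pinning akosiaris is describing. Pool name, CIDR, and namespace are illustrative placeholders; see the Calico docs linked above for the authoritative syntax.

```yaml
# Dedicated small IPPool for coredns pods, plus a namespace annotation that
# steers pods created in that namespace into the pool (illustrative values).
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: coredns-pool
spec:
  cidr: 192.0.2.240/29   # placeholder CIDR; would be a small real /29 or /30
  blockSize: 29
  ipipMode: Never
  natOutgoing: false
---
apiVersion: v1
kind: Namespace
metadata:
  name: coredns
  annotations:
    cni.projectcalico.org/ipv4pools: '["coredns-pool"]'
```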
[16:33:15] endpoints can't cross namespace boundaries
[16:33:16] I don't think you need a service at all in this setup
[16:33:27] oh you're saying run a separate coredns deployment
[16:33:31] aaaah you mean the standard one
[16:33:42] hmm good catch
[16:33:49] that has to stay, definitely
[16:33:51] yeah we might need a separate deployment
[16:33:54] I had thought about something like that, but, you need at least N+1 pod IPs to do a rolling update and thus you probably want to do healthchecking on the IPs
[16:34:06] and thus you maybe can't do just simple pure delegation
[16:34:07] alternatively, we can move all of coredns resources in their own namespace?
[16:34:13] (it depends on how smart our recdnses are I guess)
[16:34:46] btw, you probably want to keep the N pretty low
[16:34:52] akosiaris: I don't understand what relies upon looking up kube-system service/kube-dns, either via the API or via the environment variables you'll automatically get in kube-system pods
[16:34:55] cause of the sheer size of the delegations
[16:35:06] we should not move kube-dns out of kube-system, that will def bite us
[16:35:16] akosiaris: well yeah, I'm trying to keep N=1 :D
[16:35:34] I was shooting for like 3, but N=1 is also ok
[16:35:43] yeah
[16:35:55] jayme: that's a good point, but I am not sure why
[16:36:06] I itch to ask what would break
[16:36:09] and I fear the answer
[16:36:37] if in doubt, because! It's de-facto standard...I would try not to rattle :)
[16:36:40] ^
[16:36:57] ok, so
[16:37:01] hm
[16:37:13] I want to figure out why kubestagemaster* is advertising this IP at all
[16:38:00] but still...I know I'm annoying but: 1) why does externalTrafficPolicy: Local not work as advertised and 2) is that a problem (do we care about the source IP in DNS)
[16:39:37] and yes 3) why does kubestagemaster advertise the ip
[16:41:24] hmm the annotation I pasted above apparently also works on the deployment object?
[16:41:38] interesting
[16:41:40] maybe we don't even need to move away from kube-system?
[16:41:50] I don't hate that
[16:41:51] maybe I'm just tired and I'm mixing things up. Will drop off for today - but I'm veeery curious
[16:42:00] jayme: yeah, they are very good questions
[16:42:08] phew :D
[16:42:24] with a proper IPPool created it shouldn't be more than 1 deployment
[16:42:30] hmmm
[16:42:35] akosiaris: I'll give that a try in staging as well
[16:42:41] but I got to put the kids to sleep
[16:42:47] ttyt
[16:43:55] but don't annotate the deployment object itself if possible, but the podsepc
[16:43:57] podspec
[16:44:27] oh ofc
[18:07:48] the kubestage workers have a BGP session only with their ToR switch, but, the masters have BGP sessions only with both CRs?
[18:11:05] ah I see
[23:05:32] is anyone aware of any changes made to wikikube in eqiad around 7:30 UTC today (9/26) that might have created a sudden influx of LIST ops for ipamblocks resources?
[23:05:35] https://grafana.wikimedia.org/goto/UjCozXgNg?orgId=1
[23:07:08] we're getting a fair amount of chatter from KubernetesAPILatency in -operations, and it might just be that if these ops tend to be slow, we need to special-case the alert.
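Circling back to the Calico annotation thread from earlier in the evening: the podspec-level variant jayme suggests (annotate the pod template, not the Deployment object itself) would look roughly like this. A sketch only; image, labels, and pool name are illustrative and assume an IPPool like the one sketched above.

```yaml
# Sketch: the cni.projectcalico.org/ipv4pools annotation on the pod template,
# so every coredns pod gets its address from the dedicated pool.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        cni.projectcalico.org/ipv4pools: '["coredns-pool"]'
    spec:
      containers:
        - name: coredns
          image: coredns/coredns:1.11.1   # illustrative tag
          args: ["-conf", "/etc/coredns/Corefile"]
```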
[23:07:36] however, what's weird is we're _only_ seeing this in eqiad, which suggests something's not quite right
[23:12:52] correction, it's actually a whole bunch of calico-related resources, it's just that ipamblocks is the only one slow enough to be alerting :)
[23:29:59] alright, I've heard out of band that this is a side-effect of a large number of new worker nodes coming up around that time this morning in eqiad.
[23:29:59] no action required and will be addressed by freeing up more /26's via decoms tomorrow
[23:37:56] silenced KubernetesAPILatency for 24h for resource=~"(blockaffinities|ipamblocks)" site="eqiad" (f33e2c42-d921-4686-a761-59d43ad14d84)