[08:27:59] <_joe_> akosiaris: I was thinking about integrating cfssl with kubernetes. For now, we'll just generate the certs and write them automagically somewhere, but in the mid term, do you think it would be valuable to try to use the k8s internal facilities to manage certs, like the certificates.k8s.io interface?
[08:35:50] well, note that certificates.k8s.io doesn't have CA history (it GCs everything after 1h - configurable IIRC). Also, if we want to automate (and this only makes sense if we want to automate it), we'll need something like cert-manager
[08:36:57] otherwise we'll end up having to manually do kubectl certificate approve|deny, which isn't really much of a UX improvement
[08:37:20] <_joe_> yeah I was looking and there is no integration with helm
[08:37:45] <_joe_> so yes, we'd need something that allows you to generate certs using CRDs if we want a fully automated process
[08:38:19] <_joe_> for now we can just have cfssl write out the certs in the appropriate data structures instead of having all the awkward stuff in the private repo
[08:39:09] ah wait, there is also a built-in signer in kube-controller-manager
[08:39:30] so we don't need something like cert-manager, we can rely on kube-controller-manager
[08:39:44] at least for a while, I am pretty sure cert-manager is more fully fledged
[08:40:25] to enable it all we need is --cluster-signing-cert-file and --cluster-signing-key-file
[08:41:04] although that only does the signing part, the actual review is still up to a human
[08:43:17] <_joe_> yeah I would love to just add the cert via a CRD in the helm manifests
[08:43:36] <_joe_> if tls.enabled
[08:44:31] note btw that that API makes a distinction on the signer name
[08:44:36] https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers
[08:48:05] <_joe_> there is also the added complexity that we need the CA to work across clusters
[08:49:29] if we care about history, yes
[08:49:57] if not (treating it more statelessly), not so much (although there is probably some detail I am not seeing yet)
[08:52:28] <_joe_> you need to call services in one cluster from the other at times
[08:53:08] <_joe_> so you need to either add a CA per cluster to your manifests, or have a global CA
[09:08:13] global CA of course :P
[09:11:37] <_joe_> and that doesn't work well with the in-cluster management :P
[09:43:06] serviceops, SRE: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (Joe) A possible procedure we can use is the following: - Automate building base images using debuerreotype and docker daily. These images will be bare minimum debian-slim images similar in all respe...
[13:13:14] couldn't we just tier this stuff the way the public ecosystem works? have a global internal root CA which signs an intermediate for k8s use, and have the k8s cert consumers send the intermediate in the chain to clients to link it up?
[13:57:07] sure, I don't think we ever pondered creating an entirely new CA, but rather a subCA
[14:12:57] <_joe_> yeah the idea was a subCA
[14:13:25] <_joe_> but why send back a whole chain when we can have better options is my question, tbh
[14:13:53] <_joe_> although envoy has greatly reduced the cost of tls negotiation internally
[14:22:28] we were talking about populating envoy, right? I was scoping down to that tbh
[14:22:44] as in... most other services currently in k8s don't even need it
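
A minimal sketch of the certificates.k8s.io flow discussed above, assuming a Helm chart would render something like this per service when tls.enabled is set. The service name, namespace, and signerName are illustrative placeholders, not real values; the built-in kube-controller-manager signer (enabled via --cluster-signing-cert-file/--cluster-signing-key-file) only handles the well-known kubernetes.io/* signer names, so a custom signer name like the one below would need its own controller (e.g. cert-manager).

```python
# Sketch: build a CSR and wrap it in a certificates.k8s.io/v1 object.
import base64
import json

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

service = "myservice"       # hypothetical service name
namespace = "mynamespace"   # hypothetical namespace

# Generate a key and a CSR for the in-cluster service DNS names.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, f"{service}.{namespace}.svc")]))
    .add_extension(
        x509.SubjectAlternativeName([x509.DNSName(f"{service}.{namespace}.svc.cluster.local")]),
        critical=False,
    )
    .sign(key, hashes.SHA256())
)

# spec.request is the base64-encoded PEM CSR.
manifest = {
    "apiVersion": "certificates.k8s.io/v1",
    "kind": "CertificateSigningRequest",
    "metadata": {"name": f"{service}-tls"},
    "spec": {
        "request": base64.b64encode(csr.public_bytes(serialization.Encoding.PEM)).decode(),
        "signerName": "example.org/in-cluster-services",  # hypothetical signer name
        "usages": ["digital signature", "key encipherment", "server auth"],
    },
}

print(json.dumps(manifest, indent=2))  # pipe into `kubectl apply -f -`
```

Even with this in place, approval stays a separate step (kubectl certificate approve, or an auto-approving controller), and the issued certificate has to be collected promptly since the API garbage-collects CSR objects, which is the lack of CA history mentioned above.
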
[15:35:32] _joe_: if you are still around
[15:35:44] I removed poolcounter1004 from config
[15:35:48] https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&from=now-15m&to=now&refresh=1m
[15:36:10] there are still locks, but I obviously have no idea when they are going to be released
[15:36:24] should I wait longer or should I just go ahead and reboot the server?
[15:36:48] <_joe_> take a look at what's connecting maybe?
[15:36:56] <_joe_> I bet you it's something using python
[15:36:57] serviceops, Prod-Kubernetes, Pybal, SRE, Traffic: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (akosiaris) Upstream calico issue at https://github.com/projectcalico/calico/issues/4607 I am also working on a PR, I'll post...
[15:37:32] effie: probably ss/netstat on the machine will be best -- an open lock always means an active TCP connection
[15:37:49] I also have a few tcpdump recipes for inspecting poolcounter traffic
[15:37:54] I was afraid you would say that
[15:38:04] some of them I even documented!
[15:38:16] https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production
[15:39:09] <3
[15:39:18] <_joe_> yeah I was suggesting ss :P
[15:39:26] <_joe_> that's enough to get the info we actually want
[15:40:19] ok it is thumbor, which was the #1 suspect
[16:23:45] random question, not sure for who (maybe _joe_?) -- do we have a count/rate of successful edits in prometheus? or only graphite
[16:24:08] <_joe_> only in graphite
[16:24:56] <_joe_> at some point we'll need to tackle rewriting the metric names in mw in a way that makes it easy to just convert them from statsd to prometheus, and use statsd-exporter
[16:25:19] ack
[16:26:09] makes sense, ty, I'll investigate the graphite json 'api' some :)
[16:26:42] <_joe_> ugh, good luck :)
[16:26:47] <_joe_> did that, regretted it
[22:31:03] serviceops, MW-on-K8s: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 (Legoktm)
[22:32:28] serviceops, MW-on-K8s: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 (dancy) >>! In T282824#7086759, @Legoktm wrote: > If we want to keep that caching I think triggering purges for those URLs after pushing a new image makes the most sense to me,...
[22:38:11] serviceops, MW-on-K8s: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 (Legoktm) Related: {T256762} and maybe also {T264209}. >>! In T282824#7086763, @dancy wrote: >>>! In T282824#7086759, @Legoktm wrote: >> If we want to keep that caching I think...
[23:32:53] serviceops, MW-on-K8s: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 (dancy) p:Low→High After changing from docker-registry.wikipedia.org to docker-registry.discovery.wmnet I get the following error during the image build process: ` Step...
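
A rough sketch of the "take a look at what's connecting" step from the PoolCounter discussion above: run ss on the poolcounter host and count established client connections per peer, since an open lock always means an active TCP connection. It assumes PoolCounter listens on TCP 7531 and that ss prints the default Recv-Q/Send-Q/Local/Peer column layout; adjust the port if your setup differs.

```python
# Count established PoolCounter clients by peer address (run on the poolcounter host).
import subprocess
from collections import Counter

POOLCOUNTER_PORT = 7531  # assumption: default PoolCounter port

out = subprocess.run(
    ["ss", "-H", "-t", "-n", "state", "established", f"( sport = :{POOLCOUNTER_PORT} )"],
    capture_output=True, text=True, check=True,
).stdout

# Each row: Recv-Q Send-Q Local-Address:Port Peer-Address:Port
peers = Counter()
for line in out.splitlines():
    fields = line.split()
    if len(fields) >= 4:
        peers[fields[3].rsplit(":", 1)[0]] += 1

for peer, count in peers.most_common():
    print(f"{count:5d}  {peer}")
```

Resolving the top peer addresses (reverse DNS, or ss -p for process names) quickly shows which client class still holds locks; in the log above it turned out to be thumbor.

A hedged sketch of the Graphite render "API" mentioned at the end of the conversation, using format=json on the /render endpoint. The metric path is a placeholder rather than the real name of the edit-success metric, and graphite.wikimedia.org is assumed to be the reachable Graphite host.

```python
# Query Graphite's render endpoint for JSON datapoints of a (placeholder) metric.
import requests

GRAPHITE = "https://graphite.wikimedia.org"
TARGET = "MediaWiki.edit.success.count"  # hypothetical metric path, check the real name

resp = requests.get(
    f"{GRAPHITE}/render",
    params={
        "target": f"summarize({TARGET}, '5min', 'sum')",
        "from": "-1h",
        "format": "json",
    },
    timeout=10,
)
resp.raise_for_status()

# Graphite returns a list of series; each datapoint is [value, timestamp],
# with value set to null (None) where an interval has no data.
for series in resp.json():
    for value, ts in series["datapoints"]:
        if value is not None:
            print(ts, series["target"], value)
```
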
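The summarize() wrapper is optional; querying the raw target works the same way, just with one datapoint per storage interval. This is the kind of ad-hoc JSON scraping that a statsd-to-prometheus migration via statsd-exporter, as suggested above, would eventually make unnecessary.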