[08:42:06] serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (JMeybohm) p:Triage→Medium
[08:42:45] serviceops, Kubernetes, User-jijiki: Deploy kube-state-metrics - https://phabricator.wikimedia.org/T264625 (JMeybohm)
[08:42:47] serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (JMeybohm)
[08:42:49] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (JMeybohm)
[08:45:03] serviceops: Upgrade kubernetes nodes to kernel 4.19.x - https://phabricator.wikimedia.org/T255273 (JMeybohm)
[08:45:07] serviceops, Prod-Kubernetes, Kubernetes, User-jijiki: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527 (JMeybohm)
[08:46:42] serviceops, Wikifeeds, Kubernetes: wikifeeds-production-tls-proxy regularly exceeding its k8s CPU reservation - https://phabricator.wikimedia.org/T266194 (JMeybohm) a:JMeybohm I think you are right, thanks for the heads up! While this probably is also an issue of "too much throttling" (T262527),...
[10:05:13] serviceops, Wikifeeds, Kubernetes, Patch-For-Review: wikifeeds-production-tls-proxy regularly exceeding its k8s CPU reservation - https://phabricator.wikimedia.org/T266194 (akosiaris) That's pretty interesting, there shouldn't be so much throttling at so low CPU usage. user+system summed barely h...
[10:11:16] serviceops, Wikifeeds, Kubernetes, Patch-For-Review: wikifeeds-production-tls-proxy regularly exceeding its k8s CPU reservation - https://phabricator.wikimedia.org/T266194 (JMeybohm) >>! In T266194#6570986, @akosiaris wrote: > That's pretty interesting, there shouldn't be so much throttling at so...
[12:57:25] akosiaris: jayme: what's the eventual plan (if any) for CI validation of helm charts and values.yaml files? :)
[13:01:05] cdanis: they are being validated right now. Both helm charts AND deployments
[13:01:23] as in validated they are valid per the kubernetes schemas
[13:01:36] not just valid yaml. Is there more you would like to see?
[13:01:46] akosiaris: I was asking about the 'ressources' typo :)
[13:02:03] I understand that values.yaml can be freeform in some parts, but not all parts, right?
[13:02:27] ah, yeah that one is interesting. That's an invalid key but valid yaml. So it does not override anything and would end up being a no-op
[13:02:59] right, but I feel like that's something that we can/should protect against -- admin module data.yaml for instance was similar, but we added validation against a schema there
[13:03:51] note btw, that if the typo was not in ressources, but in limits (e.g. lamits) CI would have caught it
[13:04:19] cause it wouldn't generate an invalid manifest for kubernetes
[13:04:22] it would*
[13:04:50] Yeah, I just made a pretty good typo :-P
[13:06:30] not sure how we could have caught that tbh. Maybe diff against a manifest without the patch? And say it's a noop? but that wouldn't catch the typo, it would just tell you your change doesn't really do anything
[13:07:17] And that's probably something that we want to be able to do (noop changes)
[13:15:07] it's kind of a weird corner case. If the input was resulting in invalid kubernetes yaml we would have caught it, but it doesn't, cause it's a noop. But there isn't really a schema for values.yaml files as they can have arbitrary data in them
[13:27:09] akosiaris, jayme, cdanis: not directly related, but I really enjoyed this talk on Senpai, a system that "starves" memory and tries to find out the minimal memory footprint: https://www.youtube.com/watch?v=ujk2pfgPul8
[13:28:08] Watched it because it came indirectly through a candidate for our open position. Indirectly: I watched the candidate's presentation, this one was next and sounded pretty interesting.
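
On the 'ressources' case discussed above (a key that is valid YAML but matches nothing in the chart, so the override is a silent no-op): one possible check is to compare the keys of an override file against the chart's default values and flag any key the defaults do not know about. The following is only a minimal sketch under stated assumptions. The function name `unknown_keys`, the `tls.resources` layout, and the rule that an empty mapping in the defaults marks a freeform section are all hypothetical and not taken from the actual charts or CI setup. Plain dicts stand in for parsed YAML so the example stays self-contained.

```python
from typing import Any, Dict, Iterator


def unknown_keys(defaults: Dict[str, Any], overrides: Dict[str, Any],
                 path: str = "") -> Iterator[str]:
    """Yield dotted paths of override keys that have no counterpart in defaults."""
    for key, value in overrides.items():
        dotted = f"{path}.{key}" if path else str(key)
        if key not in defaults:
            # Valid YAML, but nothing in the chart would ever read this key.
            yield dotted
        elif isinstance(value, dict) and isinstance(defaults[key], dict) and defaults[key]:
            # Recurse only into non-empty mappings. An empty mapping in the
            # defaults is treated as freeform (assumption made for this sketch).
            yield from unknown_keys(defaults[key], value, dotted)


# Hypothetical chart defaults (what the chart's own values.yaml would contain).
chart_defaults = {
    "tls": {
        "resources": {
            "requests": {"cpu": "200m"},
            "limits": {"cpu": "500m"},
        },
    },
}

# Hypothetical deployment override containing the typo from the discussion.
override = {"tls": {"ressources": {"limits": {"cpu": "1"}}}}

for dotted_path in unknown_keys(chart_defaults, override):
    print(f"unknown key in override: {dotted_path}")
```

This would flag the typo'd key without rejecting intentional no-op changes, since it looks only at key names and not at whether the rendered manifest differs. Sections that are legitimately freeform would need an explicit allowlist or a marker like the empty-mapping convention assumed here.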