[07:37:32] o/ just wondering what the idea is with not pinning image versions in helm charts? :)
[07:45:40] 10serviceops, 10Operations: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10Dzahn) 06:36 mutante: deneb - systemctl start docker-reporter-releng-images 06:37 <+icinga-wm> RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational h...
[07:57:42] <_joe_> addshore: not pinning chart version in helmfile.d, rather
[07:59:05] oh yes *looks for the chart*
[08:07:25] _joe_: I'm confused about where and why the various versions are pinned (images and charts)
[08:08:00] <_joe_> so I'd like for jayme to be around as well for this discussion, but basically the concept is
[08:08:03] It seems like the current situation makes things a little bit tricky, in terms of having image tags pinned in helmfile.d values files, but having the chart always specify latest, but then also not pinning the version of the chart being used
[08:08:10] ack!
[08:08:41] <_joe_> so the idea is that the chart should always be at its latest version, consider it the "puppet code equivalent"
[08:08:58] <_joe_> whereas the image version and the values of config are what you actually deploy
[08:09:13] <_joe_> and those should be injected / pinned in the values.yaml files for helmfile
[08:10:38] Feel free to give me a ping if you want a real-time chat about it too :)
[08:10:53] Yeah, I'm also up for a real-time chat about it
[08:11:19] I *think* the issue is that because the chart is tightly coupled to the image we generally need to keep the versions in lock step
[08:11:46] Yes ^^
[08:12:05] but it sounds like making / updating charts is also a bit more effort than it should be currently?
[08:12:07] <_joe_> how come the chart is tightly coupled to the image?
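[Editor's note: for context, the split being discussed can be sketched roughly like this. Paths, keys, and tag names are illustrative, not the actual repo contents.]

```yaml
# charts/termbox/values.yaml — defaults shipped with the chart
main_app:
  version: latest                  # chart default, expected to be overridden

# helmfile.d/services/<env>/termbox/values.yaml — per-deployment values
main_app:
  version: 2020-05-28-production   # image tag pinned at deploy time
```

The chart itself is not pinned anywhere in this layout, which is what the rest of the discussion is about.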
[08:12:26] <_joe_> most people update the chart once a year, if ever
[08:12:41] <_joe_> so if you need to do it a lot, then there is something we should do better
[08:12:46] well, the chart defines the env vars that the image requires, it defines the health checks that the image specifies, it defines everything; it is the glue between k8s and the image
[08:12:49] I'm not sure it should be; but it is, see charts/termbox/templates/deployment.yaml#37
[08:13:06] <_joe_> addshore: well usually those should not change much?
[08:13:06] where the chart talks about specific file paths in the image
[08:13:23] it feels like the wmf way is currently not doing that though, and instead trying to make the helmfile not need to change and always work for everything and just change the values of the chart, which feels a bit backwards to me
[08:13:44] <_joe_> actually it's not
[08:13:52] So, running k8s in other places, the charts change all of the time, and the only thing I change in my values file is normally the version of the chart being used
[08:14:31] <_joe_> that's exactly how it should be, changing the chart should mostly happen if something serious changes, like something structural
[08:14:53] <_joe_> IMO, at least
[08:15:21] <_joe_> again, think of the chart as the infrastructural glue around your application, why should that change every time?
[08:15:41] <_joe_> what you see is the fact we're not publishing charts for external use
[08:15:53] <_joe_> if we did, we might want to bump the chart version for every deployment
[08:16:03] _joe_: I think in this case the problem is that charts currently "poke about" inside the image
[08:16:41] <_joe_> so, looking at your image
[08:16:54] <_joe_> sorry, your chart
[08:17:19] <_joe_> what does need to change from image version to image version?
[08:19:20] _joe_: so we need to be free to change things like the path to serverBuild/src/server/service-runner-entrypoint.js without it breaking
[08:19:39] <_joe_> tarrow: so you're free to do it, ofc
[08:20:05] but of course if we are on unpinned versions and we release a new chart then we can't run helmfile until we also bump to the new image?
[08:20:08] <_joe_> you just update your chart, and in the same commit you can also change the image version both in the chart and in values.yaml
[08:20:18] but then we can't roll back
[08:20:22] <_joe_> why?
[08:20:32] <_joe_> I'm sorry I'm not understanding this
[08:20:38] because the old image won't work with the new chart
[08:20:42] <_joe_> let me try to explain how I think it should work
[08:20:47] <_joe_> and tell me what's wrong
[08:21:01] <_joe_> so, let's assume you have to change something in your chart
[08:21:08] but how can you then update the images one DC at a time?
[08:21:26] or try the new chart and image on the test deployment first?
[08:21:43] (I'm here and read the backlog btw.)
[08:21:49] o/ :D
[08:21:51] jayme: o/
[08:22:22] <_joe_> so, let me try to describe that specific scenario: you have a new image version that needs an update to the chart for $reasons
[08:22:28] yep
[08:23:12] <_joe_> I would say that changing the command should be avoided in general as much as possible. And usually it's good practice to have a "/run.sh" inside the image, but I digress
[08:23:44] I was going to say I think it's less about what we need to change, and more about what can change in new image versions, and that is env vars, resource requirements and limits that should be imposed on the app, needed mounts, paths to anything, port? But if the idea is that the chart and the deployed values.yml are always updated together, then that makes that less of a problem.
[08:23:44] However IMO it's still going in the opposite direction of everything that I have learnt so far, which is interesting, and I want to understand.
[08:23:44] I guess Tom's multi-DC change point is still very valid
[08:24:25] <_joe_> addshore: most people are also using the docker cache when building stuff, so I wouldn't consider what "the industry standard" is on stuff re: containers to always make sense
[08:24:42] <_joe_> so
[08:25:15] <_joe_> so, you have to do this. You change the chart, bump the version of the image in the values.yaml of the chart, build a new chart version, and remove any reference to image versions from the helmfile.d values.yaml, as *what is in the chart is what you want to deploy now*
[08:25:51] _joe_: how do I do that only for the test deploy and not also change what is in production though?
[08:25:56] <_joe_> then you can deploy to staging, wait as long as you want, then deploy to production, even one dc at a time
[08:26:20] <_joe_> If you want to run different versions for weeks, then probably it's a good idea to temporarily pin the chart version
[08:26:36] <_joe_> rolling back is just one "git revert" in this scenario
[08:27:08] <_joe_> addshore: the *default* values.yaml is tied to the chart version. But it doesn't mean that the image version should be tied to the chart
[08:27:24] <_joe_> imagine if you did CD and you deployed 5 versions of termbox a day
[08:27:48] <_joe_> would you consider it logical to create a new chart every time, even if the only thing changing is the image version?
[08:28:50] <_joe_> addshore: to be fair to "the industry standard", if you were preparing charts to distribute termbox to others, those charts would be in lockstep in terms of image version
[08:29:16] "would you consider it logical to create a new chart every time, even if the only thing changing is the image version?" Yes
[08:29:18] <_joe_> but this is for semi-continuous deployment in your infra, it's different
[08:30:20] <_joe_> addshore: it's just useless, and more work.
My opinion is that you should change the chart only when 1) you want to distribute to others a newer version with more features 2) you make a breaking change
[08:30:30] _joe_: so I think the problem is that we got bitten with that strategy. If all the charts are pinned at latest, but the production values file is now incompatible with the new chart, then you are prevented from running helmfile (e.g. to touch production) until you revert to the old chart
[08:31:01] <_joe_> tarrow: as I said, I can imagine chart version pinning to happen in specific cases
[08:31:17] <_joe_> like you want to keep a different breaking chart version in prod vs test for some time
[08:31:30] <_joe_> but in general, you should not need it in standard operating situations
[08:31:46] I don't know how much more work it is, I haven't deployed anything using this setup yet at all. But for a regular deploy isn't it as simple as: bump the chart version and the version of the image in the values files for the chart (2 lines), and also then bump the version of the chart being used in the helmfile.ds?
[08:31:53] _joe_: what is the benefit of then pinning to latest; why not just bump the version when you need a new chart?
[08:32:18] <_joe_> addshore: yes, vs just bumping the image version in helmfile.d
[08:32:33] <_joe_> it's more work, and I think not needed
[08:32:44] <_joe_> tarrow: I didn't say to pin to latest
[08:32:51] <_joe_> sorry if I wasn't clear
[08:33:14] <_joe_> I said do not pin in helmfile.d, which means the latest version will be used :)
[08:33:38] <_joe_> but again, pinning chart versions of charts we control should be an exception, not the rule, IMO
[08:34:01] what's the disadvantage in pinning chart versions?
[08:34:04] <_joe_> this all needs to be codified somehow though, and we're all learning the best way to handle this stuff.
[08:34:46] doesn't that mean that we can easily use one chart on test and one chart in prod without accidentally bumping the prod version when testing a new chart?
[08:34:55] <_joe_> that's more work 99% of the time, it clogs our charts repo with more stuff, and given it's the opposite of what everyone else is doing it's confusing for SREs
[08:35:06] okay, so the chart could specify the image version, rather than have it in the helmfile.d? or is that also not preferred?
[08:35:40] <_joe_> addshore: the chart specifies an image version
[08:35:52] <_joe_> you can override it in helmfile.d if you're updating it
[08:35:57] currently for termbox as "latest"
[08:36:02] <_joe_> if you want, think of this in terms of semantic versioning
[08:36:04] <_joe_> EWWWWW
[08:36:12] <_joe_> latest should *never* be used :)
[08:36:23] maybe that is the eww bit that we are all actually talking about then :P
[08:36:26] <_joe_> like, never. It's bound to bite you
[08:36:31] <_joe_> no,
[08:36:40] <_joe_> so let me try to make an example
[08:36:59] I mean, I see it in other helm charts too https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/recommendation-api/values.yaml#L12
[08:37:21] it's quite common for wmf charts to have version: latest in values.yaml _joe_
[08:37:21] <_joe_> that doesn't mean it's correct
[08:37:25] <_joe_> yeah
[08:37:27] <_joe_> sigh
[08:37:32] <_joe_> I was just seeing that
[08:37:41] <_joe_> ok, that's not great, let's say
[08:37:46] <_joe_> and I hope everyone overrides that then
[08:37:53] IMO, if we could actually put the image version there, all of this would be a lot better
[08:37:55] <_joe_> ok, I was missing this detail
[08:38:42] <_joe_> jayme: I think we need to write some best practices advice and gather feedback on it
[08:39:01] So, for me the conceptual problem is something like "if the chart is tightly coupled to the specifics of the image then the chart and image version should be near each other"
[08:39:01] <_joe_> nothing is still properly codified on this topic
[08:39:31] <_joe_> tarrow: but my point is - image specifics should not change with every commit to termbox
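[Editor's note: a hedged sketch of the workflow _joe_ describes — the chart's own values.yaml pins a concrete image tag instead of latest, and helmfile.d only overrides it while an update is in flight. Version numbers and paths here are hypothetical.]

```yaml
# charts/termbox/values.yaml — committed together with any chart change
main_app:
  version: 2020-05-28-production   # a concrete tag, never "latest"

# helmfile.d/.../values.yaml — override only while rolling out a new image
main_app:
  version: 2020-06-02-production
```

Once the new image is everywhere, the override can be dropped and the pinned tag in the chart becomes the single source of truth again.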
[08:39:55] <_joe_> so you should be able to use - most of the time - the same chart and just bump what image is release
[08:39:58] <_joe_> *released
[08:40:05] I think it's okay to have "version: latest" (which will not be working as we don't push latest tags) for a specific case... but that's my 2 cents and I feel like all of this is heavily based on personal preference/experience in the past. And agreed _joe_ - we should write something up
[08:40:46] _joe_: gotcha; IMHO if the chart was less coupled it would be clearer. e.g. if the chart didn't use args but instead used the ENTRYPOINT defined in the image
[08:41:03] _joe_: IMO that "most of the time" just sets us up ready to do the wrong thing on occasion and deploy broken stuff
[08:41:38] +1 to having something written up :)
[08:44:52] _joe_: also, is it "right" to revert the chart version by typing git revert? I thought charts were supposed to be immutable and so "reverting" should actually be making a new chart that is like the old one?
[08:45:24] Perhaps this whole discussion is also slightly skewed by the fact that right now the charts are .tgz in the repo? rather than being served from somewhere else? I don't know if that is the planned long-term solution, but it might end up altering how viable the currently done thing is as a solution.
[08:45:48] also, if we only ever use the latest it would be clearer if we didn't have all the old charts in the repo if we're never supposed to use them?
[08:46:18] ^^ +1 to those points too
[08:46:58] <_joe_> Sorry my computer is acting up
[08:47:15] <_joe_> I'm on the phone rn
[08:47:21] yeah... I think having the chart repo inside of git contributes to the problem as it is annoying to maintain and makes all this helm chart history very "visible"
[08:47:51] gotcha
[08:48:03] <_joe_> Yes, that needs to change. And we plan to
[08:48:52] but is it ever right to `git revert` these charts? Or should we always be incrementing the version number?
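[Editor's note: one way the "pin the chart version as the exception" idea could look in helmfile. The release name and version numbers are hypothetical; helmfile's releases spec does accept a `version:` field for the chart.]

```yaml
# helmfile.yaml (sketch)
releases:
  - name: termbox
    chart: wmf-stable/termbox
    version: 0.0.24   # pinned; a "revert" is moving this back to 0.0.23
                      # omit the field to always deploy the latest chart
```

With this shape, old chart versions are never mutated; rolling back is a values change, not a `git revert` of the chart itself.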
[08:49:38] Imo we should always be incrementing the version number. And a "revert" could be done in the helmfile.d values file by reverting the chart version being used.
[08:50:02] And I think once the helm chart repo isn't in git that's probably the direction that would end up being needed too, perhaps
[08:51:42] <_joe_> as I said - I feel that the chart should be bumped when something changes in the chart - but I'm willing to be convinced of the opposite and that we should, once we do CD, update the chart version for every commit
[08:51:56] <_joe_> but it seems not useful to me.
[08:52:32] Looking at the history of the charts, I can also see the desire from operations to be able to quickly change things across multiple deployments at once; imo all of the things that are desired to change in that way should perhaps be changeable in the values
[08:52:55] <_joe_> what do you mean?
[08:53:52] <_joe_> we usually perform two types of operations: add stuff to the charts, or bump versions of sidecar containers
[08:54:07] <_joe_> when we do the latter, we want to just change things in helmfile.d
[08:54:34] <_joe_> and btw, the organization of helmfile.d is...
it needs more engineering time
[08:55:56] I'm trying to write things (problems and current assumptions of how things "should be") down as I feel it's too much and maybe confusing to do in IRC and I don't want things to get lost https://docs.google.com/document/d/1RAYtAoI-xXLFicSpEc46TQ53DiVXE9EZB_6ETX_xOdQ/edit#
[08:56:21] * addshore requests access
[08:56:37] my meetings start about now, but very willing to contribute thoughts etc :)
[08:57:37] <_joe_> sigh
[08:57:55] <_joe_> that's one of the reasons why I hate that the WMF is using the g-suite so much
[08:59:15] i have the same thought with wmde ;)
[09:00:07] I'm happy to use something else as well... don't know how a pro/con docs/etherpad decision should be made though
[09:00:26] I thought like "temporary stuff in etherpad"
[09:00:39] <_joe_> jayme: yes, more or less
[09:00:47] <_joe_> my old workflow was
[09:01:00] <_joe_> new stuff I'm kinda-sketching in etherpad initially
[09:01:11] <_joe_> then under my user namespace on wikitech
[09:01:24] <_joe_> once we agree on a solution, move to the general namespace in wikitech
[09:01:40] <_joe_> but in this case we can just add tarrow and addshore on the doc :)
[09:02:34] <_joe_> I wasn't criticizing your use of the gsuite, I'm just unhappy we're using it in general
[09:03:08] Don't worry. That's how I got it ;)
[09:09:58] Thanks!
I added some comments about where I think we might have different assumptions (either between us or just the status quo)
[09:10:30] Also, in case it wasn't obvious: I'm no helm/helmfile expert :P
[09:11:28] <_joe_> no one is
[09:11:40] <_joe_> these things have been around for literally 1.5 years
[09:11:52] <_joe_> and they changed model 4 times in the meanwhile
[09:11:59] <_joe_> so whoever claims to be an expert is lying
[09:12:19] <_joe_> we all have some past experience with deploying software, and we should start from there :)
[09:21:31] :D
[09:31:45] 10serviceops, 10Operations, 10decommission, 10ops-codfw, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Dzahn) >>! In T247018#6169200, @Papaul wrote: > @Dzahn Please to not resolve yet. I still have mgm...
[09:57:48] <_joe_> well my trial of cdk8s starts well. First step of the "getting started with python" tutorial starts with "npm install"
extract values from an API at deployment time
[11:34:26] <_joe_> for the developer, it could boil down to something that looks a lot like writing a values.yaml file if it was possible to define a global helm chart that could serve most of our applications
[11:34:39] <_joe_> without making it a fractal horror like it would be in helm
[11:35:23] <_joe_> we could have nice docs for our proposed abstractions, and writing new ones would involve no usage of go text/template
[11:36:23] <_joe_> so tldr - it looks like the correct way to abstract kubernetes manifests, but it needs quite some engineering time to allow it to be usable for us
[11:37:35] <_joe_> and we would need to write something to wrap the cli to generate the yaml we want to deploy and then use kubectl to do so
[11:38:53] <_joe_> there is also quite some work involved to convert what we have to something like this
[11:39:23] I have an nginx-related question/problem, maybe someone can help
[11:39:35] I'm trying to configure auth_request in nginx. When I try to go to a protected endpoint after setting it up, nginx throws 500 and logs show that "auth request unexpected status: 307 while sending to client"
[11:39:42] I'm guessing that 307 is being used to redirect to the login page (which is ok, that's what I want)
[11:39:50] but apparently that's not how auth_request works
[11:40:09] I'm probably using it incorrectly with mediawiki's (commons.wikimedia.org to be precise) oauth2 endpoint
[11:40:11] <_joe_> can I ask you to take a step back and tell me what you're trying to do?
[11:40:15] sure
[11:40:26] <_joe_> oh you are trying to set up oauth2 with mediawiki in nginx
[11:40:39] yeah, that's what I wanted to write :)
[11:40:51] <_joe_> is that for WDQS by any chance?
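[Editor's note: per the ngx_http_auth_request_module documentation, the auth subrequest may only return 2xx (allow) or 401/403 (deny); any other status, such as the 307 above, is treated as an error, hence the 500. A common pattern is to make the subrequest answer 401 and issue the login redirect yourself. The endpoint paths below are assumptions for illustration, not the actual config.]

```nginx
location = /_oauth_check {
    internal;
    # Must come back as 2xx, 401 or 403 — never a redirect.
    proxy_pass https://commons.wikimedia.org/w/rest.php/oauth2/resource/profile;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}

location /protected/ {
    auth_request /_oauth_check;
    # Map "denied" to a login redirect instead of letting an
    # unexpected subrequest status bubble up as a 500.
    error_page 401 403 = @login;
}

location @login {
    return 302 https://commons.wikimedia.org/w/rest.php/oauth2/authorize;
}
```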
[11:41:00] <_joe_> sorry, this is just curiosity
[11:41:01] that's, again, what I wanted to write :)
[11:41:02] <_joe_> :P
[11:41:34] <_joe_> but ok, now I can tell you I never ever tried to set up oauth2 at the webserver layer, but it seems like a perfectly legitimate thing to do
[11:41:35] or to be precise with the current acronyms - WCQS (Wikimedia Commons Query Service) we are setting up
[11:41:53] I think so, too, just not having much luck with it
[11:42:32] <_joe_> so first thing I'd check is if we even install the auth_request module, but I guess we do
[11:43:24] yeah, logs are from that module
[11:43:56] <_joe_> so, I don't have a magic bullet but seeing the configuration might help
[11:44:57] <_joe_> sorry, I've been called to lunch, but I'll read later :)
[11:45:02] np
[11:45:16] I'll paste it here, just have to edit out confidential data...
[11:47:13] https://www.irccloud.com/pastebin/brptkNNM/
[11:48:02] I'm guessing that proxy_pass url may be wrong
[11:57:45] <_joe_> yeah I'm not sure that's how proxy_pass works in nginx in fact
[11:58:19] <_joe_> and I refer to the query string
[11:59:25] <_joe_> the rest of it seems reasonable
[12:01:00] <_joe_> btw I assume you could set those two query string parameters in the body of the request
[12:01:17] <_joe_> but sorry, I don't have good answers just off the top of my head :/
[13:58:40] <_joe_> jayme: you were saying? https://github.com/awslabs/cdk8s/issues/21
[13:58:42] <_joe_> :P
[13:59:23] * jayme runs away
[14:18:00] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 (10akosiaris) p:05Triage→03Medium > My suggestion is either have CI build the index or, even better, have CI do "helm package" as well. +1 > We could also run an instance of...
[14:18:26] ahahahahaha
[14:19:21] <_joe_> akosiaris: this morning I was about to start playing with cdk8s and jayme joked about it becoming another layer of abstraction for helm
[14:19:26] <_joe_> and ta-dah
[14:19:43] <_joe_> and I get why people are thinking that
[14:19:50] <_joe_> the nice part of helm is the deployment tool
[14:19:58] haven't had time to look at it yet
[14:20:10] what does it do ?
[14:20:43] <_joe_> cdk8s? you write python modules, and then run "cdk8s synth" and that spits out kubernetes yaml
[14:21:21] ah, like pulumi? https://www.pulumi.com/docs/get-started/kubernetes/begin/
[14:21:26] <_joe_> the whole thing is done in a way that you can create your library of "constructs", combine them, inherit them and so on
[14:22:23] there's a couple more who do similar things
[14:22:29] <_joe_> akosiaris: let's say cdk8s seems a little less brutal
[14:22:58] <_joe_> the advantage of pulumi is it also does the deployment ofc
[14:22:59] there's also jsonnet, kustomize, ksonnet
[14:23:09] <_joe_> yeah, all quite meh
[14:23:16] yup, not arguing
[14:23:30] <_joe_> it's not my first dance searching for a way to write $language and not yaml + go text/templates
[14:23:50] <_joe_> cdk8s seems to have gotten one thing right
[14:23:55] <_joe_> the idea of constructs
[14:24:02] same here. It seems it's still very much in flux
[14:24:06] <_joe_> that's more or less a collection of k8s resources
[14:24:19] I remember reading https://blog.argoproj.io/the-state-of-kubernetes-configuration-management-d8b06c1205 1 year ago and saying.. hmm maybe not yet
[14:24:39] _joe_: a collection of k8s resources? Does that ring a bell ?
[14:25:09] tell me it's also declarative and we are headed right towards puppet all over again :-)
[14:25:17] * akosiaris just joking
[14:29:15] <_joe_> akosiaris: it definitely does, but this at least is a proper programming language
[14:29:39] <_joe_> akosiaris: I should've said - a parametrized collection of k8s resources
[14:35:56] defines ftw!
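[Editor's note: the "parametrized collection of k8s resources" idea can be illustrated with a stdlib-only toy. This is not the real cdk8s API, just a sketch of the construct concept: a construct bundles several resource manifests behind a few parameters, and synth() mimics `cdk8s synth` by emitting them as plain data.]

```python
import json


class Construct:
    """A parametrized collection of k8s resources: subclasses add
    resource dicts, and larger constructs compose smaller ones."""

    def __init__(self):
        self.resources = []

    def add(self, resource):
        self.resources.append(resource)

    def synth(self):
        # Stand-in for `cdk8s synth`: emit all resources as one document.
        return json.dumps(self.resources, indent=2)


class WebService(Construct):
    """Deployment + Service for a web app, parametrized by image and size."""

    def __init__(self, name, image, port=8080, replicas=1):
        super().__init__()
        self.add({
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": name},
            "spec": {
                "replicas": replicas,
                "template": {"spec": {"containers": [
                    {"name": name, "image": image,
                     "ports": [{"containerPort": port}]},
                ]}},
            },
        })
        self.add({
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {"name": name},
            "spec": {"ports": [{"port": port}]},
        })


# A developer's "chart" collapses to a couple of parameters:
svc = WebService("termbox", "example-registry/termbox:2020-05-28", replicas=2)
print(svc.synth())
```

The appeal is that inheritance and composition replace go text/template; the cost, as noted above, is reimplementing everything helmfile already provides around the synthesized output.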
[14:37:44] _joe_: jayme: btw. I was looking at the amount of traffic that kubernetes currently handles. https://grafana.wikimedia.org/d/000000519/kubernetes-overview?panelId=9&fullscreen&orgId=1 pretty nice I think
[14:38:08] there's a few things missing, like zotero requests, but I can live without them
[14:38:43] and that's without sessionstore on group0
[15:27:32] 10serviceops, 10Packaging: Please provide our special component/php72 in buster-wikimedia - https://phabricator.wikimedia.org/T250515 (10MoritzMuehlenhoff) That's a total of 20 packages needing a rebuild, realistically we'll only approach this once we move production to Buster.
[15:44:11] 10serviceops, 10Operations: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 (10MoritzMuehlenhoff)
[16:14:18] 10serviceops, 10Recommendation-API, 10Release-Engineering-Team, 10Services, 10Patch-For-Review: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (10akosiaris)
[16:39:22] 10serviceops, 10observability: "PHP opcache hit ratio" alert shouldn't bother on mwdebug*/scandium/etc - https://phabricator.wikimedia.org/T254025 (10CDanis)
[16:39:42] 10serviceops, 10observability: "PHP opcache hit ratio" alert shouldn't bother on mwdebug*/scandium/etc - https://phabricator.wikimedia.org/T254025 (10CDanis) p:05Triage→03Low
[19:41:39] 10serviceops, 10Packaging: Please provide our special component/php72 in buster-wikimedia - https://phabricator.wikimedia.org/T250515 (10Jdforrester-WMF) >>! In T250515#6177325, @MoritzMuehlenhoff wrote: > That's a total of 20 packages needing a rebuild, realistically we'll only approach this once we move prod...
[19:52:01] 10serviceops, 10Operations, 10Thumbor, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber)