[07:25:06] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 (JMeybohm) Open→Resolved Finally removed the chart tgz's from git and the mirror from releases. Closing.
[07:35:08] <_joe_> \o/
[09:12:53] nice
[09:50:31] <_joe_> hnowlan: take a look at how we reorganized blubberoid's helmfile.d as well
[09:50:46] <_joe_> I'm finalizing details on that change
[10:05:40] very cool
[10:43:38] <_joe_> hnowlan: cool that you like it! I'll probably ask you to convert changeprop/api-gateway to the new structure soon :)
[10:45:34] <_joe_> uhm I made a typo, and I expected the deployment-charts CI to catch it...
[10:45:34] sweet, will do
[10:52:59] <_joe_> jayme: and ofc we have to rewrite part of our CI :/
[10:53:54] FYI it seems that on testreduce1001 puppet is doing a change at every run: parsoid-vd ensure changed 'stopped' to 'running'
[10:53:58] see for example https://puppetboard.wikimedia.org/report/testreduce1001.eqiad.wmnet/a6f75ee20bdf30c1a371101ffcfc4cd7589eca11
[10:54:48] <_joe_> mutante: ^^
[10:54:58] <_joe_> jayme: also, helmfile apply fails
[10:56:01] <_joe_> because apparently it doesn't pass the kubeconfig arg to helm diff
[10:56:38] fun fact that may be interesting for you: if your host is about to OOM, the "puppet changes on every run" will alert because puppet will fail to run every time. This led us to have a proper memory check alert
[10:58:06] <_joe_> no it isn't
[10:58:28] <_joe_> Unable to connect to database, error: Error: ER_ACCESS_DENIED_ERROR: Access denied for user 'testreduce'@'10.64.48.40' (using password: YES
[10:58:50] <_joe_> I guess some grants are missing, mutante will work on it
[10:59:59] <_joe_> jayme: so specifically, "helmfile diff" passes along what is in helmDefaults.args, while helmfile {apply,sync} don't
[11:00:10] <_joe_> so when listing releases, helmfile diff says
[11:00:45] <_joe_> exec: helm list ^production$ --tiller-namespace blubberoid --deployed --failed --pending --kubeconfig=/etc/kubernetes/blubberoid-staging.config
[11:01:15] <_joe_> while helmfile apply says
[11:01:18] <_joe_> exec: helm list ^production$ --tiller-namespace blubberoid --deployed --failed --pending
[11:01:26] <_joe_> notice anything missing? :P
[11:35:34] _joe_: what?!
[11:35:51] that does not make sense does it?
[11:37:29] <_joe_> ofc not
[11:37:51] <_joe_> so after lunch i'm going through helmfile's sources
[11:39:18] <_joe_> it's comforting, nonetheless, to think this is the deployment software we're adopting
[12:26:40] _joe_: helm code looks fun compared to helmfile ... :-/
[12:39:24] <_joe_> jayme: this is even crazier
[12:40:38] <_joe_> so just to ensure things would operate as expected, I added the kubeconfig env var
[12:41:06] <_joe_> and magically, other invocations of helm list *do* include the correct args
[12:41:25] on apply?
[12:43:10] I think this highly depends on the codepath taken in helmfile... every "helmfile yadda" more or less calls "helm diff ... ... " which then calls "helm list". And in some occations "helm diff .. .. " is missing --kubeconfig which then is missing in the following "helm list" calls ofc
[12:44:15] <_joe_> this is the output I'm referring to: https://phabricator.wikimedia.org/P12207
[12:45:15] is that from "helmfile -e staging apply" in blubberoid?
[12:47:03] because mine looks different...
[12:47:18] Ah. Thats the output *with* kube_env set, right?
[12:47:21] <_joe_> yes
[12:48:33] <_joe_> so that "Listing releases" comes from state.ListReleases
[12:49:36] <_joe_> oh damn, I think I found it
[12:51:55] <_joe_> https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/helmfile/+/refs/heads/master/pkg/state/state.go#468 isReleaseInstalled calls
[12:52:03] <_joe_> https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/helmfile/+/refs/heads/master/pkg/state/state.go#760 listReleases
[12:52:30] <_joe_> this gets the flags from st.connectionFlags(release), and then appends "--deployed", "--failed", "--pending"
[12:53:16] <_joe_> now connectionFlags is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/helmfile/+/refs/heads/master/pkg/state/state.go#1735 and clearly doesn't look at the helmdefaults
[12:53:36] they are set earlier
[12:55:30] <_joe_> where?
[12:55:39] There is this argparser thing https://github.com/roboll/helmfile/blob/v0.125.5/pkg/argparser/args.go#L64
[12:56:08] <_joe_> yes, but that struct, HelmState, has a .HelmDefaults property
[12:56:20] <_joe_> which surely contains it
[12:56:32] <_joe_> but it's not passed to that specific list command
[12:57:12] the argparser.GetArgs is called by the commands in app.go
[12:57:54] like here https://github.com/roboll/helmfile/blob/v0.125.5/pkg/argparser/args.go#L64
[12:58:09] <_joe_> sure
[12:58:20] <_joe_> so we have those in our objects
[12:58:28] wrong link, sorry https://github.com/roboll/helmfile/blob/9a03d79a841a753a74db31c24e3884eb05e0ec94/pkg/app/app.go#L1246
[12:59:09] so the instance helm should contain the --kubeconfig in helm.extra
[12:59:36] and thats added to the cmdargs in exec.go ~362
[12:59:41] <_joe_> uhm
[12:59:45] <_joe_> why added only there?
[13:00:38] <_joe_> https://github.com/roboll/helmfile/blob/9a03d79a841a753a74db31c24e3884eb05e0ec94/pkg/app/app.go#L1187 this happens before and it's where things fail
[13:01:10] <_joe_> because that function calls isReleaseInstalled
[13:01:46] <_joe_> so something tells me that if we set the additional args to that helm instance we pass to that function, things will be fixed
[13:03:03] <_joe_> jayme: to clarify, what if we moved r.helm.SetExtraArgs(argparser.GetArgs(c.Args(), r.state)...)
[13:03:13] <_joe_> up at the start of the function?
[13:03:14] hmm...right. Maybe just do arg parsind earlier
[13:03:16] <_joe_> I mean
[13:03:17] hrhr...yeah
[13:03:30] pushing binary to deploy1001
[13:03:34] <_joe_> ahah ok
[13:04:27] <_joe_> *sigh*
[13:06:42] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Vgutierrez) hmm that's interesting, please note that this is not the first time we use websockets. etherpad.wm.o i...
[13:10:48] looking good _joe_
[13:10:54] <_joe_> ahahaha
[13:10:58] <_joe_> seriously?
[13:11:02] at least no diff...
[13:11:02] <_joe_> it was that simple?
[13:11:25] <_joe_> where is the binary? under your home/bin?
[13:11:31] yup
[13:11:59] maybe there is a reason the argparse stuff has been done later in the code...IDK tests pass as well
[13:12:30] I've patched that for apply/delete/sync
[13:13:08] <_joe_> any idea why I find a difference in checksum/tls-certs ?
[13:13:18] <_joe_> nothing else changes, that seems a bit strange
[13:13:26] old tls_template without my fix?
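
The ordering problem and the fix discussed above (setting the extra args before isReleaseInstalled runs) can be illustrated with a minimal sketch. It is written in Python rather than helmfile's Go, and it is not helmfile's code: FakeHelm, is_release_installed, apply_buggy and apply_fixed are hypothetical names, and the assumption that extra args are appended at the end of the command is taken only from the "exec:" lines quoted earlier in the log.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHelm:
    """Hypothetical stand-in for a helm executor; it only records commands."""
    extra_args: list = field(default_factory=list)
    executed: list = field(default_factory=list)

    def set_extra_args(self, *args):
        # Loosely models r.helm.SetExtraArgs(...) from the discussion.
        self.extra_args = list(args)

    def list_releases(self, name, connection_flags):
        # Connection flags, then the status filters, then whatever extra
        # args happen to be set on the executor at the time of the call.
        cmd = ["helm", "list", f"^{name}$", *connection_flags,
               "--deployed", "--failed", "--pending", *self.extra_args]
        self.executed.append(" ".join(cmd))


# Values copied from the commands quoted in the log.
HELM_DEFAULTS_ARGS = ["--kubeconfig=/etc/kubernetes/blubberoid-staging.config"]
CONNECTION_FLAGS = ["--tiller-namespace", "blubberoid"]


def is_release_installed(helm, name):
    helm.list_releases(name, CONNECTION_FLAGS)


def apply_buggy(helm):
    is_release_installed(helm, "production")  # runs before the extra args exist
    helm.set_extra_args(*HELM_DEFAULTS_ARGS)  # too late for the call above


def apply_fixed(helm):
    helm.set_extra_args(*HELM_DEFAULTS_ARGS)  # moved to the start of the function
    is_release_installed(helm, "production")


buggy, fixed = FakeHelm(), FakeHelm()
apply_buggy(buggy)
apply_fixed(fixed)
print(buggy.executed[0])
print(fixed.executed[0])
```

In this toy model the buggy ordering prints the helm list command without --kubeconfig and the fixed ordering prints it with the flag, mirroring the two "exec:" lines from helmfile apply and helmfile diff.
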
[13:13:44] then the order of key and cert is random
[13:14:00] try again and you should have the chance of no diff :-)
[13:14:42] see comment on https://phabricator.wikimedia.org/P12207 for the patch
[13:15:59] <_joe_> jayme: ahhh yes the old template yes
[13:16:06] <_joe_> we didn't renew the chart
[13:16:21] <_joe_> that's what was evading me
[13:16:25] <_joe_> "didn't we fix that?"
[13:19:38] I don't see why they are copying the r.state and r.helm pointers around there and not use them later ...
[13:20:12] feels a bit like we're missing something but so far it seems not
[13:22:37] <_joe_> are you preparing an upstream patch?
[13:23:49] Yeah, pushing shortly. Will import in our package then as well
[13:26:43] <_joe_> ack, thanks
[13:26:45] <_joe_> also lol
[13:27:13] did we made a full circle now?
[13:28:43] ah, no. We need to have to patch helm-diff :)
[13:31:48] <_joe_> wait for it
[13:37:55] <_joe_> jayme: if you're working on that part, I'll get to fix our CI :)
[13:41:56] you should have specified "fix our CI" a bit. Now you own that thing :)
[13:43:17] local tests are fine, tests in circleci fail and I need to register with them to see the results...great
[13:44:19] <_joe_> :/
[13:49:00] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (CDanis) The Envoy TLS terminator is now configured to allow websocket upgrades -- however, it's improperly configu...
[14:00:48] _joe_: https://gerrit.wikimedia.org/r/c/operations/debs/helmfile/+/619471
[14:07:35] <_joe_> jayme: your failing test has nothing to do with your code
[14:07:38] <_joe_> damn it
[14:07:56] the firtst one did, though :)
[14:08:00] *first
[14:08:35] lack of gofmt .. but now it's some whatever CI error...master is failing to so I don't think it's an issue for us
[14:19:21] serviceops, Product-Infrastructure-Team-Backlog, Recommendation-API, Release-Engineering-Team, and 2 others: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (Mholloway) PI will want to track this since #recommendation-api contains a couple of endpoints we main...
[14:19:44] anyone from service ops around? mwdebug vms ran out of disk space
[14:20:01] but I would like someone with more mw experience around
[14:21:19] I'm here and interested but probably don't have the experience you're looking for :)
[14:21:51] reading back in -ops now though
[14:21:52] <_joe_> rzl: let's look together then :)
[14:21:59] so mwdebug have a different partition schema than other mw hosts
[14:22:05] it is a vm, but doesn't use lvm
[14:22:14] can I still resize the partition?
[14:22:21] I am worried this can lead to deployment errors
[14:22:29] but I am not really sure
[14:23:46] I guess I can: https://wikitech.wikimedia.org/wiki/Ganeti#Adding_disk_space
[14:24:02] huh, TIL the mwdebug hosts are vms
[14:24:10] _joe_: do you want to take over?
[14:24:20] as owner, I can be around for whatever you need
[14:25:28] <_joe_> this is a machine with a full disk that doesn't serve traffic, it's quite low-prio for me, but sure, it's just that I don't think we need to raise the disk space
[14:25:42] yeah, I am totally ok with it
[14:26:04] that is why I wanted to handle you the decision, because you will have more knowledge
[14:26:15] *hand over
[14:26:50] e.g. Re: ops, so leaving it to you
[15:20:13] <_joe_> jayme: if you're still around... I was wondering if we shouldn't change the name of our release from stable to something like wmf/stable
[15:26:24] _joe_: I am and we talked about that yesterday, no? I was saying that I did not want to make people overwrite the "magic" stable repo URL ... so yeah, I agree (and I do fear that slashes will not be supported :D)
[15:26:41] <_joe_> yeah me too
[15:26:46] <_joe_> so let's just use "wmf"
[15:26:47] <_joe_> :P
[15:27:06] that's why I choose wmf-stable on the wikitech page
[15:28:46] <_joe_> ok, let's go with wmf-stable
[15:29:15] <_joe_> so we need to modify the script on the deployment servers, and then all the helmfiles
[15:29:16] And do the transition together with the helmfile.d migration...
[15:29:23] <_joe_> which is easy I think
[15:29:36] which script you mean?
[15:29:51] <_joe_> the one that runs helm repo update
[15:30:00] <_joe_> and / or installs the repo
[15:30:28] <_joe_> anyways, I'll write the new CI checks without caring about that detail :P
[15:31:20] I can prepare the necessary puppet changes to add "wmf-stable"
[15:31:53] maybe without trailing slash this time as well :-D
[15:32:02] <_joe_> ahah yes
[15:52:30] _joe_: should be just that: https://gerrit.wikimedia.org/r/c/operations/puppet/+/619493
[15:53:12] <_joe_> jayme: ack, I am thinking of how to test the stuff in helmfile.d correctly
[15:53:16] <_joe_> and it's kinda tricky
[15:54:18] yeah....maybe just check if it renders with "helmfile template" in first place?
[15:54:45] <_joe_> the problem is... that needs an updated helm repo :)
[15:55:39] hrhr
[15:56:52] Is this just a chicken-egg *now* or are we unable to get one in CI?
[15:57:25] <_joe_> still not sure
[16:10:34] will leave for today, ttyl o/
[20:10:41] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) test comment
[20:16:19] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (mmodell) Will it blend?
[20:19:32] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (mmodell) it appears to blend.
[20:36:25] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) >>! In T238593#6376059, @CDanis wrote: > The Envoy TLS terminator is now configured to allow websocket upgr...
[20:44:09] serviceops, Operations, Phabricator, Traffic, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (Dzahn) Open→Resolved a:Dzahn We are seeing realtime notifications again and aphlict is now separated f...
[20:53:31] serviceops, Operations, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Dzahn) Checking the box for phabricator/aphlict. aphlict is now running on a dedicated VM, aphlict1001, on buster and nodejs 10....
[20:53:49] serviceops, Operations, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Dzahn)
[20:54:56] serviceops, Operations, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Dzahn) Also checking the box for etherpad. That is also on buster and nodejs10 meanwhile. Upgraded by Alex Kosiaris.
[20:55:10] serviceops, Operations, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Dzahn)
[20:58:55] aphlict is finally back and working