[07:37:33] _joe_: AIUI https://gerrit.wikimedia.org/r/c/operations/puppet/+/664313/ should be perfectly fine for our current use case, right? v3 backups we currently don't need for the configcluster and there is also no real urgency to have them for the k8s etcd's (as we are capable of re-instantiating k8s clusters without backups)
[07:38:04] <_joe_> jayme: not 100% sure if it would work out of the box
[07:38:27] I would therefore suggest we enable the backups as is for v2 on configcluster and enable v3 at a later date
[07:39:39] not necessarily later, but as a separate task
[07:40:38] _joe_: looks like it is an easy thing to test out (if it still works I mean)
[07:41:41] <_joe_> jayme: see my comment, that patch would only work well for the non-k8s clusters
[07:41:44] <_joe_> k8s uses the v3 store
[07:42:37] yeah, I know. But I think that's fine as having backups of configcluster is important and backups of k8s not that much
[07:44:20] backups of v3 store can then be implemented in a second patch (i.e. no need to rush that)
[07:45:42] <_joe_> jayme: ack. It would be nice to start bridging grpc as well, I'll look into it
[07:47:10] _joe_: the GRPC argument I did not really get. The script runs locally on the nodes anyways, no? So it could simply talk to etcd directly (not via nginx)
[07:47:31] <_joe_> jayme: this is where I introduce you to --advertise-client-urls
[07:47:49] <_joe_> basically the etcd server tells the client "contact me on port 4001 via https"
[07:48:02] <_joe_> and etcdctl is a good citizen and respects the request
[07:51:25] <_joe_> anyways ok, just merge that patch :)
[07:58:16] ah, interesting detail. That's a difference with the k8s etcd's then as they don't have etcd::tlsproxy in front
[08:01:08] <_joe_> yep
[08:14:11] serviceops, SRE: Support etcd v3 backups with ::etcd::backup - https://phabricator.wikimedia.org/T281447 (JMeybohm)
[08:18:23] serviceops, SRE, Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (jcrespo)
[08:23:24] serviceops, DC-Ops, decommission-hardware, ops-codfw: decommission conf200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T281374 (JMeybohm) >>! In T281374#7042326, @ops-monitoring-bot wrote: > - COMMON_STEPS (**FAIL**) > - **Failed to run the sre.dns.netbox cookbook**: Cumin execution fa...
[08:24:00] serviceops, DC-Ops, decommission-hardware, ops-codfw: decommission conf200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T281374 (JMeybohm)
[08:24:55] serviceops, SRE: Support etcd v3 backups with ::etcd::backup - https://phabricator.wikimedia.org/T281447 (JMeybohm) p:Triage→Medium
[08:25:30] serviceops, SRE: Support etcd v3 backups with ::etcd::backup - https://phabricator.wikimedia.org/T281447 (JMeybohm)
[08:25:33] serviceops, SRE: Support proxying to etcd v3 storage on buster or later - https://phabricator.wikimedia.org/T275600 (JMeybohm)
[08:26:26] serviceops, SRE, Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (JMeybohm) Open→Resolved
[08:56:37] hi, I have a question regarding deployment-charts: if I have a config that varies depending on the site (eqiad/codfw), is there a variable I can refer to, or should there be different values files, e.g. values-eqiad.yaml in helmfile.d/services/myservice/
[08:58:16] seeing that changeprop has values-eqiad.yaml and values-codfw.yaml, but I'm not sure I understand where/when the proper values file is picked up
[08:59:09] <_joe_> dcausse: ok so, what deployment are you working on?
[08:59:46] _joe_: a new one, chain is at: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/681497
[09:02:01] <_joe_> dcausse: ok so, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/671204/14/helmfile.d/services/rdf-streaming-updater/helmfile.yaml line 20
[09:02:18] <_joe_> this is the hierarchy of values files that will be consumed by your chart
[09:02:39] <_joe_> .Environment.Name is the name of the cluster, so "staging", "eqiad" or "codfw"
[09:02:58] oh I see
[09:03:00] <_joe_> .Release.Name is "main" everywhere
[09:03:16] <_joe_> values.yaml files are merged from top to bottom with deep merge
[09:03:40] great thanks, this helps a lot
[09:03:59] <_joe_> so you can override single values
[09:04:42] <_joe_> in your case, given you only have one release, the last of those files is practically useless right now
[09:06:31] ok makes sense, will remove it and add the values-$EnvName.yaml files, thanks!
[09:08:03] <_joe_> dcausse: I think we need to document this, but I'm not sure which page would be the best place to do so
[09:09:13] I read this https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes but not sure it's the best place
[09:10:27] <_joe_> yeah I should make a separate page I guess
[09:10:55] or https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Components#Helm perhaps
[09:13:16] all those values files and the order in which they are consumed are documented at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/README.MD
[09:19:19] <_joe_> jayme: yes, but people don't read the readme :P
[09:19:44] <_joe_> so I wanted to actually just paste that plus other stuff in wikitech
[09:19:51] _joe_: obviously, yes. No reason not to propagate it :P
[09:57:18] I totally missed the README...
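Note on the values hierarchy discussed above: the "merged from top to bottom with deep merge" behaviour can be illustrated with a small Python sketch. This is not helmfile's actual implementation; the helper names (`deep_merge`, `merge_values`) and the sample keys and values are hypothetical and only show how a later file can override a single nested value while keeping the rest.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict where values from `override` win over `base`.

    Nested dicts are merged key by key; any other value (including lists,
    in this sketch) is replaced wholesale by the overriding one.
    """
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def merge_values(layers):
    """Merge a list of values dicts from top (first) to bottom (last)."""
    merged = {}
    for layer in layers:
        merged = deep_merge(merged, layer)
    return merged


# Hypothetical file contents, standing in for values.yaml and an
# environment-specific file such as values-eqiad.yaml.
values = {"main_app": {"replicas": 2, "kafka": {"topic": "example", "brokers": "default"}}}
values_eqiad = {"main_app": {"kafka": {"brokers": "eqiad-brokers.example"}}}

merged = merge_values([values, values_eqiad])
print(merged)
```

Only `main_app.kafka.brokers` changes in the merged result; `replicas` and `topic` are inherited from the first layer, which is what makes overriding single values per environment possible.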
[10:01:26] <_joe_> dcausse: yeah, that's why I was asking where you'd have expected the information to be :)
[10:01:49] <_joe_> it means it's non-obvious as we're used to seeing docs on-wiki
[10:29:51] links from the wiki page to in-repo docs are fine...
[12:34:11] profile::etcd is unused in prod, is that an orphaned class which can be removed?
[12:38:10] <_joe_> moritzm: that and some other stuff, but we can handle that
[12:55:17] ack, I just stumbled upon it when digging up something else
[12:59:25] moritzm: _joe_: I started out with https://gerrit.wikimedia.org/r/c/operations/puppet/+/683551 but will not continue before next week
[13:02:43] ack
[14:38:05] serviceops, Prod-Kubernetes, SRE, SRE-tools: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (JMeybohm) a:JMeybohm
[18:10:52] serviceops, DC-Ops, SRE, decommission-hardware, ops-codfw: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (Papaul)
[18:12:54] serviceops, DC-Ops, SRE, decommission-hardware, ops-codfw: decommission rdb200[3456].codfw.wmnet - https://phabricator.wikimedia.org/T273140 (Papaul) Open→Resolved complete
[21:46:45] ah, another bullseye one: "rsync: command not found", somehow rsync did not get pulled in on a host that has a role that normally installs it
[21:46:51] looking why