[07:18:52] welcome back jayme. I'll have a look!
[07:21:05] thanks :)
[07:57:03] I'm caught up after a couple of days of OOO. can you point me to a gerrit URL showing the failure?
[08:08:39] > You may run rake check_deployments[diff,'dse-k8s-services/postgresql-airflow-analytics-product'] to quickly reproduce
[08:08:39] Sigh, seems like I'm not caffeinated enough
[08:09:03] won't be very helpful though as it just says 'Template did not render correctly (HEAD of origin/master).'
[08:09:07] https://integration.wikimedia.org/ci/job/helm-lint/24562/consoleFull
[08:16:22] I have an idea what might be causing this: we have isolated the PG releases in their own helmfile, itself with an associated kubeconfig file owned by root:root, to avoid accidental deletion by an unprivileged user
[08:16:35] so there might be a missing yaml file in CI, that we need to download
[08:18:39] that sounds weird. Download from where?
[08:20:33] not sure, I'm struggling to run the rake tasks atm
[08:21:20] either in or out of docker, nothing seems to work
[08:23:44] alright, I think I should be able to reproduce. I'll report back when I know more
[08:24:40] hmm, it seems to work, with a diff regarding a newline
[08:24:48] would that newline diff fail the job though?
[08:26:12] nvm me. I had a rough night, I can't seem to be reading correctly this morning.
[08:26:12] > +Template did not render correctly (HEAD of local branch).
[08:36:12] hmm, it's difficult to debug this without any additional debug information
[08:42:46] when running `rake "check_deployments[diff,dse-k8s-services/postgresql-airflow-analytics-product]"`, I'm seeing
[08:42:46] helmfile lint output:
[08:42:46] ----------------
[08:42:46] err: no releases found that matches specified selector() and environment(aux-k8s-eqiad), in any helmfile
[08:44:32] I don't see that one :)
[08:45:26] do you see anything more than "Template did not render correctly" ?
[08:46:02] no, unfortunately the CI does not output the actual error. CI will do something like 'helmfile -e dse-k8s-eqiad template' on both git revisions
[08:46:23] execution error at (cloudnative-pg-cluster/templates/cluster.yaml:96:5): The s3.accessKey and s3.secreyKey values were not provided
[08:46:56] ok so that indicates a missing secret file
[08:46:57] is what I get. So I would assume you need to provide some (additional) fixtures
[08:47:39] that's what I meant by "missing some YAML file that we might need to download"
[08:48:30] the file itself is /etc/helmfile-defaults/private/dse-k8s_services/postgresql-airflow-platform-eng/{{ .Environment.Name }}.yaml
[08:49:12] brouberol@deploy1003:~$ sudo cat /etc/helmfile-defaults/private/dse-k8s_services/postgresql-airflow-platform-eng/dse-k8s-eqiad.yaml
[08:49:12] ---
[08:49:12] s3:
[08:49:12]   accessKey: XXX
[08:49:12]   secretKey: XXX
[08:49:12] I think you can provide those values in .fixtures.yaml
[08:49:52] nice, I'll whip up a patch for that
[08:49:58] how did you get the error message btw?
[08:50:01] so helmfile.d/dse-k8s-services/postgresql-airflow-analytics-product/.fixtures.yaml
[08:50:16] I changed CI code, which forces a 'rake all' run
[08:50:19] I got lost in rake/ruby, which I'm not super familiar with
[08:50:53] aah, we had these .fixtures.yaml files, we just didn
[08:50:59] o/
[08:51:04] 't port them to the new helmfile PG dir
[08:51:09] ./facepalms
[08:51:15] as FYI I started a chain of changes to upgrade all charts to mesh.configuration:1.13
[08:51:18] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144454
[08:51:21] first batch of 20
[08:51:32] so if you need to do something similar, please sync with me first :D
[09:03:53] jayme https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144477 seems to work
[09:06:38] brouberol: the diff is expected because the pg setup moves from the airflow deployments to the postgres-airflow ones?
[09:14:40] yep
[09:15:01] cool. +1 then
[09:15:02] thanks
[09:15:10] we separated the airflow and PG deployments, to harden the permissions on the airflow kubeconfig files, to make them only deployable/deletable by SREs
[09:15:16] thanks for the help!
[11:21:08] brouberol: dse-k8s-services/airflow-wmde/dse-k8s-eqiad seems to still be broken
[11:21:18] https://integration.wikimedia.org/ci/job/helm-lint/24687/console
[11:39:44] hmm, that's odd
[11:40:34] oh, I see why. I'll send a patch to btullis
[11:43:36] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144514 was merged. You should be fine after a rebase jayme
[12:07:04] brouberol: o/ lemme know if https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144454 is ok when you have a moment :)
[12:07:22] looking
[12:07:24] I'll use it as first real test if you are ok (and if so, lemme know if we can roll it out)
[12:07:52] the new config auto injects a custom histogram config for all the envoys basically
[12:08:07] so we can reduce what we ingest on Prometheus
[12:08:36] this is only changing the statsd->prom exporter config for envoy itself, right?
[12:10:31] the main change yes, but it may bring more due to the module update
[12:10:57] ah no wait it is not statsd->prom, it is related to envoy's histogram bucket config
[12:11:25] the rest in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144454/1/charts/airflow/templates/vendor/mesh/configuration_1.13.0.tpl is basically what is already running elsewhere
[12:12:11] in the diff it is under "histogram_bucket_settings"
[12:51:09] yep, that looks good to me, in the sense that I trust you on the histogram config and I'm not seeing any airflow config being affected
[12:51:23] do you want to try this out on a specific airflow instance?
[12:52:55] ideally yes, no idea what's best etc..
[12:53:57] I can checkout your patch locally and deploy it on airflow-test-k8s if you want
[12:57:24] that would be great thanks!
[12:57:47] the test is to check whether the envoy metrics have the histogram buckets stated or not
[12:59:45] sure, let me do this right now
[13:03:09] <3
[13:10:50] I have to perform a bit of maintenance in that ns, I'll ping you when I get to deploying the patch
[13:11:08] yes please but if you have time, I didn't mean to brutally nerd snipe you :D
[13:11:31] np!
[13:20:23] hmm somehow, I'm only seeing a diff related to the chart version
[13:22:29] when it happens to me, I just run puppet (that forces some gz creation of new chart's versions etc..)
[13:22:46] in this case, we didn't merge so maybe it doesn't work
[13:22:57] it should be ok to merge and then test directly in my opinion
[13:24:43] I tweaked the helmfile so that it would use a locally checked out version of the chart with your changes in it, so that should work
[13:25:01] but sure, if you want to merge and deploy, I'm not seeing any change atm, so I'm ok with that!
[13:49:00] merged!
[13:52:53] alright, I'll deploy
[13:53:41] ok, I'm indeed seeing a config change this time
[13:54:44] aaand it's deployed
[13:56:34] nice!
[13:56:54] what namespace? I'll check the envoy metrics after some meetings
[14:40:24] airflow-test-k8s
[14:43:12] yep just tested on dse-worker 1009, it seems to be working! I'll ask Filippo to confirm
[14:44:56] nice!
[14:47:23] Filippo confirms, it is safe to be deployed in other airflows! Thanks a lot
[14:49:31] anytime :)
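
Note on the check mentioned at 12:57:47 ("whether the envoy metrics have the histogram buckets stated or not"): the log does not show how the metrics were inspected. One possible way, sketched here under assumptions, is to take the Prometheus text-format output scraped from an envoy and list the `le` boundaries of every `*_bucket` series. The function names, the sample metric name and the bucket values below are all hypothetical; the real envoy metric names and the values configured under "histogram_bucket_settings" are not in this log.

```python
import re
from collections import defaultdict

# Matches Prometheus text-format histogram bucket samples such as:
#   some_metric_bucket{label="x",le="0.5"} 12
# Simplification: assumes label values do not contain a closing brace.
BUCKET_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)_bucket\{(?P<labels>[^}]*)\}\s+\S+'
)
LE_LABEL = re.compile(r'(?:^|,)\s*le="(?P<le>[^"]+)"')


def bucket_boundaries(metrics_text):
    """Return {histogram name: sorted list of finite `le` boundaries}."""
    found = defaultdict(set)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = BUCKET_LINE.match(line)
        if not m:
            continue  # not a histogram bucket sample
        le = LE_LABEL.search(m.group("labels"))
        if le and le.group("le") != "+Inf":
            found[m.group("name")].add(float(le.group("le")))
    return {name: sorted(values) for name, values in found.items()}


def has_expected_buckets(metrics_text, expected):
    """True if every histogram found uses exactly the expected boundaries."""
    per_metric = bucket_boundaries(metrics_text)
    wanted = sorted(float(v) for v in expected)
    return bool(per_metric) and all(b == wanted for b in per_metric.values())


# Hypothetical sample scrape; not taken from a real envoy.
SAMPLE = """\
# TYPE demo_rq_time histogram
demo_rq_time_bucket{cluster="a",le="0.5"} 1
demo_rq_time_bucket{cluster="a",le="5"} 3
demo_rq_time_bucket{cluster="a",le="+Inf"} 3
demo_rq_time_sum{cluster="a"} 4.2
demo_rq_time_count{cluster="a"} 3
"""

if __name__ == "__main__":
    print(bucket_boundaries(SAMPLE))
    print(has_expected_buckets(SAMPLE, [0.5, 5]))
```

Comparing the boundaries before and after a deploy like the one at 13:54:44 would show whether the custom bucket config was picked up; the implicit `+Inf` bucket is ignored on purpose since it is always present.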