[07:05:01] <_joe_> akosiaris, jayme so helmfile doesn't allow to define a kubeconfig to use in helmfile.yaml, but only a kube context
[07:09:52] _joe_: *sigh ...
[07:10:07] what about patching that instead of trying to work around it?
[07:11:39] <_joe_> jayme: what about going the other way around and adopting the deploy.sh script that longma wrote, which basically just sets env variables?
[07:13:37] <_joe_> jayme: I am thinking about what our goals are - I think more or less having a more DRY set of definitions, and being able to centrally control values
[07:13:40] <_joe_> right?
[07:14:01] I've not looked into the script but thought you wanted to avoid another wrapper. But yeah, if it's the easiest way to go we should totally do it
[07:17:10] <_joe_> it is the easiest way, yes
[07:17:22] <_joe_> although we can just keep things as-is more or less
[07:18:26] _joe_: I think you are right. We need a way to define global defaults/values in one place but keep the ability for devs to overwrite that. Plus we need a clean API for developers (and us) to use for deployments
[07:19:26] And we should still allow deployment of single services (not requiring a "sync cluster", I mean)
[07:21:46] <_joe_> yes
[07:26:08] <_joe_> jayme: more importantly, because we want to do stuff like canary releases in the near future
[07:26:19] <_joe_> btw, we should start working on supporting that better.
[07:29:38] right
[07:34:08] <_joe_> do you have a working dev env for helmfile btw?
[07:34:20] <_joe_> there are some things that are not clear to me from the docs
[07:40:46] 10serviceops, 10Operations, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) @aaron @krinkle thoughts? :)
[07:40:53] what do you mean? For actually coding on helmfile?
[07:43:03] that I don't have currently
[07:50:40] <_joe_> yeah, for that. Fair enough
[08:21:54] <_joe_> akosiaris: I think you said that helmfile did support setting the --kubeconfig switch for helm, but AFAICS it doesn't
[08:21:59] <_joe_> did I miss something?
[08:22:24] _joe_: you have a minute for a "hotfix"? https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-report/+/615411/
[08:23:36] <_joe_> jayme: oh lol
[08:24:05] <_joe_> jayme: what happens if the pull fails?
[08:28:03] 404 Client Error: Not Found ("No such image: docker-registry.wikimedia.org/releng/bazel:0.4.0")
[08:43:18] _joe_: args: - "--kubeconfig /path/to/blah" in helmfile.yaml.
[08:44:01] <_joe_> akosiaris: ohhh that's great, thanks
[08:44:11] <_joe_> I hope that works with environments
[08:44:15] the kubeContext thing isn't particularly useful to us right now. Unless we restructure how we build our kubeconfig files to add more contexts
[08:44:29] <_joe_> yeah, not very useful indeed
[08:44:37] _joe_: careful with environments btw. The UX isn't particularly great
[08:44:49] <_joe_> --environment=
[08:44:56] yes, exactly.
[08:45:23] _joe_: updated the CR
[08:45:25] <_joe_> yeah, ok, that's why I was considering telling people to use the script long.ma added
[08:45:36] helmfile -e foobar sync is definitely way more involved than just helmfile sync
[08:45:37] <_joe_> jayme: thanks, will take a look in a minute
[08:48:11] <_joe_> akosiaris: what is better UX: cd eqiad/mathoid; source .hfenv; helmfile sync OR: cd mathoid; helmfile -e eqiad sync ?
[08:49:05] cd mathoid ; helmfile sync :P
[08:50:11] <_joe_> akosiaris: well how do you decide which cluster you're deploying to?
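[editor's note] A minimal sketch of the args-based --kubeconfig approach quoted at 08:43:18, assuming helmfile's top-level helmDefaults block; the kubeconfig path, chart reference and release layout are hypothetical placeholders, not the actual deployment-charts contents:

```yaml
# Hedged sketch: make every helm invocation use an explicit kubeconfig by
# passing --kubeconfig through helmfile's helmDefaults args.
helmDefaults:
  args:
    - "--kubeconfig=/etc/kubernetes/mathoid-eqiad.config"  # hypothetical path

releases:
  - name: production
    namespace: mathoid
    chart: wmf-stable/mathoid  # hypothetical chart reference
    values:
      - values.yaml
```

With something like this in place, a plain `helmfile sync` deploys against whatever cluster the kubeconfig points at, which is what makes the per-cluster-directory layout (`cd eqiad/mathoid; helmfile sync`) work without -e.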
[08:50:36] by the difference in the values files
[08:50:51] of course that moves the changes to gerrit
[08:50:55] <_joe_> uhhh that's not how it works
[08:51:12] <_joe_> the values file is interpreted after args is interpolated
[08:51:12] which isn't great UX either I guess. But it's way way more declarative
[08:51:28] <_joe_> unless I'm not getting what you are proposing
[08:52:22] I mean a common-values.yaml file, followed by a cluster-specific one, e.g. eqiad.yaml
[08:52:47] <_joe_> sure, I am talking about how to tell helmfile which cluster to work on
[08:52:55] <_joe_> and which kubeconfig to use
[08:53:16] <_joe_> how do you switch which cluster you operate on when running "helmfile sync"?
[08:53:20] btw
[08:53:23] you don't need envs
[08:53:31] you can also filter on -l (release)
[08:53:32] e.g.
[08:53:42] helmfile -l eqiad-production sync
[08:54:08] <_joe_> ok, that would mean needing to define more releases than we do now
[08:54:18] <_joe_> more repetition :)
[08:54:33] why?
[08:54:44] <_joe_> you need an eqiad-production release
[08:54:50] <_joe_> a codfw-production one
[08:54:53] <_joe_> and a staging one
[08:54:59] which is exactly what we have now
[08:55:01] <_joe_> instead of just production/staging
[08:55:03] 10serviceops: Refactor docker-report to use python3-docker - https://phabricator.wikimedia.org/T258560 (10Aklapper)
[08:55:14] actually we have 3 releases, not 2 :P
[08:55:20] and eventgate has 6
[08:55:23] <_joe_> yeah, my point is - you can use environments to reduce the releases to two
[08:55:39] <_joe_> and you can select which ones to apply based on environment
[08:55:57] yeah, but that approach is not particularly declarative, that's my main beef with it
[08:56:02] <_joe_> anyways, I'll write my strawman and let's go from there
[08:56:16] <_joe_> maybe it will convince you :)
[08:56:24] it's very easy to point out how that can end up in a state that is not easy for a 2nd dev to figure out
[08:56:45] imagine dev 1 pushing a change that bumps the version in all 3 releases
[08:56:51] then updating staging, then codfw
[08:56:58] only to find out there is an issue
[08:57:14] and they pause, trying to figure out what is going on
[08:57:20] only for the end of the day to come
[08:57:30] soooo, which version is now running in eqiad?
[08:57:58] <_joe_> so you prefer them to have to do 3 separate commits in gerrit, is what you're saying?
[08:58:30] <_joe_> that's doable exactly in the same way with environments
[08:58:59] that's a good question. Btw, up to now we haven't been touching that issue much
[08:59:10] and dev teams are all doing something on their own
[08:59:45] perhaps it's about time we come up with a best-practices and guidelines doc?
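[editor's note] To make the two layouts being debated concrete, here are hedged sketches of both, using mathoid as the example service; the release names, labels and values files are illustrative, not the actual deployment-charts contents:

```yaml
# Label-based layout (akosiaris): one release per cluster, each layering
# common values plus a cluster-specific override, selected with -l.
releases:
  - name: staging
    namespace: mathoid
    chart: wmf-stable/mathoid
    labels:
      cluster: staging
    values: [common-values.yaml, staging.yaml]
  - name: eqiad-production
    namespace: mathoid
    chart: wmf-stable/mathoid
    labels:
      cluster: eqiad
    values: [common-values.yaml, eqiad.yaml]
  - name: codfw-production
    namespace: mathoid
    chart: wmf-stable/mathoid
    labels:
      cluster: codfw
    values: [common-values.yaml, codfw.yaml]
```

helmfile's selectors take key=value pairs, so the invocation would be along the lines of `helmfile -l name=eqiad-production sync` or `helmfile -l cluster=eqiad sync`. The environments-based alternative _joe_ refers to would keep the releases down to production/staging and move the per-cluster differences into an environments block:

```yaml
# Environments-based layout (_joe_): cluster chosen at invocation time,
# e.g. `helmfile -e eqiad sync`.
environments:
  staging:
    values: [staging.yaml]
  eqiad:
    values: [eqiad.yaml]
  codfw:
    values: [codfw.yaml]
```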
[09:01:03] <_joe_> so I think we need to come up with:
[09:01:11] <_joe_> - a standard canary release procedure
[09:01:29] <_joe_> - a standard way to check the canary release status after deploying to it
[09:01:44] yup, part of my OKR for this Q
[09:02:02] <_joe_> - a set of best practices around chart management
[09:02:35] <_joe_> - a good guide on how to release manually, and give some love to deploy.sh for the people who prefer not to copy-paste 10 commands every time they do a release
[09:04:54] deploy.sh should probably not be needed if we restructure helmfile.d/services enough
[09:05:15] but +1 to the other 3
[09:26:48] 10serviceops, 10Operations, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) 05Open→03Resolved ` Jul 22 09:15:57 deneb docker-report-releng[3273]: INFO[docker-report] Building debmonitor report for docker-registry.wikimedia.org/rel...
[09:29:58] _joe_: jayme: btw I am also now graphing the limit at https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=95&fullscreen&orgId=1&from=now-1h&to=now&refresh=5m
[09:30:03] and the CPU as well
[09:30:27] yeah, saw that. Cool!
[09:30:34] <_joe_> nice!
[09:31:04] after the attempted and failed nerdsnipe of this morning I'll retry here more "officially"... we should at some point plan to do GC of obsolete images from debmonitor... :-P
[09:32:30] yes
[09:32:43] great idea, go ahead volans :D
[09:32:51] :D
[09:34:35] nice try jayme! :)
[09:36:39] <_joe_> volans: I think there was some blocker on your side back in the day, and you said you'd reap any image not refreshed for a month. Specifically, I don't think it's possible for us to get the list of images that correspond to a specific prefix from debmonitor in a machine-readable format
[09:37:04] <_joe_> we had this discussion with moritzm two weeks ago; that is what we need from debmonitor in order to do it properly
[09:37:32] <_joe_> the logic for purging will be "anything that has not been reported on for 2 weeks will be expunged"
[09:37:49] <_joe_> but I need to get the list of images debmonitor has registered
[09:38:06] <_joe_> I think it might be easier to do GC completely on the debmonitor side tbh
[09:38:31] so, the old chat we had was a PEBCAK on my side; we do GC on the debmonitor side of anything that is orphaned (binaries not referenced by any host/image, src packages without binaries, etc...)
[09:38:40] and in cascade that cleans anything not referenced anymore
[09:39:06] for the hosts we do GC in the decommissioning cookbook, which deletes the host explicitly
[09:39:22] if your GC strategy is purely based on a date, sure, we can do it on the debmonitor side quite easily
[09:39:26] 10serviceops, 10Operations, 10Patch-For-Review: Update deprecated extension names in envoy config - https://phabricator.wikimedia.org/T258140 (10JMeybohm) K8s sidecar(s) have one more: ` [2020-07-22 09:34:36.435][1][warning][misc] [source/common/protobuf/utility.cc:198] Using deprecated option 'envoy.api.v2...
[09:39:41] <_joe_> volans: if an image doesn't get a report, it's not on the registry anymore
[09:39:48] if instead it's attached to the deprecation/deletion of an old image
[09:39:52] then it's easier on your side
[09:39:53] <_joe_> or, it's not the latest version of that image
[09:40:10] the original idea was to have all the images that are in prod
[09:40:32] so what happens if you create a new version of the image but the pods are still running the old one?
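[editor's note] On the canary item from _joe_'s 09:01 list: a hedged sketch of what a canary release next to production could look like in helmfile.yaml. The extra canary.yaml override (e.g. a smaller replica count and the candidate image version) is a hypothetical illustration, not an existing file:

```yaml
releases:
  - name: production
    namespace: mathoid
    chart: wmf-stable/mathoid
    values: [common-values.yaml, eqiad.yaml]
  - name: canary
    namespace: mathoid
    chart: wmf-stable/mathoid
    values:
      - common-values.yaml
      - eqiad.yaml
      - canary.yaml  # hypothetical: fewer replicas, candidate image version
```

The candidate version would then go out with something like `helmfile -l name=canary sync`, get its status checked, and only afterwards be synced to the production release.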
[09:40:47] I just want to avoid removing images from debmonitor that are in use
[09:44:52] is the target to have everything in debmonitor that is in the registry, or everything that is actually running?
[09:45:04] <_joe_> jayme: everything that's running, in theory
[09:45:11] <_joe_> but we still don't have that info
[09:45:14] because the latter probably would not work with releng stuff
[09:45:25] <_joe_> "the latest version in the registry" is a good approximation imho
[09:45:51] <_joe_> the alternative is to tie the deletion of the image from debmonitor to expunging it from the registry
[09:45:58] <_joe_> which currently we don't do much
[09:46:47] when it's deleted from the registry it means it's not running anymore already
[09:46:50] ?
[09:47:17] not exactly... a human needs to verify :-/
[09:47:54] but in general one should verify that the image is not used anywhere before deleting it, ofc
[09:48:54] <_joe_> so the final solution is having "servicecatalog"
[09:48:59] <_joe_> my old plan
[09:50:01] ack, if we're going in that direction (hopefully automating that part) we could tie it to the deletion, but that's up to you. As an option, deleting everything older than X is very quick to implement in debmonitor if Mor.itz is ok with the potential out-of-sync with what's running
[09:50:20] we could do this for now and then improve later too
[09:53:00] <_joe_> everything older than 3 months might make sense
[09:54:02] <_joe_> as a stopgap for now, if it's simple to do
[09:54:18] that's ~250 images out of the 605 currently tracked
[09:54:25] yes it is
[09:57:30] there are definitely false positives with that approach: when I was checking on old packages from stretch-backports the other day, there was an image which hadn't been rebuilt for a year or so (IIRC zotero)
[09:58:02] but fine with me, we'll need to establish a proper process for image updates anyway, and enhance the s/r ratio until we're there
[10:00:48] <_joe_> moritzm: if an image is not rebuilt, it will be reported with every run in theory
[10:01:12] <_joe_> so the reported date should be fresh; if it's not, something is not working well
[10:01:32] <_joe_> so an image may have been built 1 year ago, but it should be reported every week tops
[10:01:59] sure, sure. But the idea is to drop images which haven't been rebuilt in the last three months, or am I misunderstanding?
[10:02:15] <_joe_> moritzm: no, just the ones not submitted to debmonitor
[10:02:24] <_joe_> so for instance
[10:02:25] ah, ok
[10:02:26] <_joe_> https://debmonitor.wikimedia.org/images/docker-registry.wikimedia.org/wikimedia/mediawiki-services-zotero:2020-02-26-152028-production
[10:02:38] <_joe_> this was built in February, but reported 2 days ago
[10:02:39] perfectly fine with me
[10:02:43] <_joe_> because it's the latest
[10:07:43] <_joe_> btw, I'm starting to think that https://github.com/GoogleContainerTools/distroless is the best way to approach the problem of building images.
[10:08:32] <_joe_> akosiaris: do we have a task about restructuring helmfile.d?
[10:08:48] _joe_: nope
[10:09:01] <_joe_> ok, writing one
[10:49:28] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10Joe)
[10:49:40] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10Joe) p:05Triage→03Medium
[10:51:23] there you go: https://gerrit.wikimedia.org/r/c/operations/software/debmonitor/+/615423
[10:52:01] jayme: you see how it's done now? 1/ try to nerdsnipe someone 2/ fail miserably 3/ self-nerdsnipe :-P
[11:06:39] haha, yeah. Great! :-)
[14:23:10] <_joe_> akosiaris, jayme https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/615498/
[14:23:23] <_joe_> this is my strawman for restructuring helmfile.d
[14:24:54] <_joe_> quite untested, but I'm unsure how to test it tbh
[14:47:32] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[14:49:36] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) 05Open→03Stalled
[15:50:27] moritzm, _joe_: do you want a chance to review the debmonitor patch or should I go ahead with j.ayme's +1 and merge + make release + release?
[15:50:50] <_joe_> volans: lose the chance to nitpick you? no way
[15:51:48] having a look now
[16:03:36] <_joe_> jayme, akosiaris I'm leaving for the day but my helmfile strawman is now correct (I tested the output using `helmfile build`)
[16:03:59] <_joe_> so it's a good starting point for discussing whether we like this setup
[16:05:01] Nice. Will look at it tomorrow
[16:37:50] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo)
[16:41:48] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=api.svc.codfw.wmnet&service=LVS+api+codf...
[16:47:02] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo)
[16:48:10] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10CDanis) First occurrence was June 17th, 15:10 UTC: `Jun 17 15:10:38 icinga1001 icinga: SERVICE ALERT: api.s...
[17:30:04] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) a:03Dzahn mw2335 - mw2339 are configured as API appservers in confctl but they are regular appserv...
[19:51:51] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) So this happened whenever the check ended up talking to one of the servers in that 2335 - 2339 range....
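[editor's note] Not the contents of the strawman change above (615498), but a hedged sketch of the direction the thread points at: one helmfile.yaml per service, with the cluster picked via -e and the kubeconfig derived from the environment name. Whether helmfile's template pass exposes {{ .Environment.Name }} inside helmDefaults would need verifying against the version in use; the path scheme is hypothetical:

```yaml
environments:
  staging: {}
  eqiad: {}
  codfw: {}

helmDefaults:
  args:
    # Assumes .Environment.Name is templatable here; verify before relying on it.
    - "--kubeconfig=/etc/kubernetes/mathoid-{{ .Environment.Name }}.config"

releases:
  - name: production
    namespace: mathoid
    chart: wmf-stable/mathoid
    values:
      - common-values.yaml
      - "{{ .Environment.Name }}.yaml"  # per-cluster overrides
```

With that, `helmfile -e eqiad sync` would pick both the eqiad kubeconfig and the eqiad values, addressing _joe_'s 08:44:11 hope that --kubeconfig "works with environments".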
[19:52:38] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) 05Open→03Resolved
[19:53:09] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn)
[19:53:13] 10serviceops, 10Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn)
[20:07:54] 10serviceops, 10Operations, 10Traffic, 10observability: monitoring for mismatched LVS realserver addresses/configurations - https://phabricator.wikimedia.org/T258648 (10CDanis)