[07:05:01] <_joe_> akosiaris, jayme so helmfile doesn't allow to define a kubeconfig to use in helmfile.yaml, but only a kube context
[07:09:52] _joe_: *sigh ...
[07:10:07] what about patching that instead of trying to work around it?
[07:11:39] <_joe_> jayme: what about going the other way around and adopting the deploy.sh script that longma wrote, which basically just sets env variables?
[07:13:37] <_joe_> jayme: I am thinking about what our goals are - I think more or less having a more DRY set of definitions, and being able to centrally control values
[07:13:40] <_joe_> right?
[07:14:01] I've not looked into the script but thought you wanted to avoid another wrapper. But yeah, if it's the easiest way to go we should totally do it
[07:17:10] <_joe_> it is the easiest way, yes
[07:17:22] <_joe_> although we can just keep things as-is more or less
[07:18:26] _joe_: I think you are right. We need a way to define global defaults/values in one place but keep the ability for devs to overwrite that. Plus we need a clean API for developers (and us) to use for deployments
[07:19:26] And we should still allow deployment of single services (not requiring a "sync cluster", I mean)
[07:21:46] <_joe_> yes
[07:26:08] <_joe_> jayme: more importantly, because we want to do stuff like canary releases in the near future
[07:26:19] <_joe_> btw, we should start working on supporting that better.
[07:29:38] right
[07:34:08] <_joe_> do you have a working dev env for helmfile btw?
[07:34:20] <_joe_> there are some things that are not clear to me from the docs
[07:40:46] 10serviceops, 10Operations, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) @aaron @krinkle thoughts? :)
[07:40:53] what do you mean? For actually coding on helmfile?
[07:43:03] that I don't have currently
[07:50:40] <_joe_> yeah, for that. Fair enough
[08:21:54] <_joe_> akosiaris: I think you said that helmfile did support setting the --kubeconfig switch for helm, but AFAICS it doesn't
[08:21:59] <_joe_> did I miss something?
[08:22:24] _joe_: you have a minute for a "hotfix"? https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-report/+/615411/
[08:23:36] <_joe_> jayme: oh lol
[08:24:05] <_joe_> jayme: what happens if the pull fails?
[08:28:03] 404 Client Error: Not Found ("No such image: docker-registry.wikimedia.org/releng/bazel:0.4.0")
[08:43:18] _joe_: args: - "--kubeconfig /path/to/blah" in helmfile.yaml.
[08:44:01] <_joe_> akosiaris: ohhh that's great, thanks
[08:44:11] <_joe_> I hope that works with environments
[08:44:15] the kubeContext thing isn't particularly useful to us right now. Unless we restructure how we build our kubeconfig files to add more contexts
[08:44:29] <_joe_> yeah, not very useful indeed
[08:44:37] _joe_: careful with environments btw. The UX isn't particularly great
[08:44:49] <_joe_> --environment=
[08:44:56] yes, exactly.
[08:45:23] _joe_: updated the CR
[08:45:25] <_joe_> yeah, ok, that's why I was considering telling people to use the script long.ma added
[08:45:36] helmfile -e foobar sync is definitely way more involved than just helmfile sync
[08:45:37] <_joe_> jayme: thanks, will take a look in a minute
[08:48:11] <_joe_> akosiaris: what is better UX: cd eqiad/mathoid; source .hfenv; helmfile sync OR: cd mathoid; helmfile -e eqiad sync ?
[08:49:05] cd mathoid ; helmfile sync :P
[08:50:11] <_joe_> akosiaris: well how do you decide which cluster you're deploying to?
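[editor's note] A minimal sketch of the args-based --kubeconfig approach quoted at 08:43:18, assuming helmfile's top-level helmDefaults block; the kubeconfig path, chart reference and release layout are hypothetical placeholders, not the actual deployment-charts contents:

```yaml
# Hedged sketch: make every helm invocation use an explicit kubeconfig by
# passing --kubeconfig through helmfile's helmDefaults args.
helmDefaults:
  args:
    - "--kubeconfig=/etc/kubernetes/mathoid-eqiad.config"  # hypothetical path

releases:
  - name: production
    namespace: mathoid
    chart: wmf-stable/mathoid  # hypothetical chart reference
    values:
      - values.yaml
```

With something like this in place, a plain `helmfile sync` deploys against whatever cluster the kubeconfig points at, which is what makes the per-cluster-directory layout (`cd eqiad/mathoid; helmfile sync`) work without -e.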
[08:50:36] by the difference in the values files
[08:50:51] of course that moves the changes to gerrit
[08:50:55] <_joe_> uhhh that's not how it works
[08:51:12] <_joe_> the values file is interpreted after args is interpolated
[08:51:12] which isn't great UX either I guess. But it's way way more declarative
[08:51:28] <_joe_> unless I'm not getting what you are proposing
[08:52:22] I mean a common-values.yaml file, followed by a cluster-specific one, e.g. eqiad.yaml
[08:52:47] <_joe_> sure, I am talking about how to tell helmfile which cluster to work on
[08:52:55] <_joe_> and which kubeconfig to use
[08:53:16] <_joe_> how do you switch which cluster you operate on when running "helmfile sync"?
[08:53:20] btw
[08:53:23] you don't need envs
[08:53:31] you can also filter on -l (release)
[08:53:32] e.g.
[08:53:42] helmfile -l eqiad-production sync
[08:54:08] <_joe_> ok, that would mean needing to define more releases than we do now
[08:54:18] <_joe_> more repetition :)
[08:54:33] why?
[08:54:44] <_joe_> you need an eqiad-production release
[08:54:50] <_joe_> a codfw-production one
[08:54:53] <_joe_> and a staging one
[08:54:59] which is exactly what we have now
[08:55:01] <_joe_> instead of just production/staging
[08:55:03] 10serviceops: Refactor docker-report to use python3-docker - https://phabricator.wikimedia.org/T258560 (10Aklapper)
[08:55:14] actually we have 3 releases, not 2 :P
[08:55:20] and eventgate has 6
[08:55:23] <_joe_> yeah, my point is - you can use environments to reduce the releases to two
[08:55:39] <_joe_> and you can select which ones to apply based on environment
[08:55:57] yeah, but that approach is not particularly declarative, that's my main beef with it
[08:56:02] <_joe_> anyways, I'll write my strawman and let's go from there
[08:56:16] <_joe_> maybe it will convince you :)
[08:56:24] it's very easy to point out how that can end up in a state that is not easy for a 2nd dev to figure out
[08:56:45] imagine dev 1 pushing a change that bumps the version in all 3 releases
[08:56:51] then updating staging, then codfw
[08:56:58] only to find out there is an issue
[08:57:14] and they pause, trying to figure out what is going on
[08:57:20] only for the end of the day to come
[08:57:30] soooo, which version is now running in eqiad?
[08:57:58] <_joe_> so you prefer them to have to do 3 separate commits in gerrit, is what you're saying?
[08:58:30] <_joe_> that's doable exactly in the same way with environments
[08:58:59] that's a good question. Btw, up to now we haven't been touching that issue much
[08:59:10] and dev teams are all doing something on their own
[08:59:45] perhaps it's about time we come up with a best-practices and guidelines doc?
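[editor's note] To make the two layouts being debated concrete, here are hedged sketches of both, using mathoid as the example service; the release names, labels and values files are illustrative, not the actual deployment-charts contents:

```yaml
# Label-based layout (akosiaris): one release per cluster, each layering
# common values plus a cluster-specific override, selected with -l.
releases:
  - name: staging
    namespace: mathoid
    chart: wmf-stable/mathoid
    labels:
      cluster: staging
    values: [common-values.yaml, staging.yaml]
  - name: eqiad-production
    namespace: mathoid
    chart: wmf-stable/mathoid
    labels:
      cluster: eqiad
    values: [common-values.yaml, eqiad.yaml]
  - name: codfw-production
    namespace: mathoid
    chart: wmf-stable/mathoid
    labels:
      cluster: codfw
    values: [common-values.yaml, codfw.yaml]
```

helmfile's selectors take key=value pairs, so the invocation would be along the lines of `helmfile -l name=eqiad-production sync` or `helmfile -l cluster=eqiad sync`. The environments-based alternative _joe_ refers to would keep the releases down to production/staging and move the per-cluster differences into an environments block:

```yaml
# Environments-based layout (_joe_): cluster chosen at invocation time,
# e.g. `helmfile -e eqiad sync`.
environments:
  staging:
    values: [staging.yaml]
  eqiad:
    values: [eqiad.yaml]
  codfw:
    values: [codfw.yaml]
```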
[09:01:03] <_joe_> so I think we need to come up with:
[09:01:11] <_joe_> - a standard canary release procedure
[09:01:29] <_joe_> - a standard way to check the canary release status after deploying to it
[09:01:44] yup, part of my OKR for this Q
[09:02:02] <_joe_> - a set of best practices around chart management
[09:02:35] <_joe_> - a good guide on how to release manually, and give some love to deploy.sh for the people who prefer not to copy-paste 10 commands every time they do a release
[09:04:54] deploy.sh should probably not be needed if we restructure helmfile.d/services enough
[09:05:15] but +1 to the other 3
[09:26:48] 10serviceops, 10Operations, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) 05Open→03Resolved ` Jul 22 09:15:57 deneb docker-report-releng[3273]: INFO[docker-report] Building debmonitor report for docker-registry.wikimedia.org/rel...
[09:29:58] _joe_: jayme: btw I am also now graphing the limit at https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=95&fullscreen&orgId=1&from=now-1h&to=now&refresh=5m
[09:30:03] and the CPU as well
[09:30:27] yeah, saw that. Cool!
[09:30:34] <_joe_> nice!
[09:31:04] after the attempted and failed nerdsnipe of this morning I'll retry here more "officially"... we should at some point plan to do GC of obsolete images from debmonitor... :-P
[09:32:30] yes
[09:32:43] great idea, go ahead volans :D
[09:32:51] :D
[09:34:35] nice try jayme! :)
[09:36:39] <_joe_> volans: I think there was some blocker on your side back in the day, and you said you'd reap any image not refreshed for a month. Specifically, I don't think it's possible for us to get the list of images that correspond to a specific prefix from debmonitor in a machine-readable format
[09:37:04] <_joe_> we had this discussion with moritzm two weeks ago; that is what we need from debmonitor in order to do it properly
[09:37:32] <_joe_> the logic for purging will be "anything that has not been reported on for 2 weeks will be expunged"
[09:37:49] <_joe_> but I need to get the list of images debmonitor has registered
[09:38:06] <_joe_> I think it might be easier to do GC completely on the debmonitor side tbh
[09:38:31] so, the old chat we had was a PEBCAK on my side; we do GC on the debmonitor side of anything that is orphaned (binaries not referenced by any host/image, src packages without binaries, etc...)
[09:38:40] and in cascade that cleans anything not referenced anymore
[09:39:06] for the hosts we do GC in the decommissioning cookbook, which deletes the host explicitly
[09:39:22] if your GC strategy is purely based on a date, sure, we can do it on the debmonitor side quite easily
[09:39:26] 10serviceops, 10Operations, 10Patch-For-Review: Update deprecated extension names in envoy config - https://phabricator.wikimedia.org/T258140 (10JMeybohm) K8s sidecar(s) have one more: ` [2020-07-22 09:34:36.435][1][warning][misc] [source/common/protobuf/utility.cc:198] Using deprecated option 'envoy.api.v2...
[09:39:41] <_joe_> volans: if an image doesn't get a report, it's not on the registry anymore
[09:39:48] if instead it's attached to the deprecation/deletion of an old image
[09:39:52] then it's easier on your side
[09:39:53] <_joe_> or, it's not the latest version of that image
[09:40:10] the original idea was to have all the images that are in prod
[09:40:32] so what happens if you create a new version of the image but the pods are still running the old one?
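[editor's note] On the canary item from _joe_'s 09:01 list: a hedged sketch of what a canary release next to production could look like in helmfile.yaml. The extra canary.yaml override (e.g. a smaller replica count and the candidate image version) is a hypothetical illustration, not an existing file:

```yaml
releases:
  - name: production
    namespace: mathoid
    chart: wmf-stable/mathoid
    values: [common-values.yaml, eqiad.yaml]
  - name: canary
    namespace: mathoid
    chart: wmf-stable/mathoid
    values:
      - common-values.yaml
      - eqiad.yaml
      - canary.yaml  # hypothetical: fewer replicas, candidate image version
```

The candidate version would then go out with something like `helmfile -l name=canary sync`, get its status checked, and only afterwards be synced to the production release.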
[09:40:47] I just want to avoid removing images from debmonitor that are in use
[09:44:52] is the target to have everything in debmonitor that is in the registry, or everything that is actually running?
[09:45:04] <_joe_> jayme: everything that's running, in theory
[09:45:11] <_joe_> but we still don't have that info
[09:45:14] because the latter probably would not work with releng stuff
[09:45:25] <_joe_> "the latest version in the registry" is a good approximation imho
[09:45:51] <_joe_> the alternative is to tie the deletion of the image from debmonitor to expunging it from the registry
[09:45:58] <_joe_> which currently we don't do much
[09:46:47] when it's deleted from the registry it means it's not running anymore already
[09:46:50] ?
[09:47:17] not exactly... a human needs to verify :-/
[09:47:54] but in general one should verify that the image is not used anywhere before deleting it, ofc
[09:48:54] <_joe_> so the final solution is having "servicecatalog"
[09:48:59] <_joe_> my old plan
[09:50:01] ack, if we're going in that direction (hopefully automating that part) we could tie it to the deletion, but that's up to you. As an option, deleting everything older than X is very quick to implement in debmonitor if Mor.itz is ok with the potential out-of-sync with what's running
[09:50:20] we could do this for now and then improve later too
[09:53:00] <_joe_> everything older than 3 months might make sense
[09:54:02] <_joe_> as a stopgap for now, if it's simple to do
[09:54:18] that's ~250 images out of the 605 currently tracked
[09:54:25] yes it is
[09:57:30] there are definitely false positives with that approach: when I was checking on old packages from stretch-backports the other day, there was an image which hadn't been rebuilt for a year or so (IIRC zotero)
[09:58:02] but fine with me, we'll need to establish a proper process for image updates anyway, and enhance the s/r ratio until we're there
[10:00:48] <_joe_> moritzm: if an image is not rebuilt, it will be reported with every run in theory
[10:01:12] <_joe_> so the reported date should be fresh; if it's not, something is not working well
[10:01:32] <_joe_> so an image may have been built 1 year ago, but it should be reported every week tops
[10:01:59] sure, sure. But the idea is to drop images which haven't been rebuilt in the last three months, or am I misunderstanding?
[10:02:15] <_joe_> moritzm: no, just the ones not submitted to debmonitor
[10:02:24] <_joe_> so for instance
[10:02:25] ah, ok
[10:02:26] <_joe_> https://debmonitor.wikimedia.org/images/docker-registry.wikimedia.org/wikimedia/mediawiki-services-zotero:2020-02-26-152028-production
[10:02:38] <_joe_> this was built in February, but reported 2 days ago
[10:02:39] perfectly fine with me
[10:02:43] <_joe_> because it's the latest
[10:07:43] <_joe_> btw, I'm starting to think that https://github.com/GoogleContainerTools/distroless is the best way to approach the problem of building images.
[10:08:32] <_joe_> akosiaris: do we have a task about restructuring helmfile.d?
[10:08:48] _joe_: nope
[10:09:01] <_joe_> ok, writing one
[10:49:28] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10Joe)
[10:49:40] 10serviceops, 10Prod-Kubernetes, 10Release Pipeline: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (10Joe) p:05Triage→03Medium
[10:51:23] there you go: https://gerrit.wikimedia.org/r/c/operations/software/debmonitor/+/615423
[10:52:01] jayme: you see how it's done now? 1/ try to nerdsnipe someone 2/ fail miserably 3/ self-nerdsnipe :-P
[11:06:39] haha, yeah. Great! :-)
[14:23:10] <_joe_> akosiaris, jayme https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/615498/
[14:23:23] <_joe_> this is my strawman for restructuring helmfile.d
[14:24:54] <_joe_> quite untested, but I'm unsure how to test it tbh
[14:47:32] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[14:49:36] 10serviceops, 10Operations, 10Traffic, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) 05Open→03Stalled
[15:50:27] moritzm, _joe_: do you want a chance to review the debmonitor patch or should I go ahead with j.ayme's +1 and merge + make release + release?
[15:50:50] <_joe_> volans: lose the chance to nitpick you? no way
[15:51:48] having a look now
[16:03:36] <_joe_> jayme, akosiaris I'm leaving for the day but my helmfile strawman is now correct (I tested the output using `helmfile build`)
[16:03:59] <_joe_> so it's a good starting point for discussing whether we like this setup
[16:05:01] Nice. Will look at it tomorrow
[16:37:50] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo)
[16:41:48] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=api.svc.codfw.wmnet&service=LVS+api+codf...
[16:47:02] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo)
[16:48:10] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10CDanis) First occurrence was June 17th, 15:10 UTC: `Jun 17 15:10:38 icinga1001 icinga: SERVICE ALERT: api.s...
[17:30:04] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) a:03Dzahn mw2335 - mw2339 are configured as API appservers in confctl but they are regular appserv...
[19:51:51] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) So this happened whenever the check ended up talking to one of the servers in that 2335 - 2339 range....
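[editor's note] Not the contents of the strawman change above (615498), but a hedged sketch of the direction the thread points at: one helmfile.yaml per service, with the cluster picked via -e and the kubeconfig derived from the environment name. Whether helmfile's template pass exposes {{ .Environment.Name }} inside helmDefaults would need verifying against the version in use; the path scheme is hypothetical:

```yaml
environments:
  staging: {}
  eqiad: {}
  codfw: {}

helmDefaults:
  args:
    # Assumes .Environment.Name is templatable here; verify before relying on it.
    - "--kubeconfig=/etc/kubernetes/mathoid-{{ .Environment.Name }}.config"

releases:
  - name: production
    namespace: mathoid
    chart: wmf-stable/mathoid
    values:
      - common-values.yaml
      - "{{ .Environment.Name }}.yaml"  # per-cluster overrides
```

With that, `helmfile -e eqiad sync` would pick both the eqiad kubeconfig and the eqiad values, addressing _joe_'s 08:44:11 hope that --kubeconfig "works with environments".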
[19:52:38] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) 05Open→03Resolved
[19:53:09] 10serviceops, 10Operations, 10Traffic, 10observability: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn)
[19:53:13] 10serviceops, 10Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn)
[20:07:54] 10serviceops, 10Operations, 10Traffic, 10observability: monitoring for mismatched LVS realserver addresses/configurations - https://phabricator.wikimedia.org/T258648 (10CDanis)