[02:15:32] serviceops, Release-Engineering-Team, wikitech.wikimedia.org, cloud-services-team (Kanban): Test Wikitech is still running wmf.8 (should be on wmf.11) - https://phabricator.wikimedia.org/T241251 (Andrew) Thank you @Dzahn !
[11:17:22] serviceops, Operations: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (elukey)
[11:43:15] serviceops: Create PHP 7.2.26 Wikimedia package - https://phabricator.wikimedia.org/T241224 (MoritzMuehlenhoff) Open→Resolved 7.2.26 is on apt.wikimedia.org since last week
[11:51:47] serviceops, Citoid, Operations, Release Pipeline, Services: Migrate citoid and zotero services to helm ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (Mvolz) @akosiaris is this done then?
[13:25:53] serviceops, Citoid, Operations, Release Pipeline, Services: Migrate citoid and zotero services to helmfile ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (akosiaris)
[13:27:44] serviceops, Operations, SRE-tools, docker-pkg: Report image metadata to debmonitor - https://phabricator.wikimedia.org/T241206 (Joe) Open→Resolved
[13:27:49] serviceops, Citoid, Operations, Release Pipeline, Services: Migrate citoid and zotero services to helmfile ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (akosiaris) > Helm files were added but then had to be reverted as the build no longer worked. > Addition of helm...
[13:32:29] serviceops, Operations, Release Pipeline, Release-Engineering-Team: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (Joe)
[13:33:01] serviceops, Core Platform Team Workboards (Clinic Duty Team): restrouter.svc.{eqiad,codfw}.wmnet in a failed state - https://phabricator.wikimedia.org/T242461 (akosiaris) >>! In T242461#5793618, @Pchelolo wrote: >> Since (long-term) we aim to replace all of this, is abandoning it entirely an option? > >...
[13:46:47] is https://phabricator.wikimedia.org/T242604 for the entire registry including releng?
[13:55:36] serviceops, Operations: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (MoritzMuehlenhoff)
[13:57:06] <_joe_> moritzm: indeed
[13:57:23] serviceops, Citoid, Operations, Release Pipeline, Services: Migrate citoid and zotero services to helmfile ( scap-helm is deprecated ) - https://phabricator.wikimedia.org/T233702 (Mvolz) Open→Resolved
[13:57:32] <_joe_> I'm going to add a list of images that can't even make it to debmonitor
[13:59:12] ok, when I browsed the image list last week I saw some very outdated releng images, but wasn't sure yet where to flag these :-)
[14:14:34] serviceops, Operations, Thumbor, Wikimedia-Logstash, and 2 others: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (fgiunchedi) Reviving this as part of this Q's OKRs to move services off logstash non-kafka inputs, I'll followup with patches to move to the localhost-ud...
[14:51:36] akosiaris: helllooo
[14:51:43] reading more about helmfile
[14:51:52] instead of doing a whole new release for canary
[14:52:16] would it be possible to somehow use either environments, or maybe just labels (set in helmfile.yaml) and selectors on cli
[14:52:17] ?
[14:53:02] or is the completely different 'release' necessary to be able to deploy it separately?
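(For reference while reading the exchange that follows: a minimal helmfile.yaml sketch of the alternatives being asked about here, i.e. a second release for the canary, optionally tagged with helmfile release labels so it can be targeted from the CLI. The chart reference, values file names and label key are illustrative assumptions, not the actual deployment-charts layout.)

    releases:
      - name: production              # the "main" release
        chart: wmf-stable/eventgate   # chart reference is assumed here
        values:
          - values.yaml
      - name: canary                  # a second release of the same chart
        labels:
          canary: "true"              # helmfile release label, targetable with -l/--selector
        chart: wmf-stable/eventgate
        values:
          - values.yaml
          - values-canary.yaml        # e.g. replicas: 1

(With a layout like this one can deploy everything with "helmfile apply", just the canary with "helmfile --selector canary=true apply", or group releases into an environment and run "helmfile -e env1 apply". Either way the chart itself still has to support several co-existing releases, which is what the rest of the discussion is about.)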
[14:58:28] serviceops, Operations: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `hassaleh.codfw.wmnet` - hassaleh.codfw.wmnet (**FAIL**) - Downtimed host on...
[15:18:46] serviceops, Operations: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `hassium.eqiad.wmnet` - hassium.eqiad.wmnet (**FAIL**) - Downtimed host on I...
[15:40:36] ottomata: I don't see how either would help. a) The functionality anyway needs to be added to the chart (in order for a different helm release - whether in the default env or a different env) for it to work. So whether it's in a release, a value or an env in the underlying chart it's more or less the same thing. b) envs are just groupings of releases+configuration that might not make sense to be in the default deploy (the idea is that one does helmfile -e env1). Finally the entire point of helmfile is that you avoid having to remember to pass --this_flag --that_flag to differentiate state when deploying. Do you really want to remember to pass various flags when deploying if you can have it otherwise?
[15:41:10] fwiw I was also thinking about this during the weekend. I think that we can bypass the issue with some simple changes
[15:41:58] serviceops, Operations, Patch-For-Review: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (MoritzMuehlenhoff)
[15:41:59] ahhh hm
[15:42:06] namely, add a flag that defaults to false, that we set to true for the "main" release, that allows other services to be addressed by the main's service
[15:42:30] and b) change just a bit the approach of canaryOf to use the above added functionality
[15:42:34] oh and actually add other services?
[15:42:51] then everything that has to do with state or relationship between the releases becomes a simple
[15:43:11] can_drive_others/be_driven_by_X essentially
[15:43:16] s/drive/address/
[15:43:42] the "release" label remains the same in this case
[15:44:01] and both "main" and "canaryA,B,C" each have their own
[15:44:15] which is more or less where the entire issue stemmed from
[15:44:24] yeah, i'm trying to accomplish the same too
[15:44:26] and you can match on the value of "release" on whatever you want
[15:44:28] not vary label release
[15:44:37] in order to have nice graphs etc
[15:44:39] keep it always .Release.Name
[15:44:44] but add another label to match on
[15:44:45] ?
[15:44:53] for the networkpolicy and service
[15:45:03] niah, just skip adding it to the "main's" service
[15:45:05] oh that is different than what you are suggesting
[15:45:07] just that
[15:45:15] serviceops, Operations, Patch-For-Review: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (MoritzMuehlenhoff) Open→Resolved hassium/hassaleh have been retired.
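(To make the exchange above concrete: a rough sketch, not the actual Gerrit change, of what "just skip adding the release label to the main's Service" could look like in the chart's Service template. The flag name main_release and the other value names are placeholder assumptions.)

    # templates/service.yaml (sketch)
    apiVersion: v1
    kind: Service
    metadata:
      name: {{ template "wmf.releasename" . }}
      labels:
        app: {{ template "wmf.chartname" . }}
        release: {{ .Release.Name }}
    spec:
      selector:
        app: {{ template "wmf.chartname" . }}
        {{- if not .Values.main_release }}
        # default (false): the Service only addresses pods of its own release
        release: {{ .Release.Name }}
        {{- end }}
        # with main_release: true set on the "main" release, the release label
        # is left out of the selector, so pods from canary releases carrying
        # the same app label are addressed by the main Service as well
      ports:
        - port: {{ .Values.main_app.port }}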
[15:45:16] no, no need for new labels
[15:45:30] i'm finding it really hard to reason about label: release being different than .Release.Name
[15:45:33] it is very confusing
[15:45:38] yup, me too
[15:45:47] it was flawed after all
[15:45:59] lemme wrap up the other stuff I have to do and I'll have a go at it
[15:46:11] ok, i'm having a go too...we'll see where we converge :p
[15:46:27] serviceops, Operations, Patch-For-Review: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (MoritzMuehlenhoff)
[15:46:36] serviceops, Operations, Patch-For-Review: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (MoritzMuehlenhoff)
[16:12:07] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/564054/. I'll test it in the next couple of days
[16:15:58] akosiaris: will look, i have another idea too
[16:16:22] instead of using addressed by
[16:16:29] can we just use .Release.Namespace
[16:16:33] and match the service to that?
[16:16:40] instead of .Release.Name ?
[16:16:51] label namespace: .Release.Namespace
[16:16:54] then in service
[16:16:57] matchLabels
[16:17:02] namespace: .Release.Namespace
[16:17:05] don't add namespaces in your helm charts
[16:17:08] bad idea
[16:17:08] IIUC this is already set
[16:17:09] ahhhh
[16:17:10] really why?
[16:17:19] it seems to be exactly what we want?
[16:17:26] we want a label that matches the value of
[16:17:27] not particularly
[16:17:36] services//service_name
[16:17:39] e.g., you may want to have the following
[16:17:50] mainA/canaryA, testB/canaryB
[16:17:58] which in a dev env makes sense
[16:18:03] not in production perhaps
[16:18:12] (mainA is a namespace in this example?)
[16:18:22] and since you tied it to the namespace, now you need different namespaces to test that
[16:18:29] nope, it's helm releases
[16:18:51] ah
[16:18:54] in dev we just have default namespace
[16:18:55] hmmm
[16:18:55] also, highly resist the urge to put namespaces in helm charts (at least in helm 2, I haven'
[16:18:59] ok
[16:19:00] then
[16:19:05] haven't looked into helm 3
[16:19:08] well this service name is the same thing i guess
[16:19:17] just another label that will be the same as namespace in our prod
[16:19:20] here was my attempt at that
[16:19:22] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/564052
[16:19:40] i think i am doing it without addressed_by
[16:19:55] but i do need one extra var to disable deployment of the k8s service
[16:20:01] which i don't love, but at least it is just a boolean
[16:20:20] i'm just making it so that all releases with the same service.name will be matched by the k8s service
[16:21:03] you don't really need to add it to the networkpolicy objects.
In reality all you need is to remove the "release" from the "main's" Service (and only that) and add something that the "secondary" can specify to be matched
[16:21:11] yeah
[16:21:14] we could also go the route of not even doing that last part
[16:21:22] just saw you adding canary stuff in your first patch so I did that
[16:21:27] but then we would lose the mainA/canaryA, mainB/canaryB thing
[16:21:27] i can remove from everywhere except service
[16:21:37] which I am betting you will show up and ask at some point :P
[16:21:52] no you still have it
[16:21:55] you still can add more releases
[16:21:58] you seem to ask for all the "exotic" features nobody else does
[16:21:59] they just all have the same service.name
[16:22:20] oh but you mean to have a hierarchy...
[16:22:21] hhmmmm
[16:22:24] i don't think i'll need that
[16:22:45] we follow kind of the same approach, but you just defined a value to be shared
[16:23:00] I just tie the N releases together
[16:23:04] it's not really that different
[16:24:11] but I like the | default true }} approach
[16:24:21] well, the defaulting, maybe I should add it
[16:24:27] but meeting time!
[16:24:44] aye heh i'm in meeting too
[16:25:04] ottomata: keep in mind overall I want to extract that functionality in a common shared template in order not to have many implementations of it lying around.
[16:25:07] yeah
[16:25:09] agree
[16:27:00] akosiaris: where does the service prometheus label come from? k8s prometheus exporter stuff?
[16:27:04] e.g.
[16:27:05] service_runner_request_duration_seconds_count{app="eventgate-analytics",instance="10.64.75.44:9102",job="k8s-pods",kubernetes_namespace="eventgate-analytics",kubernetes_pod_name="eventgate-analytics-797b878544-2g7r7",method="POST",pod_template_hash="797b878544",release="analytics",service="eventgate-analytics",status="ALL",uri="v1_events"}
[16:27:18] will this override that by setting it as a k8s label?
[16:27:33] ottomata: confusingly enough that's from prometheus-statsd configuration
[16:27:37] good point
[16:27:40] the exporter?
[16:27:45] yup
[16:27:49] it's in the config in the chart
[16:28:16] labels:
[16:28:16]   service: $1
[16:28:16]   uri: $2
[16:28:19] OHh yes yes
[16:28:20] right ok
[16:28:27] and $1 comes from service-runner itself
[16:28:32] yup
[16:28:34] i remember talking about this with keith
[16:28:41] which is not great btw
[16:28:50] because we need that label from service-runner native prometheus stuff too
[16:28:53] not great?
[16:28:54] maybe when the service-runner prometheus patch gets merged
[16:29:12] we can ditch that statsd-exporter and rely on native prometheus
[16:29:17] i'm using it in the eventstreams chart (TB reviewed :D)
[16:29:24] yup, it's fine for now
[16:29:28] yes, but i asked keith to make sure service was set as a label
[16:30:38] what keith? or cole
[16:30:43] sorry it is cole working on it?
[16:31:05] yes cole.
[16:31:07] sorry :p
[16:32:35] serviceops, Release-Engineering-Team: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 (Jdforrester-WMF)
[16:32:53] serviceops, Release-Engineering-Team, Release-Engineering-Team-TODO: Update Wikimedia production to PHP 7.2.26 - https://phabricator.wikimedia.org/T241222 (Jdforrester-WMF)
[16:36:38] serviceops, Operations, Packaging: package requirements for upgrading deployment_servers to buster - https://phabricator.wikimedia.org/T242480 (Dzahn) p:Triage→Low
[16:37:06] i think we should add this service name by default in all resources' labels, like we do for release.
If we are going to do multiple releases for the same helmfile 'service', it is a bit confusing to not have a label that ties them all together. wmf.releasename is a bit confusing here too
[16:37:13] for eventgate with canary
[16:37:17] it will be eventgate-canary
[16:37:21] for all eventgate services
[16:39:50] hmm, which is used for app
[16:39:59] OH in eventgate it is used for app
[16:40:44] ooo and the deployment name
[16:40:47] that's a problem
[16:40:59] it'll be eventgate-canary for all canary releases
[16:41:19] eventgate-analytics's canary deployment name would be eventgate-canary, as would eventgate-main's canary deployment
[16:42:07] maybe we can just include service.name in wmf.releasename
[16:43:24] serviceops, Core Platform Team Workboards (Clinic Duty Team): restrouter.svc.{eqiad,codfw}.wmnet in a failed state - https://phabricator.wikimedia.org/T242461 (Eevans) >>! In T242461#5797481, @akosiaris wrote: >>>! In T242461#5793618, @Pchelolo wrote: >>> Since (long-term) we aim to replace all of this,...
[16:46:29] hmm akosiaris where does the prometheus label 'app' come from? is that somehow taken from k8s prometheus scraper?
[16:46:47] it isn't in the statsd exporter config
[16:47:14] ottomata: it's a label in the chart's resources
[16:47:53] the idea being that e.g. app=eventgate, service=eventgate-analytics
[16:48:48] yes, but how does it get to prometheus? some special prometheus scraper stuff for k8s services?
[16:49:37] i'm worried that these things are going to be changed: if release.name is canary, app is going to change to eventgate-canary for all different eventgate instances out there.
[16:49:46] since wmf.releasename does not use this service.name
[16:50:03] ottomata: yup, prometheus scrapes the pods and gets the labels
[16:50:20] we can instruct it to ditch labels or add new labels
[16:50:25] ah huh interesting
[16:50:57] but hm, it adds the labels to any prometheus metrics already exported?
[16:51:13] service_runner_request_duration_seconds_count doesn't have it from service-runner or statsd exporter
[16:51:17] but it does have service
[16:51:45] if there is a label in k8s and a label from the prometheus exporter in the app itself (or statsd exporter), which label takes precedence?
[16:52:10] I don't know :-(
[16:52:18] I haven't met that yet
[16:59:37] it looks to me like 'app' was meant to be what we are doing with 'service'.
[17:00:48] wonder if we should rename app to service? hmmm
[17:59:44] akosiaris: fyi update https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/564052 a lot with some of those thoughts
[17:59:47] updated*
[18:00:07] a problem is aside from eventgate, i think we've never really considered multi deployment charts (right?)
[18:00:10] helmfile helps a lot
[18:00:37] but now we are doing multi-release + multi service instance
[18:00:55] so it kinda needs to be settled and done well, not sure if that patch does it how we should
[18:00:57] but is an idea
[18:01:00] ottomata: we do have multi deployment charts. termbox is one
[18:01:22] and up to now I've done my best to have that capability when testing
[18:01:42] i think using chartname-releasename as the main identifier for a resource is a problem.
[18:01:50] but that's good
[18:01:58] app: {{ template "wmf.chartname" . }}
[18:02:00] is not right for that though
[18:02:08] it isn't a release of a chart.
it is a release of a service (which uses a chart)
[18:02:23] nope app is meant to be app
[18:02:32] like if you were deploying an instance of apache with it
[18:02:38] it would be app: apache
[18:03:01] no matter whether the thing served by that apache is a wiki, a phpbb (ew) or joomla
[18:03:24] and the implicit assumption was that the chart name would match the application
[18:04:49] hmmm
[18:04:51] ok
[18:04:57] then we do indeed need another label
[18:05:00] like service
[18:05:12] there isn't one now that ids the service
[18:05:48] that's true
[18:06:13] we do have the one that gets added by the statsd-exporter btw, we should make sure not to mess that up
[18:06:18] in eventgate whenever I/we made it multi instance
[18:06:22] and meeting again, sorry :(
[18:06:24] i changed app to wmf.releasename
[18:06:28] which i guess was wrong at that time then
[18:06:58] we should have introduced a new service label then
[18:07:19] ok will amend patch to do that, but i think wmf.releasename needs to be changed to include service.name too
[18:07:47] https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/564054 is now functional btw
[18:41:35] also btw, i've noticed we inconsistently indent array entries in yaml
[18:41:45] got a pref?
[18:41:56] list:
[18:41:56] - value1
[18:41:56] - value2
[18:41:57] or
[18:42:05] list:
[18:42:05]   - value1
[18:42:05]   - value2
[18:55:36] akosiaris: updated https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/564052 again, i like it a bit better than yours, just because there are fewer knobs to twiddle. all active deployment releases in the same service.name will be addressed by the production release's k8s service
[18:56:03] so to add a canary, you just add another release to helmfile.yaml and tweak values accordingly
[18:56:12] in this case all we need is to reduce replicas to 1 for canary
[20:49:42] serviceops, Performance-Team, Patch-For-Review: Stack for shutdown/destruct fatals missing from php7-fatal-error.php logs - https://phabricator.wikimedia.org/T241097 (Krinkle) @Joe @jijiki Would like to deploy this soon. Ping me any time this week if you can squeeze it in so I can stand by for confir...
[20:50:35] serviceops, Performance-Team, Patch-For-Review: Stack for shutdown/destruct fatals missing from php7-fatal-error.php logs - https://phabricator.wikimedia.org/T241097 (Krinkle) a:aaron→Krinkle
[20:50:53] serviceops, Core Platform Team Workboards (Clinic Duty Team): restrouter.svc.{eqiad,codfw}.wmnet in a failed state - https://phabricator.wikimedia.org/T242461 (Pchelolo)
[20:51:23] serviceops, Core Platform Team Workboards (Clinic Duty Team): restrouter.svc.{eqiad,codfw}.wmnet in a failed state - https://phabricator.wikimedia.org/T242461 (Eevans) >>! In T242461#5798385, @Eevans wrote: >>>! In T242461#5797481, @akosiaris wrote: >>>>! In T242461#5793618, @Pchelolo wrote: >>>> Since (...
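(For the prometheus side of the discussion above: the service label on service_runner_request_duration_seconds comes from the prometheus-statsd-exporter mapping shipped in the chart's config, roughly of this shape. The match pattern below is a guess at the statsd metric layout, not the real mapping from the chart.)

    # prometheus-statsd-exporter mapping (sketch)
    mappings:
      - match: "*.*.request_duration_seconds"   # assumed statsd name: <service>.<uri>.request_duration_seconds
        name: "service_runner_request_duration_seconds"
        labels:
          service: "$1"   # first glob capture: the service-runner service name
          uri: "$2"       # second capture: the request uri/endpoint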
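(And as a rough illustration of where the thread ends up, a shared label tying together all releases of one service instance, plus a release name that includes service.name so the canaries of different eventgate instances don't collide, the shared helpers might grow along these lines. The value .Values.service.name, the wmf.labels template name and the exact string format are assumptions, not the merged change.)

    # _helpers.tpl (sketch)
    {{/* Release name that carries the service instance, so the canary release of
         eventgate-analytics renders as eventgate-analytics-canary instead of
         colliding with eventgate-main's canary. */}}
    {{- define "wmf.releasename" -}}
    {{- printf "%s-%s" (.Values.service.name | default .Chart.Name) .Release.Name -}}
    {{- end -}}

    {{/* Labels stamped on every resource: app stays the application/chart name,
         release stays .Release.Name, and a separate label identifies the service
         instance shared by all of its releases. Note the open question above about
         how a k8s "service" label interacts with the prometheus "service" label
         added by the statsd-exporter mapping. */}}
    {{- define "wmf.labels" -}}
    app: {{ template "wmf.chartname" . }}
    release: {{ .Release.Name }}
    service: {{ .Values.service.name | default .Chart.Name }}
    {{- end -}}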