[06:46:56] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Ladsgroup) My 2c, I would really appreciate having some sort of support for usecases like {T240884} while we are building something lik...
[07:06:57] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) From what I understand, the service will run inside the sandbox, and eval() can be done in this service with relative safety....
[08:53:51] serviceops, Operations, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Joe) Hi, sorry for not adding some comments earlier, I was busy with the aftermath of an UBN! task. Let me list some of the characteris...
[09:06:12] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (Joe) Open→Resolved Reporting here in brief: * We confirmed the problem had to do with activating firejail for all...
[09:07:06] serviceops, Operations, Platform Engineering, Wikidata, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (jijiki)
[09:21:30] _joe_: you have an opinion on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/619437/5#message-4b7e555825e839338a13bddef88d5308c9a4ea70 (staging used as name of release and environment).
As we're in an early stage of migration we can still change that without issues I guess
[09:21:45] <_joe_> yes
[09:22:27] the same name could also lead to unclear overwrite situations with values.yaml files I guess
[09:22:38] <_joe_> yes
[09:23:57] <_joe_> so I fully agree that we should change the name of the release
[09:24:07] <_joe_> and also that it should be the same in all clusters
[09:24:19] <_joe_> so my proposal is to have the following names for releases:
[09:24:40] <_joe_> - test for test releases, only on staging, that would also include a separate service definition
[09:24:56] <_joe_> - canary for the canary part of the release
[09:25:20] <_joe_> - main for the main release
[09:25:40] <_joe_> main should be present everywhere, canary is optional
[09:26:22] <_joe_> uhm we should discuss this in a task :)
[09:26:34] yeah :)
[09:29:39] <_joe_> are you writing it? Else I'll do it
[09:29:53] I'll do
[09:30:47] Problem I see is that changing the name of the release will lead to a short downtime in environment != staging :-/
[09:41:05] <_joe_> jayme: well, there are various ways to work with that
[09:41:29] <_joe_> one is to depool the service on a cluster, do the transition, repool it
[09:43:28] _joe_: sure. Just wanted to mention as it makes the migration a bit more complex
[09:44:18] if only we had a cookbook for depooling services on a cluster :-)
[09:45:54] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (JMeybohm) Current state is that we have 3 environments (staging, codfw, eqiad) where we usually deploy 1 to N releases each: - productio...
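The naming rules proposed at 09:24 (test only on staging, main present everywhere, canary optional) can be written down as a small check. This is only a sketch of those rules as stated in the chat: the function name `check_releases` and the input shape (a mapping from environment name to a list of release names) are assumptions made for illustration, not part of helmfile or the deployment-charts repository.

```python
# Hypothetical validator for the release-naming proposal discussed above.
# Only the three names and their rules come from the chat; everything else
# (function name, input shape, message wording) is illustrative.
ALLOWED = {"test", "canary", "main"}


def check_releases(releases_by_env):
    """Return a list of problems; an empty list means the layout fits the proposal."""
    problems = []
    for env, releases in sorted(releases_by_env.items()):
        names = set(releases)
        # Any name outside test/canary/main is not part of the proposal.
        for name in sorted(names - ALLOWED):
            problems.append(f"{env}: unknown release name {name!r}")
        # "main should be present everywhere"
        if "main" not in names:
            problems.append(f"{env}: missing 'main' release")
        # "test for test releases, only on staging"
        if "test" in names and env != "staging":
            problems.append(f"{env}: 'test' is only allowed on staging")
        # canary is optional, so nothing to check for it.
    return problems
```

A layout that still uses `staging` as a release name would be reported as an unknown name, which is the situation the Gerrit comment was about.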
[09:46:21] serviceops, Operations, conftool: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Joe)
[09:46:32] serviceops, Operations, conftool: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Joe) p:Triage→High
[09:57:02] serviceops, Operations, Traffic, conftool: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Joe) Adding traffic as their systems are the ones affected.
[10:38:09] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (Ottomata) +1 to the proposal. > The downside of renaming the release for production environment is that we need to deal with a bit of d...
[10:40:10] serviceops, Prod-Kubernetes, Release Pipeline, Patch-For-Review: Refactor our helmfile.d dir structure for services - https://phabricator.wikimedia.org/T258572 (Joe) If we want to adopt it, it should be documented in the helmfile.d README I think, and probably on wikitech as well. +1 from me.
[11:52:57] serviceops, Operations, SRE-tools: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (JMeybohm) The pool/depool logic is quite the same as in `sre.discovery.pool/depool` I guess. For checking the service availability on some o...
[12:11:02] serviceops, Operations, SRE-tools: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (Joe) >>! In T260663#6399630, @JMeybohm wrote: > The pool/depool logic is quite the same as in `sre.discovery.pool/depool` I guess. > > For c...
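The depool, transition, repool sequence suggested at 09:41:29 might look roughly like the sketch below. It is not the cookbook from T260663 and does not use any real spicerack or conftool API: `set_pooled` and `transition` are hypothetical callables supplied by the caller, and the choice to leave the service depooled when the transition fails is an assumption, not something agreed in the chat.

```python
# Hypothetical ordering sketch: depool a service on one cluster, run the
# release rename (or any other transition), then repool.
def transition_while_depooled(service, cluster, set_pooled, transition):
    """Run transition(service, cluster) while the service is depooled on cluster.

    set_pooled(service, cluster, pooled) is an assumed callback that changes
    the pooled state; it stands in for whatever tooling actually does that.
    """
    # Take the service out of rotation on this cluster first.
    set_pooled(service, cluster, False)
    # If this raises, the exception propagates and the service stays
    # depooled, so a half-finished transition never receives traffic.
    transition(service, cluster)
    # Only a successful transition is put back into rotation.
    set_pooled(service, cluster, True)
```

Leaving a failed cluster depooled trades capacity for safety; a real cookbook would probably also check service availability before repooling, as the task comments hint.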
[15:16:34] do I understand correctly that I can't really do SSL for staging->staging connections in k8s, because the certificates are signed for, say, eventgate-analytics.discovery... and I'm going via staging.svc... ?
[15:20:16] cc _joe_
[15:21:03] <_joe_> Pchelolo: we can change that though, create a cert specifically for staging
[15:21:15] <_joe_> but yes, staging generally has that problem
[15:21:31] it's not really necessary, I'm just verifying my understanding of the problem
[15:21:40] I'll maybe create a ticket to support it
[15:21:48] but don't need it right now. thank you
[15:43:44] serviceops, Operations, SRE-tools: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (JMeybohm) a:JMeybohm Claiming as I would like us to have this for helmfile migration.
[15:44:52] Pchelolo: what are you trying to do? just test staging?
[15:45:11] ottomata: yeah, test fluent-bit -> eventgate with TLS in staging
[15:45:35] ah from an actual app, not just on the cli with curl hm
[15:45:45] yeah..
[15:46:07] we do not do TLS in staging for change-prop -> eventgate either
[15:47:02] Pchelolo: i guess maybe we should just leave an http port open for all services in staging?
[15:47:11] ottomata: it already is
[15:47:19] oh
[15:47:33] i knew a couple were still open
[15:47:51] ottomata: I mean, yesterday when I produced an event - that was over http
[15:48:01] different port right
[15:48:02] ?
[15:48:11] yup
[15:48:13] (i'm on a train and internet is not great to look it up myself)
[15:48:14] yea
[15:48:53] ah ya, eg main and analytics have the http port still open
[15:48:59] 35192 vs 4592
[15:49:07] https://wikitech.wikimedia.org/wiki/Service_ports
[15:49:47] for now will switch back and file a task
[15:52:08] <_joe_> it would be nice if staging had its own way to use those same names
[15:52:46] <_joe_> like if we had a special jumpbox which had a whole separate dns zone...
or if we just added a .staging.wmnet subzone
[15:53:28] serviceops, Kubernetes: Support TLS for service-to-service communication in k8s staging - https://phabricator.wikimedia.org/T260917 (Pchelolo)
[16:09:31] a staging.wmnet zone makes sense to me! then we could just add that to the SAN and not worry about it
[16:12:03] rzl: note that the dumps will continue to run in eqiad after the sqitchover, using the eqiad dbs. there's nothing to be done about it; there is no codfw dumps cluster, nor are there plans for there to ever be one
[16:12:09] *switchover
[16:24:42] apergos: understood, thanks
[16:26:12] I'm making a list of everything that's *not* moving, partly for our own risk analysis and partly because dcops wants a list of eqiad machines that will still be serving -- will make sure the dumps are included
[16:33:46] rzl: has gerrit been mentioned one way or the other?
[16:34:02] gerrit is not moving afaik
[16:35:31] ok. converting the replica is possible but not zero effort
[16:36:23] then i have a couple static things like releases and people and miscweb which can totally fail-over but just need a Hiera change which changes the source/dest of rsyncs and then DNS CNAME but that should be it
[19:39:41] serviceops, Operations, Performance-Team, Patch-For-Review, Sustainability (Incident Followup): Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (jijiki) a:RLazarus→None
[19:39:57] serviceops, Operations, Performance-Team, Patch-For-Review, Sustainability (Incident Followup): Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (jijiki) a:jijiki
[21:22:22] serviceops, Operations, Product-Infrastructure-Team-Backlog, Push-Notification-Service: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973
(jijiki)
[23:11:54] serviceops, Performance-Team: decom tungsten - https://phabricator.wikimedia.org/T260395 (Dzahn)
[23:12:36] serviceops: Decommission mw[2135-2214].codfw.wmnet - https://phabricator.wikimedia.org/T260654 (Dzahn) a:Dzahn
[23:13:50] serviceops, DC-Ops, Operations, Performance-Team, ops-eqiad: decom tungsten - https://phabricator.wikimedia.org/T260395 (Dzahn) a:Dzahn→Cmjohnson
[23:13:56] serviceops, DC-Ops, Operations, Performance-Team, ops-eqiad: decom tungsten - https://phabricator.wikimedia.org/T260395 (Dzahn)
[23:17:21] serviceops, Operations, Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (Dzahn) p:Triage→Medium
[23:17:44] serviceops, DC-Ops, Operations, ops-eqiad, Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (Dzahn)
[23:18:15] serviceops, DC-Ops, Operations, ops-eqiad, Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (Dzahn) a:Cmjohnson→None
[23:18:41] serviceops, DC-Ops, Operations, ops-eqiad, Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (Dzahn) I am not sure if I am supposed to directly assign to people now or keep just using the ops- tag as before. Looks like the words on the decom template got...