[07:17:15] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10MoritzMuehlenhoff) >>! In T264991#6534293, @bd808 wrote: > Should this task be merged with {T245757} somehow? Probablyish, but the scope is a little differe...
[07:19:01] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10MoritzMuehlenhoff) >>! In T264991#6534026, @Dzahn wrote: > ii prometheus-nutcracker-exporter 0.2+nmu1 all Prometheus...
[07:21:33] 10serviceops, 10Packaging: Please provide our special component/php72 in buster-wikimedia - https://phabricator.wikimedia.org/T250515 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff >>! In T250515#6177325, @MoritzMuehlenhoff wrote: > realistically we'll only approach this once we move production to Buster. This...
[07:58:04] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Joe) >>! In T264991#6533992, @Legoktm wrote: >>>! In T264991#6533968, @Dzahn wrote: >> - ploticus > > {T253377} Given we're sunsetting graphoid instead, Ea...
[08:18:12] _joe_: nemo-yiannis: looks like push-notifications just needs a lot of time to get latency back down to "normal" https://grafana.wikimedia.org/d/NQO_pqvMk/push-notifications?orgId=1&from=1602201152770&to=1602403260466&var-dc=eqiad%20prometheus%2Fk8s&var-service=push-notifications
[08:19:01] <_joe_> jayme: I'd like to better understand what that means
[08:19:01] maybe it's just a matter of very low traffic with some spikes in between that make the p99 come down very slowly...
[08:19:17] <_joe_> I have no idea, but this seems like a bug
[08:19:30] yeah. It's def. weird
[08:20:11] especially because apache-bench tests do not show any requests taking >1s
[08:45:49] I checked grafana and I can see some correlation with GC activity
[08:47:45] at some point we detected a memory leak in one of the endpoints: https://phabricator.wikimedia.org/T263058
[08:58:34] but that does not really seem to correlate with the service being restarted (which should "clean up" leaking memory). Also, I would guess no one has sent an (APNS) message, as the service is just getting monitoring traffic, right?
[09:09:35] The APNS client is being instantiated on app init, that's why I thought this would be related. Since the service is not getting traffic, I will try to reproduce the issue and extract metrics locally
[09:10:16] * nemo-yiannis filing a ticket, this looks like a bug
[09:11:01] thanks
[09:18:20] 10serviceops, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: High latency on push notification service initialization - https://phabricator.wikimedia.org/T265258 (10Jgiannelos)
[10:29:23] 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) >>! In T252391#6533353, @nettrom_WMF wrote: > @kostajh : Thanks for picking this up and pinging me about it. I think...
[14:17:42] Hi all
[14:18:44] Does adding dependencies in a Helm chart mean that those dependencies will be bundled as containers running within that same pod?
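For context on the question above: declaring a dependency in a chart's Chart.yaml (requirements.yaml in Helm 2) pulls it in as a subchart, which renders its own resources (typically its own Deployment and therefore its own pods) inside the same release; it does not become an extra container in the parent chart's pod. A minimal sketch, with hypothetical chart names and versions:

```yaml
# Chart.yaml of a hypothetical "speechoid" parent chart (illustrative names and versions only)
apiVersion: v2
name: speechoid
version: 0.1.0
dependencies:
  # Each subchart renders its own Deployment/Service in the same Helm release,
  # so it runs in its own pods and can be scaled independently of the parent.
  - name: tts-synthesizer        # hypothetical CPU-heavy synthesis service
    version: 0.1.0
    repository: "file://../tts-synthesizer"
  - name: pronunciation-lexicon  # hypothetical lexicon-lookup service
    version: 0.1.0
    repository: "file://../pronunciation-lexicon"
```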
[14:20:27] If so, how would one go about setting up the Helm chart so that some dependency might be running in other pods, potentially scaling that service up and down automatically? Or is it meant that we should scale the one specific pod up and down?
[14:21:33] This is regarding the Wikispeech extension. Our backend, speechoid, is a bundle of quite a few services with rather simple dependencies.
[14:22:14] But some of the services are rather heavy on CPU, e.g. the speech synthesis.
[14:23:49] I was thinking it would make sense to scale only those services, having k8s balance the requests.
[14:25:13] Also, we have installed an HAProxy in front of one of the services to act as a request queue, only letting in one request at a time since a single request will consume 100% of the available threads. It feels like we should let k8s handle that when there are multiple instances up and running.
[14:26:43] For reference:
[14:26:45] https://gerrit.wikimedia.org/r/admin/repos/q/filter:services+wikispeech
[14:27:16] https://www.mediawiki.org/wiki/Wikispeech
[14:58:43] kalle: is there a task related to rolling out speechoid to kubernetes?
[14:59:05] it would be lovely if we could discuss those details on a task
[15:19:26] <_joe_> kalle: yeah, also, new service architectures are usually discussed with the stakeholders (including SRE) before getting to the deployment phase
[15:20:06] <_joe_> maybe that was done. In that case, can you point me to the people you spoke with?
[15:20:42] <_joe_> so that I can get a better idea of how the release was planned
[15:22:41] <_joe_> if not, we will need to take some time to advise you on how to proceed. Horizontal pod autoscaling is not a great way to spawn new workers on demand, unless we are ok with having a lot of latency for individual requests.
[15:25:18] <_joe_> what you probably want to do is return 503 to the readiness probe while your container is processing a request (or multiple requests, if we decide to serve more than one thread from the same pod)
[15:25:48] <_joe_> but again, that will probably only work with a small number of incoming requests
[15:49:36] 10serviceops, 10Wikidata, 10Wikidata Query Builder, 10Wikidata Query UI, 10User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (10Addshore)
[15:50:05] ^^ finished adding the content to the ticket relating to static sites on k8s, would appreciate questions and thoughts to be collected there now :0
[15:50:07] :)
[15:55:40] 10serviceops, 10Wikidata, 10Wikidata Query Builder, 10Wikidata Query UI, 10User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (10Addshore)
[15:56:25] 10serviceops, 10Wikidata, 10Wikidata Query Builder, 10Wikidata Query UI, 10User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (10Addshore)
[19:13:53] effie: https://phabricator.wikimedia.org/T265280
[19:15:01] _joe_: We've only talked to releng at this point, making sure they accept how we blubbered things up.
[19:15:45] So this is really the initial ops contact, working our way towards a beta-cluster release.
[20:50:17] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) ^ This will make redis connection handling slightly healthier but I can't say it will handle this case as observability of ores is...
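A minimal sketch of the readiness-probe gating _joe_ describes at 15:25:18, assuming a single-threaded speechoid synthesis worker that exposes a /healthz endpoint and answers it with HTTP 503 while a request is in flight; the service name, image, port, and paths are hypothetical:

```yaml
# Illustrative Deployment fragment for a hypothetical single-threaded synthesis worker.
# Kubernetes stops routing Service traffic to a pod whose readiness probe fails, so an app
# that returns 503 on /healthz while busy roughly reproduces the one-request-at-a-time
# behaviour the HAProxy queue provides today (modulo the probe-interval lag).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tts-synthesizer
spec:
  replicas: 3                # scale horizontally instead of queueing behind one instance
  selector:
    matchLabels:
      app: tts-synthesizer
  template:
    metadata:
      labels:
        app: tts-synthesizer
    spec:
      containers:
        - name: tts-synthesizer
          image: example/tts-synthesizer:latest   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz   # app returns 503 here while a request is being processed
              port: 8080
            periodSeconds: 2   # how quickly a busy pod is taken out of rotation
            failureThreshold: 1
```

As _joe_ notes at 15:25:48, this only behaves well at low request volumes: a burst larger than the replica count still has to wait at the caller or fail.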