[07:17:15] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10MoritzMuehlenhoff) >>! In T264991#6534293, @bd808 wrote: > Should this task be merged with {T245757} somehow? Probablyish, but the scope is a little differe...
[07:19:01] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10MoritzMuehlenhoff) >>! In T264991#6534026, @Dzahn wrote: > ii prometheus-nutcracker-exporter 0.2+nmu1 all Prometheus...
[07:21:33] 10serviceops, 10Packaging: Please provide our special component/php72 in buster-wikimedia - https://phabricator.wikimedia.org/T250515 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff >>! In T250515#6177325, @MoritzMuehlenhoff wrote: > realistically we'll only approach this once we move production to Buster. This...
[07:58:04] 10serviceops, 10Operations, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10Joe) >>! In T264991#6533992, @Legoktm wrote: >>>! In T264991#6533968, @Dzahn wrote: >> - ploticus > > {T253377} Given we're sunsetting graphoid instead, Ea...
[08:18:12] _joe_: nemo-yiannis: looks like push-notifications just needs a lot of time to get latency back down to "normal" https://grafana.wikimedia.org/d/NQO_pqvMk/push-notifications?orgId=1&from=1602201152770&to=1602403260466&var-dc=eqiad%20prometheus%2Fk8s&var-service=push-notifications
[08:19:01] <_joe_> jayme: I'd like to better understand what that means
[08:19:01] maybe it's just a matter of very low traffic with some spikes in between that make the p99 come down very slowly...
[08:19:17] <_joe_> I have no idea, but this seems like a bug
[08:19:30] yeah. It's def. weird
[08:20:11] especially because apache-bench tests do not show any requests taking >1s
[08:45:49] I checked grafana and I can see some correlation with GC activity
[08:47:45] at some point we detected a memory leak in one of the endpoints: https://phabricator.wikimedia.org/T263058
[08:58:34] but that does not really seem to correlate with the service being restarted (which should "clean up" leaking memory). Also, I would guess no one has sent an (APNS) message, as the service is just getting monitoring traffic, right?
[09:09:35] The APNS client is being instantiated on app init, that's why I thought this would be related. Since the service is not getting traffic, I will try to reproduce the issue and extract metrics locally
[09:10:16] * nemo-yiannis filing a ticket, this looks like a bug
[09:11:01] thanks
[09:18:20] 10serviceops, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: High latency on push notification service initialization - https://phabricator.wikimedia.org/T265258 (10Jgiannelos)
[10:29:23] 10serviceops, 10Operations, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) >>! In T252391#6533353, @nettrom_WMF wrote: > @kostajh : Thanks for picking this up and pinging me about it. I think...
[14:17:42] Hi all
[14:18:44] Does adding dependencies in a Helm chart mean that those dependencies will be bundled as containers running within that same pod?
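For context on the question above: declaring a dependency in a chart's Chart.yaml (requirements.yaml in Helm 2) pulls it in as a subchart, which renders its own resources (typically its own Deployment and therefore its own pods) inside the same release; it does not become an extra container in the parent chart's pod. A minimal sketch, with hypothetical chart names and versions:

```yaml
# Chart.yaml of a hypothetical "speechoid" parent chart (illustrative names and versions only)
apiVersion: v2
name: speechoid
version: 0.1.0
dependencies:
  # Each subchart renders its own Deployment/Service in the same Helm release,
  # so it runs in its own pods and can be scaled independently of the parent.
  - name: tts-synthesizer        # hypothetical CPU-heavy synthesis service
    version: 0.1.0
    repository: "file://../tts-synthesizer"
  - name: pronunciation-lexicon  # hypothetical lexicon-lookup service
    version: 0.1.0
    repository: "file://../pronunciation-lexicon"
```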
[14:20:27] If so, how would one go about setting up the Helm chart so that some dependency might be running in other pods, potentially scaling that service up and down automatically? Or is it meant that we should scale the one specific pod up and down?
[14:21:33] This is regarding the Wikispeech extension. Our backend, speechoid, is a bundle of quite a few services with rather simple dependencies.
[14:22:14] But some of the services are rather heavy on CPU, e.g. the speech synthesis.
[14:23:49] I was thinking it would make sense to scale only those services, having k8s balance the requests.
[14:25:13] Also, we have installed an HAProxy in front of one of the services to act as a request queue, only letting in one request at a time since a single request will consume 100% of the available threads. It feels like we should let k8s handle that when there are multiple instances up and running.
[14:26:43] For reference:
[14:26:45] https://gerrit.wikimedia.org/r/admin/repos/q/filter:services+wikispeech
[14:27:16] https://www.mediawiki.org/wiki/Wikispeech
[14:58:43] kalle: is there a task related to rolling out speechoid to kubernetes?
[14:59:05] it would be lovely if we could discuss those details on a task
[15:19:26] <_joe_> kalle: yeah, also, new service architectures are usually discussed with the stakeholders (including SRE) before getting to the deployment phase
[15:20:06] <_joe_> maybe that was done. In that case, can you point me to the people you spoke with?
[15:20:42] <_joe_> so that I can get a better idea of how the release was planned
[15:22:41] <_joe_> if not, we will need to take some time to advise you on how to proceed. Horizontal pod autoscaling is not a great way to spawn new workers on demand, unless we are ok with having a lot of latency for individual requests.
[15:25:18] <_joe_> what you probably want to do is return 503 to the readiness probe while your container is processing a request (or multiple requests, if we decide to serve more than one thread from the same pod)
[15:25:48] <_joe_> but again, that will probably only work with a small number of incoming requests
[15:49:36] 10serviceops, 10Wikidata, 10Wikidata Query Builder, 10Wikidata Query UI, 10User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (10Addshore)
[15:50:05] ^^ finished adding the content to the ticket relating to static sites on k8s, would appreciate questions and thoughts to be collected there now :0
[15:50:07] :)
[15:55:40] 10serviceops, 10Wikidata, 10Wikidata Query Builder, 10Wikidata Query UI, 10User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (10Addshore)
[15:56:25] 10serviceops, 10Wikidata, 10Wikidata Query Builder, 10Wikidata Query UI, 10User-Addshore: Host static sites on kubernetes - https://phabricator.wikimedia.org/T264710 (10Addshore)
[19:13:53] effie: https://phabricator.wikimedia.org/T265280
[19:15:01] _joe_: We've only talked to releng at this point, making sure they accept how we blubbered things up.
[19:15:45] So this is really the initial ops contact, working our way towards a beta-cluster release.
[20:50:17] 10serviceops, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) ^ This will make redis connection handling slightly healthier but I can't say it will handle this case as observability of ores is...
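A minimal sketch of the readiness-probe gating _joe_ describes at 15:25:18, assuming a single-threaded speechoid synthesis worker that exposes a /healthz endpoint and answers it with HTTP 503 while a request is in flight; the service name, image, port, and paths are hypothetical:

```yaml
# Illustrative Deployment fragment for a hypothetical single-threaded synthesis worker.
# Kubernetes stops routing Service traffic to a pod whose readiness probe fails, so an app
# that returns 503 on /healthz while busy roughly reproduces the one-request-at-a-time
# behaviour the HAProxy queue provides today (modulo the probe-interval lag).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tts-synthesizer
spec:
  replicas: 3                # scale horizontally instead of queueing behind one instance
  selector:
    matchLabels:
      app: tts-synthesizer
  template:
    metadata:
      labels:
        app: tts-synthesizer
    spec:
      containers:
        - name: tts-synthesizer
          image: example/tts-synthesizer:latest   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz   # app returns 503 here while a request is being processed
              port: 8080
            periodSeconds: 2   # how quickly a busy pod is taken out of rotation
            failureThreshold: 1
```

As _joe_ notes at 15:25:48, this only behaves well at low request volumes: a burst larger than the replica count still has to wait at the caller or fail.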