[05:45:56] serviceops, Operations, Performance-Team: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (Marostegui) >>! In T243149#5818385, @aaron wrote: > As long as there are any health checks that hit MediaWiki in codfw that involve DB...
[06:29:47] serviceops, Release-Engineering-Team, Patch-For-Review: decommission phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T238957 (Marostegui) I have removed the users from the database
[09:21:30] o/
[09:21:43] Currently most of the services we have deployed are still node services?
[10:01:10] <_joe_> addshore: yes
[10:01:32] <_joe_> that doesn't mean you _need_ to use it
[10:27:07] yup
[10:28:16] I'm currently sat with the WMSE wikispeech people talking about their thing. And right now they have a python service, a go service and a java service that ideally they would like to get deployed
[10:33:47] <_joe_> so 3 services 3 different languages?
[10:34:59] <_joe_> 🤔
[10:35:55] <_joe_> addshore: I think for go we already have the pipeline working, not sure about python/java
[10:36:05] <_joe_> you'll need to reach out to releng
[10:36:21] <_joe_> but also, why 3 different languages? We'll have to discuss this
[10:36:29] I saw prod docker images for python and go, but not java, I'll poke releng soon
[10:36:42] <_joe_> yeah we don't have a prod image for java
[10:37:01] <_joe_> but it's easy to create
[10:37:12] <_joe_> also java programs can be built into standalone executables
[10:37:19] Yup, the Java and go bits are written by other people but needed for the whole text to speech bit
[10:37:23] ack
[10:37:56] do we have any services that currently do things such as write data to swift / sql dbs?
[10:48:06] <_joe_> yes
[10:48:08] <_joe_> MediaWiki
[10:48:41] <_joe_> sorry, I was knee-deep into puppet
[10:49:13] <_joe_> addshore: jokes aside, yes we can have services that write to swift or to a sql db
[10:49:27] <_joe_> it won't have access to the same buckets/databases as MediaWiki though ofc
[10:59:23] ;)
[10:59:25] ack!, thanks
[11:20:27] serviceops, Proton, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Profile proton memory usage for Helm chart - https://phabricator.wikimedia.org/T238830 (akosiaris) I did rerun 2 times the `num_workers=3` test. No big diff. 100 "locust users", spawned at a rate of 0.1/s. After a...
[12:51:12] serviceops, Operations, ops-eqiad: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (jijiki) @Cmjohnson @Jclark-ctr please let us know when we can have those servers online
[13:54:29] One of the things I'm trying to figure out with them _joe_ is whether the rendered voice snippets should be stored by the service, or whether that should instead be managed in mediawiki.
[13:54:47] So the thing could work something like this https://usercontent.irccloud-cdn.com/file/SmLddc5B/mscgenjs_chart%20(1).png
[13:56:23] <_joe_> I'm dumb, that thing looks extremely more complex than my head feels comfortable with :D
[13:59:06] haha, yes, green being mediawiki extension things, pink being storage, blue being the 3 services they have
[14:01:22] my gut thinks it makes the most sense to do the caching and storage within mediawiki. Not much point in reinventing all of that within this python service they have
[14:14:06] addshore: I think from an ops perspective, the most important thing is to clarify which kind of protocol is used for each call; a webservice call is not the same as an internal method call or a mysql query
[14:15:14] and what that entails in terms of access, separation of concerns, etc.
[14:26:39] serviceops, observability, Wikimedia-Incident: Add alert for app servers in prod serving outdated MediaWiki branches - https://phabricator.wikimedia.org/T242023 (WDoranWMF) a:hnowlan
[14:28:23] jynus, ack, thanks for that, I'll add protocols to that diagram :)
[14:36:42] the one says 'python' on it and that's helpful, maybe putting (java) on the marytts piece is useful, and (go) on the pronlex tab
[14:37:26] pronlex really requires sqlite? huh
[14:38:34] It requires some sort of storage
[14:39:21] New version :) https://usercontent.irccloud-cdn.com/file/Xi2RmcgE/mscgenjs_chart%20(2).png
[14:41:03] <_joe_> seriously though, 3 services 3 languages, I'm not enthusiastic
[14:41:19] I'm trying to figure out if the audio files really want to be stored forever or not, and thus the best way to remove said things
[14:41:32] _joe_: I knew you would say that ;)
[14:41:43] <_joe_> also
[14:41:50] <_joe_> 3 services for one functionality!
[14:42:04] <_joe_> nanoservices :D
[14:43:12] so, the pronlex thing does something different, but would only be needed for Arabic, so probably an initial deployment wouldn't need that thing anyway, thus only 2 services
[14:43:34] <_joe_> uhhh
[14:43:49] marytts is the actual "multilingual text-to-speech synthesis system" bit, that definitely does a thing
[14:43:50] <_joe_> so a java one and a python one
[14:44:04] <_joe_> ok so that service does text-to-speech
[14:44:09] <_joe_> what does the python api do?
[14:44:54] well, right now the python bit is the service & api in front of the other 2, that includes persistent storage of the audio files for example.
[14:45:45] I need to dive deeper into that service with them and figure out what is going on. But I guess the idea is that that service need not only be used with mediawiki.
[14:46:04] <_joe_> ok
[14:46:38] <_joe_> because otherwise, there is no point in having anything between mediawiki and marytts, I would naively imagine
[14:46:40] but if the persistence of data were to move to mediawiki, for example, then maybe the rest of the things in that API should too; not that mediawiki should be doing all of the things
[14:47:13] it seems like that's what extensions are for, if this wouldn't be used elsewhere
[14:47:30] "if this wouldn't be used elsewhere", that's probably one of the important bits
[14:47:46] but my understanding is that this python API at least initially would not be public and would only be called by mediawiki itself.
[14:48:00] the argument against that of course is, the python API is already written
[14:49:32] I'm going to try not to sound crabby, and I understand about having invested time and energy into a project and not wanting it to go to waste. But this is why consulting about the arch in advance is helpful
[14:50:32] yup, I think I need to dive into it and see exactly what it is doing in there
[14:50:53] 👍
[14:51:14] I'm really enjoying these MSCgen diagrams though :D
[14:51:27] I tested out the marytts section in a browser, and while it might be tedious to listen to long articles, it's definitely miles ahead of where the tech was even 5 years ago
[14:52:14] Yeah, it all seems like quite a nice project, but the arch is a bit sprawling it seems.
[14:57:02] <_joe_> yeah so, I'm really sorry to raise doubts when something has already been written, but this is the first time I look into it
[14:57:39] Yup, I totally understand, and don't worry, I didn't write it or see it before either :) You have the same thoughts as me
[15:01:06] serviceops, Analytics, Analytics-Kanban: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (akosiaris) > But first, I think a big source of confusion in our patches is the conflation of the word 'service'. I think you are r...
[15:02:24] How exactly does a fullsize file / image from mediawiki get requested via URLs such as https://upload.wikimedia.org/wikipedia/commons/b/bc/Comme_des_Garcons_at_the_Met_%2862473%29.jpg ?
[15:19:17] akosiaris: should I keep eventstreams.chartname then?
[15:19:19] instead of wmf.?
[15:19:42] we also really should resolve our service name issues before proceeding much further
[15:19:50] the tls templates thing also abuses release name for SERVICE_NAME
[15:20:11] _tls_helpers.tpl
[15:23:37] ah you commented! reading :)
[15:26:23] <_joe_> it doesn't "abuse" it
[15:26:39] <_joe_> it does the same that was done in other services
[15:27:26] yeah, but it is confusing for charts that are used to deploy multiple services (e.g. helmfile), as releasename is often just 'production'
[15:27:41] see also https://phabricator.wikimedia.org/T242861, we are trying to figure this out
[15:29:59] <_joe_> I read that task and I am keeping myself out of the discussion for now. I have my opinion but it can definitely wait until you two have reached a consensus
[15:30:09] hahah ok can't wait :)
[15:34:52] ottomata: yeah, it's interesting indeed. I get why that change was proposed and back then I even merged it, but it's only now that I see the repercussions of that.
[15:35:23] I was hoping to deduplicate helpers.tpl in the exact same way the _tls_helpers.tpl has been deduplicated
[15:36:45] ottomata: I'd say keep the templates as you have them for now, I'll figure it out down the path
[15:36:54] ok
[15:36:58] with eventstreams.?
[15:37:03] yup
[15:37:15] worst case scenario I might just change it in the future
[15:39:36] k
[15:54:27] serviceops, Analytics, Analytics-Kanban: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (Ottomata) > My kneejerk reaction to this is "instance of what"? of eventgate? K cool, let's figure out a different name. I like ser...
[15:56:25] serviceops, Operations, ops-codfw: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) @jijiki I have no problem with that.
[18:36:50] serviceops, Release-Engineering-Team, Patch-For-Review: decommission phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T238957 (Dzahn) Thanks Manuel !:) Production IPs removed from DNS.
[18:40:25] serviceops, Release-Engineering-Team: decommission phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T238957 (Dzahn) a:Dzahn→Jclark-ctr
[18:41:14] serviceops, Operations, Performance-Team: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (aaron) What user impact did it cause?
[18:42:06] serviceops, Release-Engineering-Team: decommission phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T238957 (Dzahn) This server has been temporarily assigned in T215335 and used in T221389. Giving it back to the pool of spares after it has served its purpose.. It has been originally purchased in...
[19:16:42] serviceops, Operations, Wikimedia-Etherpad: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (Dzahn) The following packages are used by the puppet role but so far missing on buster: * prometheus-etherpad-exporter * etherpad-lite
[19:55:27] serviceops, Operations, ops-eqiad: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (Cmjohnson)
[20:01:30] serviceops, Operations, ops-eqiad: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (Cmjohnson)
[21:25:20] serviceops, Operations, Performance-Team (Radar): Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (aaron)
[22:25:56] serviceops, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team): Jobrunner monitoring still calles /rpc/runJobs.php - https://phabricator.wikimedia.org/T243096 (Dzahn) Yes, it's possible to make the monitoring check do a POST request. It uses the `check_http` nagios (icinga) plugin and th...
[22:29:30] serviceops, Operations, Performance-Team (Radar): Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (jijiki) @aaron none at all since it was codfw. On the other hand, we were a bit alarmed because of it, since we didn't expect s...
[22:36:37] serviceops, Operations, Performance-Team (Radar): Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (Krinkle) Looks like the main action is to avoid these alarms in the future, asking a few questions (some may be obvious): * D...
[22:37:21] serviceops, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Jobrunner monitoring still calles /rpc/runJobs.php - https://phabricator.wikimedia.org/T243096 (Dzahn) See the example change above. You would just have to replace "POST_DATA" with the actual data as a s...