[07:03:21] 10serviceops, 10Operations: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10jijiki) p:05Triage→03Normal
[07:03:41] 10serviceops, 10Operations: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10jijiki)
[07:03:47] 10serviceops, 10Operations: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10Marostegui) I power cycled it from the idrac (it was totally stuck)
[07:53:37] 10serviceops, 10Operations: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10MoritzMuehlenhoff) a:03Cmjohnson The server has broken memory (and warranty expires in a month): ` Record: 43 Date/Time: 03/10/2019 07:53:15 Source: system Severity: Critical Description: Correctable mem...
[07:53:46] 10serviceops, 10Operations, 10ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10MoritzMuehlenhoff)
[16:20:04] Hi, we're thinking about in-process caching in the wikidata termbox service and we're wondering how many instances of the service we're likely to have. Looking through operations/deployment-charts, it looks like at the moment all services are just one replicated instance. Is that correct? Is this the default you'd expect for a new service unless there is a special reason?
[18:05:43] tarrow: why would the instance count matter? Are you thinking about local cache vs shared cache? I would encourage you not to make a lot of assumptions other than that your service should be able to scale horizontally as traffic increases
[18:07:22] bd808: Yes, we're assuming that some local caching is what we want. But if it will actually be deployed with "loads" of instances, then we should be thinking about a shared one
[18:08:58] e.g. to pull numbers out of the air: we expect around 60 requests per min and an in-process cache with a TTL of 1 min. If it turned out that we were expecting 50 instances, that might make us reconsider
[18:09:18] I don't think anyone knows how many replicas the pod will need yet, right?
Like you are not far enough into things to find out how many concurrent users a single deploy will handle or how many global concurrent users will need to be supported
[18:13:31] bd808: we already have a reasonable estimate of the request rate (1/s) we'll have, but I was wondering how I (/we) might estimate how that would translate into the number of replicas
[18:19:09] or is this very hard to estimate until we actually deploy, because there will be some auto-scaling as the load varies?
[18:24:20] I think I'm trying to figure out if my assumption is correct that the request rate (per second) is of the same order of magnitude as the number of instances
[18:24:31] I don't know if anything has been built out to support autoscaling on the production k8s cluster yet. I would kind of guess not
[18:26:13] how to structure the cache is obviously a complicated issue, especially when you have no actual experience running the service at scale yet
[18:27:29] I am sure _j.oe_ will have some opinions for you when he's around. I think he's out on vacation this week though.
[18:28:07] Yep, I imagine it will need some tinkering post-deploy; I'm just trying to figure out where to start
[18:29:18] anywho! thanks for the thoughts :)
[18:34:26] <_joe_> (supposed to be on vacation but around) tarrow: the number of instances is going to be elastically determined by the load
[18:34:49] <_joe_> so I decidedly discourage usage of per-instance caching of the final rendered results
[18:35:14] <_joe_> I'd cache locally just things that might need to be computed across multiple non-cached requests
[18:35:23] _joe_: the caching would be of some of the things we use to do the rendering, not the result
[18:35:42] <_joe_> how much data are we talking about?
:)
[18:35:43] specifically things like localisation messages
[18:35:59] <_joe_> yep it makes sense to keep them in a local cache probably
[18:36:15] <_joe_> keep in mind pods can be created/destroyed arbitrarily, more or less
[18:36:24] yeah, that's fine
[18:36:54] <_joe_> so the cache is going to be a real local cache, not a semi-persistent cache like apc is on our appservers :)
[18:37:04] cool
[18:37:22] and just to check that something like a 1 min TTL wouldn't be worthless if the request rate is 1/s?
[18:37:32] <_joe_> well
[18:37:57] <_joe_> it would mean a 59/60 => ~98% hit rate
[18:38:00] <_joe_> not that bad :)
[18:38:09] yeah, that would be fine
[18:38:32] but if 2 instances then a 58/60 hit rate (also cool)
[18:38:49] but if 50 instances then a 10/60 hit rate (not so cool)
[18:39:04] <_joe_> well we won't need 50 instances for 1 req/s I hope
[18:39:06] <_joe_> :P
[18:39:25] <_joe_> I guess 1 instance can withstand something like 1 req/s
[18:39:36] yeah! I really hope so too! :P I just wanted to check that by default there wasn't some massive pool of instances
[18:39:48] yep, I hope that 1-2 instances would be sufficient
[18:40:18] <_joe_> tarrow: that's one of the good things of kubernetes - we can dedicate to the service almost exactly the resources it needs
[18:40:44] <_joe_> and in a (hopefully) not so distant future, we'll have autoscaling too
[18:40:48] cool! that's great
[18:41:08] * _joe_ goes back to his "vacation"
[18:41:31] thanks for that! Please vacation more though, don't let me ruin it :P
[20:50:13] 10serviceops, 10Operations, 10RESTBase-API, 10Core Platform Team Backlog (Designing), 10Services (designing): Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10mobrovac) Conceptually I agree that we should conform to the spe...
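The caching discussion above can be sketched concretely. Below is a minimal, hypothetical illustration (not the actual termbox service code): a per-instance TTL cache of the kind proposed for localisation messages, plus the back-of-the-envelope hit-rate estimate from the conversation (60 req/min, 1 min TTL, each of N instances refilling its own cache once per TTL window). All names here are invented for the example.

```python
import time


class TtlCache:
    """Per-instance (in-process) cache with a fixed TTL.

    Hypothetical sketch of the local cache discussed for localisation
    messages; not the termbox implementation.
    """

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expiry timestamp, value)

    def get(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: still within the TTL window
        value = compute()  # cache miss: recompute/refetch the value
        self._store[key] = (now + self.ttl, value)
        return value


def estimated_hit_rate(instances, requests_per_min=60, ttl_min=1):
    """Rough hit rate when each instance keeps its own TTL cache.

    Each instance must refill its cache once per TTL window, so about
    instances / ttl_min of the requests_per_min requests are misses.
    """
    misses_per_min = instances / ttl_min
    return max(0.0, (requests_per_min - misses_per_min) / requests_per_min)


# The numbers from the conversation:
#  1 instance   -> 59/60, the "~98%" figure
#  2 instances  -> 58/60 (also cool)
#  50 instances -> 10/60 (not so cool)
```

Note the estimate assumes requests are spread evenly across instances, and that pods can be created or destroyed arbitrarily, so a fresh pod always starts with a cold cache; that is part of why caching only intermediate data (not final rendered results) is recommended here.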
[20:53:39] 10serviceops, 10Operations, 10RESTBase-API, 10TechCom, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10mobrovac)
[22:00:45] 10serviceops, 10Operations, 10RESTBase-API, 10TechCom, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10Pchelolo) I would say we should update our swagger to 3.0 and become standard-compatible. In 3.0 there's a...