[07:03:21] 10serviceops, 10Operations: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10jijiki) p:05Triage→03Normal
[07:03:41] 10serviceops, 10Operations: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10jijiki)
[07:03:47] 10serviceops, 10Operations: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10Marostegui) I power cycled it from the idrac (it was totally stuck)
[07:53:37] 10serviceops, 10Operations: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10MoritzMuehlenhoff) a:03Cmjohnson The server has broken memory (and warranty expires in a month): ` Record: 43 Date/Time: 03/10/2019 07:53:15 Source: system Severity: Critical Description: Correctable mem...
[07:53:46] 10serviceops, 10Operations, 10ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10MoritzMuehlenhoff)
[16:20:04] Hi, we're thinking about in-process caching in the wikidata termbox service and we're wondering how many instances of the service we're likely to have. Looking through operations/deployment-charts, it looks like at the moment all services are just one replicated instance. Is that correct? Is this the default you'd expect for a new service unless there is a special reason?
[18:05:43] tarrow: why would the instance count matter? Are you thinking about local cache vs shared cache? I would encourage you not to make a lot of assumptions other than that your service should be able to scale horizontally as traffic increases
[18:07:22] bd808: Yes, we're assuming that some local caching is what we want. But if it will actually be deployed with "loads" of instances, then we should be thinking about a shared one
[18:08:58] e.g. to pull numbers out of the air: we expect around 60 requests per min and an in-process cache with a TTL of 1 min. If it turned out that we were expecting 50 instances, that might make us reconsider
[18:09:18] I don't think anyone knows how many replicas the pod will need yet, right?
Like you are not far enough into things to find out how many concurrent users a single deploy will handle or how many global concurrent users will need to be supported
[18:13:31] bd808: we already have a reasonable estimate of the request rate (1/s) we'll have, but I was wondering how I (/we) might estimate how that would translate into the number of replicas
[18:19:09] or is this very hard to estimate until we actually deploy, because there will be some auto-scaling as the load varies?
[18:24:20] I think I'm trying to figure out if my assumption is correct that the request rate (per second) is of the same order of magnitude as the number of instances
[18:24:31] I don't know if anything has been built out to support autoscaling on the production k8s cluster yet. I would kind of guess not
[18:26:13] how to structure the cache is obviously a complicated issue, especially when you have no actual experience running the service at scale yet
[18:27:29] I am sure _j.oe_ will have some opinions for you when he's around. I think he's out on vacation this week though.
[18:28:07] Yep, I imagine it will need some tinkering post-deploy; I'm just trying to figure out where to start
[18:29:18] anywho! thanks for the thoughts :)
[18:34:26] <_joe_> (supposed to be on vacation but around) tarrow: the number of instances is going to be elastically determined by the load
[18:34:49] <_joe_> so I decidedly discourage usage of per-instance caching of the final rendered results
[18:35:14] <_joe_> I'd cache locally just things that might need to be computed across multiple non-cached requests
[18:35:23] _joe_: the caching would be of some of the things we use to do the rendering, not the result
[18:35:42] <_joe_> how much data are we talking about?
:)
[18:35:43] specifically things like localisation messages
[18:35:59] <_joe_> yep it makes sense to keep them in a local cache probably
[18:36:15] <_joe_> keep in mind pods can be created/destroyed arbitrarily, more or less
[18:36:24] yeah, that's fine
[18:36:54] <_joe_> so the cache is going to be a real local cache, not a semi-persistent cache like apc is on our appservers :)
[18:37:04] cool
[18:37:22] and just to check that something like a 1 min TTL wouldn't be worthless if the request rate is 1/s?
[18:37:32] <_joe_> well
[18:37:57] <_joe_> it would mean a 59/60 => ~98% hit rate
[18:38:00] <_joe_> not that bad :)
[18:38:09] yeah, that would be fine
[18:38:32] but if 2 instances then a 58/60 hit rate (also cool)
[18:38:49] but if 50 instances then a 10/60 hit rate (not so cool)
[18:39:04] <_joe_> well we won't need 50 instances for 1 req/s I hope
[18:39:06] <_joe_> :P
[18:39:25] <_joe_> I guess 1 instance can withstand something like 1 req/s
[18:39:36] yeah! I really hope so too! :P I just wanted to check that by default there wasn't some massive pool of instances
[18:39:48] yep, I hope that 1-2 instances would be sufficient
[18:40:18] <_joe_> tarrow: that's one of the good things of kubernetes - we can dedicate to the service almost exactly the resources it needs
[18:40:44] <_joe_> and in a (hopefully) not so distant future, we'll have autoscaling too
[18:40:48] cool! that's great
[18:41:08] * _joe_ goes back to his "vacation"
[18:41:31] thanks for that! Please vacation more though, don't let me ruin it :P
[20:50:13] 10serviceops, 10Operations, 10RESTBase-API, 10Core Platform Team Backlog (Designing), 10Services (designing): Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10mobrovac) Conceptually I agree that we should conform to the spe...
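The caching discussion above can be sketched concretely. Below is a minimal, hypothetical illustration (not the actual termbox service code): a per-instance TTL cache of the kind proposed for localisation messages, plus the back-of-the-envelope hit-rate estimate from the conversation (60 req/min, 1 min TTL, each of N instances refilling its own cache once per TTL window). All names here are invented for the example.

```python
import time


class TtlCache:
    """Per-instance (in-process) cache with a fixed TTL.

    Hypothetical sketch of the local cache discussed for localisation
    messages; not the termbox implementation.
    """

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expiry timestamp, value)

    def get(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: still within the TTL window
        value = compute()  # cache miss: recompute/refetch the value
        self._store[key] = (now + self.ttl, value)
        return value


def estimated_hit_rate(instances, requests_per_min=60, ttl_min=1):
    """Rough hit rate when each instance keeps its own TTL cache.

    Each instance must refill its cache once per TTL window, so about
    instances / ttl_min of the requests_per_min requests are misses.
    """
    misses_per_min = instances / ttl_min
    return max(0.0, (requests_per_min - misses_per_min) / requests_per_min)


# The numbers from the conversation:
#  1 instance   -> 59/60, the "~98%" figure
#  2 instances  -> 58/60 (also cool)
#  50 instances -> 10/60 (not so cool)
```

Note the estimate assumes requests are spread evenly across instances, and that pods can be created or destroyed arbitrarily, so a fresh pod always starts with a cold cache; that is part of why caching only intermediate data (not final rendered results) is recommended here.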
[20:53:39] 10serviceops, 10Operations, 10RESTBase-API, 10TechCom, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10mobrovac)
[22:00:45] 10serviceops, 10Operations, 10RESTBase-API, 10TechCom, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10Pchelolo) I would say we should update our swagger to 3.0 and become standard-compatible. In 3.0 there's a...