[04:37:16] 10serviceops, 10Excimer, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), and 2 others: Excimer: new profiler for PHP - https://phabricator.wikimedia.org/T205059 (10Krinkle)
[05:53:18] <_joe_> ottomata should really stay on irc when he's not awake
[05:54:04] <_joe_> the reasons not to have 50 pods with one runner instead of 10 with 5 runners are multiple, but I guess I'll explain later
[08:05:39] yeah I was going to make the same point
[09:54:05] 10serviceops, 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Pablo-WMDE) @Smalyshev No reason to be sorry for asking the right questions! If we truly wanted* to boil the reason down to one sentence:...
[11:30:24] 10serviceops, 10Operations, 10Proton, 10Readers-Web-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Joe)
[11:31:34] 10serviceops, 10Operations, 10Proton, 10Readers-Web-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Joe) @Jhernandez I'm happy to explain to you whatever you might want to know about our load-balancing infrastructure, a...
[12:47:13] akosiaris: _joe_ for my education, why is it better to have 10 pods with 5 runners rather than 50 pods? Only reason i can think of is that node is single-threaded, so better CPU utilization?
[12:48:06] <_joe_> fsero: every worker uses one thread, so no
[12:48:44] <_joe_> fsero: you remove some overhead, both in terms of running sidecars and in terms of service-runner itself
[12:49:02] <_joe_> and most importantly, you reduce the number of objects kubernetes needs to manage
[12:49:16] <_joe_> also if the application has a local cache, it becomes more effective
[12:49:34] <_joe_> the counterargument is granular resource allocation
[12:49:42] <_joe_> we should just find the sweet spot for us
[12:50:53] as another counterargument it makes scaling policies more complex; if one more pod means several workers, scale-down policies should be even more conservative
[12:51:03] but thanks :)
[12:51:13] <_joe_> that's still part of "granular resource allocation"
[12:51:48] <_joe_> we can surely get to the point where it's better for us to have 1 pod == 1 worker
[12:52:06] <_joe_> but 1 worker is a pretty lame amount of responses/s AFAICT
[12:52:30] <_joe_> we're anyway going to be memory bound well before we're cpu bound I think
[13:25:44] 10serviceops, 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) > The reason this is a dedicated service is the language it is written in (typescript), which was chosen because it allows us to cr...
[14:05:05] one more reason is memory consumption
[14:05:45] * apergos is listening closely... err, reading closely
[14:06:00] depending on the software, linux's CoW memory tricks mean that starting more pods vs more workers will result in overall greater memory usage
[14:06:06] ORES is a fine example of that
[14:06:37] as all workers seem to be started off the main one, and after the models have been loaded into memory
[14:06:59] all workers end up sharing the same memory, as the ML models are effectively read-only python objects
[14:07:09] right
[14:07:15] but that's a trick linux does, it's not evident in ps output
[14:07:32] where if you count the total RSS you will end up with many times more memory than the box has
[14:08:03] I read the code trying to figure out exactly what they do at some point, trying to make that memory show up under shared
[14:08:18] but as effectively we are talking about python objects, that's impossible
[14:08:22] heh
[14:09:50] that's pretty interesting stuff
[14:10:03] is there a kube (n) presentation on that in the works?
[14:10:56] if you are interested in how we could achieve that, the trick is mmap() with MAP_SHARED and then using that memory as read-only memory for all workers to refer to
[14:11:06] but again, python objects can't be stored in there as they go in the heap
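(A minimal sketch of the fork-after-load pattern described above, not taken from any of the repos discussed: the parent loads the models once and only then forks workers, so Linux copy-on-write keeps those pages shared even though each worker's RSS still counts them, which is why summing RSS overstates real usage. The loader, path and per-request work are hypothetical.)

```python
import os
import pickle

def load_models(path):
    # Hypothetical loader: in ORES this would be the ML models; here it is
    # just one large pickled object read fully into the parent's heap.
    with open(path, "rb") as f:
        return pickle.load(f)

def score(models, event):
    # Placeholder for per-request work that only *reads* the models.
    return len(models)

def main(num_workers=5):
    models = load_models("/srv/models/all-models.pkl")  # hypothetical path
    for _ in range(num_workers):
        if os.fork() == 0:
            # Child: it inherits the already-loaded models. Until a page is
            # written to, it stays shared with the parent via copy-on-write,
            # yet it still shows up in this child's RSS -- which is why
            # adding up RSS across workers gives many times the box's memory.
            # (CPython refcounting does write to object headers, so the
            # sharing erodes over time; that is also why these objects can't
            # simply be parked in an mmap(MAP_SHARED) region.)
            print(os.getpid(), score(models, None))
            os._exit(0)
    for _ in range(num_workers):
        os.wait()

if __name__ == "__main__":
    main()
```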
[14:11:37] akosiaris: seeking some recommendations :)
[14:11:41] apergos: haven't thought of that.... hmmm
[14:11:45] look at this nasty thing I did: https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/483035/7/charts/eventgate-analytics/templates/deployment.yaml
[14:11:47] ottomata: ah yes. you need a bouncer
[14:11:50] line 93-109
[14:11:54] that's my #1 recommendation
[14:11:56] haha
[14:12:08] lol
[14:12:11] an IRC bouncer that is. I see your questions the next morning and there's no one to answer to
[14:12:17] or use irccloud
[14:12:19] :P
[14:12:36] naw, i don't want to get chats when i'm offline
[14:12:39] that's what email is for :)
[14:12:56] i consider irc ephemeral, if i miss it, i miss it, if someone misses me, so be it! :)
[14:13:01] then pose questions in email form too?
[14:13:10] or wait until the next morning :D
[14:13:14] i would if it was urgent
[14:13:52] i dunno, sometimes it's just nice to talk realtime so i can wait
[14:14:33] we can also just wait until the meeting tomorrow if one of y'all can make it
[14:14:46] sure, but at times with all the tz differences it's difficult
[14:14:52] SYS_PTRACE ?
[14:15:08] ya, it allows me to SIGHUP eventgate
[14:15:17] why SIGHUP it ?
[14:15:33] that container is periodically running git pull
[14:15:37] on schemas
[14:15:45] in a stupid while; sleep loop
[14:15:54] if there is a change
[14:15:58] OH HM actually
[14:16:13] haha, wait, i think i don't need to sighup in this case. doh. that is for stream config, not schemas
[14:16:19] lol
[14:16:20] it doesn't cache if a schema doesn't exist
[14:16:29] and schemas are immutable
[14:16:32] btw that file is full of dev stuff right ?
[14:16:34] oh hm,......
[14:16:42] akosiaris: probably some, like that git pull container
[14:16:45] but most of it i like
[14:16:48] oh wow
[14:16:51] cause I see 192.168.99.100 splattered all around it
[14:16:51] if i don't have to sighup
[14:16:53] OH
[14:16:54] yes
[14:16:55] well
[14:16:57] that's in a conditional
[14:17:07] if minikube ...
[14:17:10] and images we don't have as well, like image: appropriate/nc
[14:17:15] also ottomata, related to that, why not extract part of that deployment into a dev template and load it?
[14:17:16] yeah that's dev we'd have to import
[14:17:23] anyway wait, if we don't need to sighup
[14:17:30] then can I just use a CronJob
[14:17:35] seeing zookeeper in a deployment.yaml makes me scream
[14:17:41] to git pull in the volume? (can a CronJob share the volume?)
[14:17:51] ok, seeking a recommendation there too
[14:18:00] how would you suggest we make it easy for folks to run in minikube?
[14:18:11] all that zookeeper/kafka stuff is wrapped in a conditional
[14:18:34] let's start by segregating it and keeping it simple
[14:18:45] you can create multiple charts obviously
[14:18:56] right, but there is so much copy/paste
[14:18:57] and have services talk to each other in minikube
[14:19:02] oh for kafka
[14:19:09] and zookeeper
[14:19:19] and whatever else will never make it to production
[14:19:26] let's be btw very clear about this
[14:19:34] (whoops)
[14:19:36] NO STATEFUL SERVICES IN KUBERNETES!
[14:19:42] am I loud enough ?
[14:19:43] akosiaris: those are meant for development only
[14:19:48] the ONLY reason they are there
[14:19:50] is to run in minikube
[14:20:01] so i can run eventgate and test and do the performance testing thing
[14:20:19] yeah, but they can be helm charts in their own right
[14:20:42] i am happy to put them in another chart if that is better... that would be easier for development, because then the kafka and zk pods won't go up and down as I delete and install the chart while developing
[14:20:48] no need to go around doing a big "if dev" "foo" else "bar"
[14:21:09] aye, just a --set main_app.kafka.brokers.bla.bla=
[14:21:29] sure, i can do that, didn't know if a chart just for local development in deployment-charts would be ok
[14:21:39] should we import those images into docker-registry?
[14:21:40] ottomata: i believe zookeeper and kafka is a requirement for testing; i suggest you use https://docs.helm.sh/chart_best_practices/#requirements-files a requirements.yaml pointing to local helm charts created for zookeeper and kafka
[14:21:52] oo cool
[14:22:19] ottomata: we don't just import images in the docker registry btw. We try to build everything from source for the obvious reasons
[14:22:29] then probably you will have a values.minikube.yaml and a values.production.yaml
[14:22:57] there are ways to disable subcharts using values.yaml, if you don't find it ping me
[14:24:13] ottomata: also to answer the question of more pods or more workers... it's not that easy. up to now we went with num_workers: 1 after some tests showed that it is more reliable than 0
[14:24:41] huh interesting
[14:25:07] cool, requirements.yaml makes tons of sense, will do that
[14:25:09] namely service-runner's supervision works a tad better than plain kubernetes
[14:25:42] akosiaris: if it is for devel, would it be worth including the kafka image in our repo? or just add a step in a readme to dl it in your own docker machine?
[14:25:43] tests with 2 and above show more or less the obvious. You need fewer pods when increasing workers. But it's not just that balance that needs to be struck
[14:26:00] ottomata: probably not worth it including it
[14:26:04] ok great
[14:26:17] minikube can anyway download from any docker registry out there
[14:26:22] ya
[14:27:03] there is also the balance of how many workers we start in the dev environment
[14:27:29] and generally we don't want the pod size to differ much between dev and production, as that increases the diff between the 2 envs
[14:28:10] aye
[14:28:24] i'll just leave it at 1 for now, i don't have any reason to change it (yet)
[14:28:36] that sounds fine. We can always revisit anyway
[14:28:41] ya
[14:29:19] oh, ... can a CronJob access a shared volume of a pod? (I think not... the CronJob runs in a different pod?)
[14:30:36] yes, it is indeed
[14:31:01] the pod is the "quantum" scheduling unit if that helps you. Anything else is kind of built on it
[14:31:27] I am not sure if a volume could be doubly bind mounted in 2 pods
[14:31:28] ottomata: no
[14:31:32] ah there you go
[14:31:49] CronJob will spawn its own pod, so it cannot share a volume with another pod
[14:32:09] yar, thought so :/
[14:32:15] what's your use case for the cronjob anyways?
[14:32:20] git pull :P
[14:32:22] yup
[14:32:46] effectively live updating the configuration without a deploy, from what I gather
[14:32:46] we might want to deploy this schema registry as a simple http server elsewhere for this
[14:32:52] mmm how frequently does that change?
[14:32:53] then it can get the schemas remotely instead
[14:32:56] fsero not often
[14:33:04] at least, not yet
[14:33:09] if you git pull over an init container it makes more sense to me to do a redeploy
[14:33:19] you will get a new bunch of pods with the new schema
[14:33:23] that's better you think?
[14:33:28] given that two versions could coexist
[14:33:31] at the same time
[14:33:36] ya they could.
[14:33:54] so there is no need to git pull from a cronjob, pods are meant to be immutable once launched
[14:34:16] so if you need an update just create another deploy and update :)
[14:34:18] yeah. might be ok. even though the schemas are supposed to be immutable too, they rarely do need to be changed
[14:34:25] which would require a redeploy (without a SIGHUP, anyway :) )
[14:34:29] ok
[14:34:32] i'll go that route then
[14:34:37] that'll make the chart way simpler
[14:36:11] fsero: eventually this should be automatic. when we get to use this service for more general purpose analytics
[14:36:19] people will make new schemas pretty frequently
[14:36:56] but, at that point we can request schemas over http rather than locally
[14:37:10] actually, looking into it..
i don't really get why you need an init container; after each commit or tag or release of the event-schema repo you could create a new container image
[14:37:15] and use that container image in that deployment
[14:37:20] hm, one thought: redeploying (and sighuping) have the same effect of deleting all compiled/cached schemas
[14:37:54] which is fine, but it means that each eventgate process will have to recompile each schema the first time it sees an event of that type
[14:38:02] so, the first few requests to each new pod will be slowed
[14:38:15] i think that's ok though, especially until we get a schema service
[14:38:42] fsero, i'd rather not build new containers every time if i can help it, the schemas are meant to be an evolving resource
[14:39:08] another design might have them stored in a database instead of a git repo
[14:39:15] we chose git because we wanted them to be decentralized
[14:41:45] akosiaris: _joe_: does it become complicated to manage the networking aspect of multiple workers in a single pod? i had thought that k8s services abstracted over pods, not containers within a pod
[14:42:19] <_joe_> cdanis: workers are doing IPC AFAIR
[14:42:25] cdanis: well we mean multiple processes here so no
[14:43:02] in fact IIRC we are talking about https://nodejs.org/api/cluster.html
[14:43:20] oh, okay
[14:43:38] I had it in my head that there were multiple identical HTTP servers running in the same pod
[14:45:08] (I once ran a service that worked something like that; I do not recommend it)
[14:45:37] multiple containers all running HTTP servers? hmm
[14:45:58] it wasn't even HTTP; it was an old proprietary TCP protocol
[14:45:58] and I am guessing talking to each other and causing a nice graph of interconnections
[14:46:06] * akosiaris runs away screaming
[14:46:08] no, fortunately each of them was independent of the others
[14:46:45] but they had ~10GB or so of somewhat-static data they all needed to mmap()
[14:47:07] and each one was mostly single-threaded (except for some stages of query execution)
[14:47:22] thus the interest in packing multiple of them onto a single Borg alloc (basically a k8s pod)
[14:49:37] ottomata: eventgate tracks the schema version used somehow? As in the event produced or as an api endpoint?
[14:50:20] just curious, because doing git pull in an initContainer has a downside: you don't know which version of the schemas is there unless you enter the container and do a git log
[14:50:24] ah yes, the mmap() part makes sense
[14:50:43] fsero: the files in the schema repo are in principle immutable
[14:50:50] so git pull will only ever get new files
[14:52:08] but yes, each event points at its own schema via a uri
[14:56:03] ack :)
[15:05:41] <_joe_> please let's not modify the content of containers at runtime
[15:05:50] <_joe_> re: what fsero said
[15:06:31] <_joe_> if it's an on-demand caching of content from another source, that's ok ofc
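(A minimal sketch of the mmap()-with-MAP_SHARED trick akosiaris mentioned earlier, and of the pattern cdanis describes: several single-threaded workers in one pod mapping the same large read-only file, so the kernel backs all of the mappings with one set of page-cache pages. The file path is hypothetical, and as noted in the log this only works for raw bytes, not for Python objects.)

```python
import mmap
import os

DATA_PATH = "/srv/data/static-dataset.bin"  # hypothetical large read-only file

def open_shared_dataset(path=DATA_PATH):
    # Every worker process maps the same file read-only with MAP_SHARED.
    # The kernel serves all of the mappings from the same page-cache pages,
    # so N workers do not cost N copies of the data.
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    data = mmap.mmap(fd, size, prot=mmap.PROT_READ, flags=mmap.MAP_SHARED)
    os.close(fd)  # the mapping stays valid after the fd is closed
    return data

def lookup(data, offset, length):
    # Only raw bytes live in the mapping; anything needed as a Python object
    # still has to be deserialized into each worker's own heap, which is why
    # the ORES models can't simply be stored this way.
    return bytes(data[offset:offset + length])
```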
[15:09:14] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Performance-Team (Radar): Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou)
[15:13:57] 10serviceops, 10MediaWiki-Cache, 10Operations, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou)
[15:22:39] 10serviceops, 10Operations, 10Thumbor, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles)
[15:26:16] kafka-single-node: https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/484498/
[15:26:38] <_joe_> ottomata: I think you should use SRV records
[15:27:01] <_joe_> a solution I proposed at the time as an alternative to our awkward puppet functions :P
[15:28:52] ottomata: in minikube you can use in-cluster dns
[15:29:12] so zookeeper-service.NAMESPACE.svc.cluster.local will resolve to the pod ip
[15:29:37] where NAMESPACE is the namespace where it is deployed, exposed in helm as .Release.Namespace
[15:29:43] _joe_: that makes sense
[15:30:04] q: how does the srv record work if a broker is down?
[15:30:55] fsero: instead of the 192.168.99.100 in places?
[15:31:09] yep
[15:31:58] it will also use the internal network instead of the public minikube ip and nodeport, and then probably you won't need that docker0 promisc on
[15:32:48] hm!
[15:35:07] <_joe_> ottomata: the SRV record gives you a list of all brokers
[15:35:15] <_joe_> the client may cycle between them I guess
[15:35:22] <_joe_> where the client == your code
[15:35:33] joe ya, the client will use whatever is given to it to ask for the final list of brokers to use
[15:35:44] but, if what is given to it is offline, i don't think it will retry, if it is only given one.
[15:36:00] unless the client is restarted and it requests again and happens to be given a different broker to try
[15:36:14] normally clients are given the list of brokers to try, and they will go down the list until they find one that responds
[15:38:49] <_joe_> ottomata: I'm saying you can do it in your code
[15:39:08] <_joe_> and yes, you can take the srv record and build a list of brokers to pass to the client
[15:39:44] joe what's the etcd srv record in prod? wanna query it to understand
[15:39:59] <_joe_> there are several
[15:40:09] whichever :)
[15:40:55] <_joe_> _etcd._tcp.eqiad.wmnet
[15:41:13] <_joe_> _etcd-server-ssl._tcp.v3.eqiad.wmnet
[15:41:25] <_joe_> as you can see it also specifies various metadata about the service
[15:41:33] <_joe_> like a weight (if you like) and a port
[15:42:15] oh... so the code needs to do a manual dns query...?
[15:42:31] and translate the result to the broker list?
[15:43:34] (am googling...)
[15:46:25] in minikube you already have SRV records https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#srv-records
[15:46:51] in production you will not have them for now, since we have not deployed any cluster DNS server yet
[15:47:22] <_joe_> fsero: he's referring to kafka brokers
[15:47:32] <_joe_> those should stay off of kube as long as possible
[15:47:39] (agree :) )
[15:47:41] <_joe_> typically until the day I retire
[15:47:47] <_joe_> or I leave
[15:48:23] apparently "librdkafka uses the standard libc resolver and will attempt to resolve the broker address on each connect attempt but no more often than broker.address.ttl which defaults to 1s."
[15:49:12] _joe_: not understanding how kafka clients would use this srv record... would my code need to manually do a SRV query and then parse the result?
[15:50:32] <_joe_> yes
[15:50:40] <_joe_> it's usually pretty easy
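(A sketch of the manual SRV lookup _joe_ is describing, assuming the dnspython library is available; the _kafka._tcp record name is hypothetical, only the etcd records above are real. Each SRV answer carries the priority, weight, port and target metadata mentioned at 15:41.)

```python
import dns.resolver  # dnspython (>= 2.0 for resolve()), assumed available

def brokers_from_srv(name="_kafka._tcp.eqiad.wmnet"):
    """Resolve an SRV record and build a bootstrap broker list from it."""
    answers = dns.resolver.resolve(name, "SRV")
    # Lowest priority first, then highest weight, per the SRV convention.
    ordered = sorted(answers, key=lambda r: (r.priority, -r.weight))
    return ["%s:%d" % (r.target.to_text().rstrip("."), r.port) for r in ordered]

# The result is only used for bootstrapping: hand it to the Kafka client as
# its broker list; after the first successful connection the client learns
# the real broker list from cluster metadata anyway.
print(brokers_from_srv())
```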
[15:51:00] hmmmm
[15:51:06] why is that better than a round robin A record?
[15:51:53] <_joe_> because it gives you potentially more control on how the clients use the brokers
[15:52:03] <_joe_> but if you don't think it applies, a RR record is good too
[15:52:14] <_joe_> just go add it to ops/dns :P
[15:52:18] the clients only use the address for bootstrap
[15:52:39] once they connect, the broker will give them a list of brokers to use for actual communication
[15:52:52] and i think since librdkafka is retrying every 1 second per connect attempt
[15:52:57] round robin should be fine
[15:53:18] if the given broker is down, the next connect attempt should re-resolve the address and get the next one
[15:53:24] <_joe_> well it depends
[15:53:28] <_joe_> anyways
[15:53:38] <_joe_> it should be good enough, yes
[15:53:51] ok cool. i think i prefer that over custom code
[15:53:59] will make an ops/dns patch...
[15:54:04] for review
[16:36:59] _joe_: would lvs be better than round robin dns?
[16:37:11] <_joe_> ottomata: depends
[16:37:27] <_joe_> ottomata: can we properly monitor kafka from pybal?
[16:37:32] <_joe_> if so, it /might/ be better
[16:37:38] like, monitor if a broker is up?
[16:37:40] i think so
[16:37:51] if it can connect via tcp 9092 it is up
[16:37:54] <_joe_> but I don't think it gives you a significant gain
[16:38:01] <_joe_> over retrying
[16:38:05] yeah, just faster failover/ttl
[16:38:07] yeah
[16:38:34] not really that much of a gain; i guess it would be faster for removing/adding brokers, but since it is only used for bootstrapping it doesn't really matter
[16:57:44] _joe_: something like https://gerrit.wikimedia.org/r/#/c/operations/dns/+/484509/ ?
[16:58:27] hm, not sure if the IPv6 entries should be in the same round robin A record...
[16:58:49] <_joe_> ottomata: yeah they shouldn't I guess
[16:59:17] <_joe_> also check with brandon (or RTFM :P) what's the right syntax for round-robins in gdnsd
[17:21:50] fsero: still seem to need the promisc on even when using dns names
[17:36:17] ottomata: mmm that's weird
[17:36:30] i'll try to do it on my minikube
[17:42:28] 10serviceops, 10Operations, 10Proton, 10Readers-Web-Backlog, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Pchelolo) With request rate as low as this endpoint is expected to have, the Varnish hit rate would probably be very cl...
[21:22:06] assuming my question answerers have left for the day?
[21:22:06] :)
[21:34:52] lol ottomata
[21:35:02] hahah
[21:39:55] :)
[21:47:25] ottomata: they did suggest though, earlier or yesterday, that you keep an irc client online
[21:48:01] so as to keep the conversation going :)
[21:50:03] heheh i talked with them this morning
[21:50:07] never! i'll send an email...
[21:50:11] or wait until tomorrow :p
[21:53:46] your resistance to an irc bouncer is noted
[22:04:57] 10serviceops, 10Thumbor, 10User-jijiki: First page of a specific PDF files on Commons does not render a preview - https://phabricator.wikimedia.org/T213771 (10jijiki)
[22:05:09] 10serviceops, 10Thumbor, 10User-jijiki: First page of a specific PDF files on Commons does not render a preview - https://phabricator.wikimedia.org/T213771 (10jijiki) p:05Triage→03Normal
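(The 15:51-16:38 exchange above settles on a plain round-robin A record for Kafka bootstrap rather than SRV parsing or LVS. A minimal sketch of what that looks like from the client side, using confluent-kafka, a librdkafka binding; the record name and topic are hypothetical placeholders, not the names added in the ops/dns patch.)

```python
from confluent_kafka import Producer  # librdkafka binding, assumed available

producer = Producer({
    # Hypothetical round-robin A record, used only for bootstrap; once a
    # connection succeeds the client learns the full broker list from
    # cluster metadata, so the record never needs to track every broker.
    "bootstrap.servers": "kafka-bootstrap.eqiad.wmnet:9092",
    # librdkafka re-resolves the name on connect attempts, but no more often
    # than broker.address.ttl (milliseconds), so if the address it picked is
    # down, a later attempt can land on another entry behind the record.
    "broker.address.ttl": 1000,
})

producer.produce("test-topic", b"hello")  # hypothetical topic
producer.flush()
```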