[08:08:58] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (Joe) >>! In T208524#4982558, @akosiaris wrote: > **"Log all requests received via the production logging facilities"** > > Should we make this a bit more generic? e...
[08:10:56] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (Joe) >>! In T208524#4982537, @akosiaris wrote: > I think that the > > **"Collect RED metrics; be able to export those metrics according to WMF standards specified i...
[08:12:14] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (Joe) >>! In T208524#4982033, @dr0ptp4kt wrote: > Would it be possible to clarify the wording on "There is no existing FLOSS software that provides the same functiona...
[08:16:38] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (Joe)
[08:27:35] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (jijiki) > **Production deployment** > > Have backups if the service stores any data Maybe we could add "and a restoration/emergency plan", in terms that som...
[09:14:30] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (akosiaris) >>! In T208524#4987369, @Joe wrote: >> I am also a bit skeptical about this: >> >> "Be able to perform requests **via TLS** to a specific hostname/ip pro...
[09:16:34] _joe_: good morning. There are a few minor fixes for docker-pkg which would be worth merging in when you get time ;)
[09:16:43] I think I rebased them all over the last few days
[09:17:14] <_joe_> hashar: yeah I'm counting on wrapping a release up today
[09:17:49] there are a bunch of them that are very minor
[09:18:08] your patch for the "update action", I noticed you crafted a change for CI that relied on it so it seems to work
[09:18:18] <_joe_> yes, I'm refining that
[09:18:18] though I haven't reviewed the code (and the tests fail somehow, bah)
[09:18:34] <_joe_> then merging that and the rest of your changes that are ok, fix the other ones
[09:18:37] <_joe_> and make a deploy
[09:18:54] fwiw, eventgate-analytics has a dashboard https://grafana.wikimedia.org/d/POYzU8rmz/eventgate-analytics?refresh=1m&orgId=1
[09:18:56] and I have one to speed up the scan process (using a ThreadPool to do queries to the registry) https://gerrit.wikimedia.org/r/#/c/operations/docker-images/docker-pkg/+/484578/ which nicely speeds up the builder scan ;)
[09:19:12] some graphs are still empty as no requests make it to the service
[09:19:37] _joe_: good! And after the release, I guess we can look at bumping the docker python dependency version :)
[09:19:52] <_joe_> akosiaris: nice!!
[09:20:00] yeah it was really easy
[09:20:12] took me like maybe 1 min to create it?
[09:20:37] I have to say something is a tad off though
[09:20:53] I expected to see some requests for the kubelet checks
[09:21:02] <_joe_> right
[09:21:03] s/for/because of/
[09:22:39] <_joe_> you can query the exporter in the pods to see if data is there
[09:25:47] _joe_: :P
[09:26:52] <_joe_> no I mean with curl, to see if some error in translation happened
[09:27:00] <_joe_> between prometheus and the grafana graphs
[09:27:00] yeah I know
[09:27:18] <_joe_> but I fear the error might be one level below
[09:27:20] <_joe_> or even two
[09:27:30] I am pretty sure too
[09:27:35] there are no metrics for /?spec
[09:27:37] or well /
[09:27:42] more correctly
[09:27:43] <_joe_> heh
[09:35:52] <_joe_> so, now on a traditional server I'd go in, fire up tcpdump and quickly confirm no stats are emitted
[09:36:03] <_joe_> how can I do it with a kubernetes pod?
[09:36:14] <_joe_> I know you created an image to do that
[09:36:16] heh, the trick is to use the exporter pod currently
[09:36:37] but you can do that on the actual application image as well
[09:36:43] s/image/container/
[09:36:54] ssh into the host, get the pod name from kubectl
[09:37:14] docker exec -u root -it /bin/bash
[09:37:21] <_joe_> maybe make it easy to do automatically? like having a script that does it all for you - attach the debug container to the pod
[09:37:23] and then just install tcpdump
[09:37:27] <_joe_> oh
[09:37:40] I was looking into attaching the debug container to the pod
[09:37:44] turns out it's not so easy
[09:37:49] <_joe_> sigh :/
[09:38:01] I still think there is a way, but I am still working on it
[09:38:10] <_joe_> there is an issue - most of our containers don't run as root
[09:38:26] that's why docker exec -u root
[09:38:40] <_joe_> d'oh
[09:38:44] this btw is untrue for the actual prometheus exporter image
[09:38:51] that one is still root (and needs to be fixed)
[09:38:53] <_joe_> sorry I'm context-switching too much
[09:38:58] <_joe_> yeah, it should
[09:39:05] in this specific case kubectl exec is enough
[09:39:12] but we should close that loophole
[09:39:31] in fact.. lemme do that now
[09:47:47] <_joe_> another option is to modify our charts so that optionally, if you add "debug_release = 1" to your values, it will create a special release using the same service, with just one replica, that will have the debug image attached
[09:52:03] <_joe_> switching context again
[09:52:25] <_joe_> hashar: is there a way to have CI for docker-pkg fail if coverage has reduced?
[09:52:43] <_joe_> I mean a standardized way
[09:53:10] hmm
[09:53:20] kunal did that for phpunit coverage yeah
[09:53:36] using a different pipeline which runs the coverage twice (once for HEAD, once for HEAD^, then compare)
[09:53:40] no clue how it works though
[09:54:30] Jenkins has some support to fail a build when a test metric goes down, but Jenkins is unable to find the build for the previous/parent commit
[10:15:19] <_joe_> uhm, this is something for which a standardized interface would be worth having
[10:16:18] <_joe_> in general, I'd like to be able to define a config file in my repository to enable some kind of tests
[10:16:48] <_joe_> but that's a larger discussion and I don't think you have the time to work on it
[10:51:05] I just learned about https://github.com/phusion/baseimage-docker
[10:51:32] please read the "What's inside" part
[10:51:36] all of it
[11:31:23] <_joe_> ok the init part is something I wanted to look into
[11:31:28] <_joe_> but... logrotate?
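As an aside, a rough sketch of the pod-debugging procedure described earlier in the log (ssh to the node, find the container, get a root shell, run tcpdump). Container name, namespace, ports and the Debian-based image are assumptions, not values from this incident:

```sh
# on a deployment/control host: find which node runs the pod
kubectl -n <namespace> get pod <pod-name> -o wide

# on that kubernetes node: find the application container and get a root
# shell in it (most of our containers do not run as root, hence -u root)
docker ps | grep <pod-name>
docker exec -u root -it <container-id> /bin/bash

# inside the container (assuming a Debian-based image): install tcpdump and
# check whether any statsd metrics are actually being emitted; adjust the
# port to wherever the metrics exporter listens
apt-get update && apt-get install -y tcpdump
tcpdump -n -i any udp port 9125
```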
[11:31:55] <_joe_> ah ok, from there on, sigh
[11:33:05] <_joe_> I'm not sure about the syslog blackhole either
[11:33:57] <_joe_> specifically, I think we're running service-runner as a single process, but we shell out from it
[11:34:07] <_joe_> which shouldn't be a problem
[11:34:56] <_joe_> given all children will be children of the node process. But if SIGKILL is sent to node while it's shelling out, it could result in a zombie process
[11:36:00] I am not that much worried tbh
[11:36:23] as long as everything remains stateless, clean shutdown is not really required
[11:36:44] and zombie processes, with service-runner as a supervisor, aren't really that easy to happen
[11:37:04] and btw it is only running 1 worker per pod right now
[11:37:29] so I don't foresee a future full of zombie processes anytime soon
[11:38:09] but anyway I sent the phusion thing mostly to laugh about it
[11:38:18] they have cron, ssh, logrotate
[11:40:14] <_joe_> yeah, from the syslog thing and on it's.. they're building a vm, not a container
[11:40:29] it's even funnier that they are refuting it afterwards
[11:40:36] <_joe_> which is how people tend to use containers btw
[11:40:42] while they have an ssh server
[11:40:43] sigh
[11:40:58] <_joe_> kubernetes imposes a very specific way to handle the lifetime of your containers
[11:41:18] <_joe_> as in - don't expect anything to last or to be able to persist state locally reliably
[11:41:45] <_joe_> a damning limitation that's easily solved by nfs volume mounts, of course
[11:41:53] <_joe_> :P
[12:08:29] Hi, I'm looking at what needs to be done to set up a new SSR service for the wikidata termbox, and hence have a lot of newbie questions: do we get some "free" logging via kubernetes, like the number of connections to the service or the length of individual connections to the service?
[12:19:52] <_joe_> tarrow: oh great question! The TL;DR is "we will", but for a fuller answer you might need to wait for akosiaris to be around (I think he's at lunch)
[12:20:18] <_joe_> tarrow: is your service exposing metrics already?
[12:20:29] <_joe_> I guess it's based on service-runner, right?
[12:23:00] _joe_: good questions right back :). Right now we are not sending metrics out from the service
[12:23:51] <_joe_> ok, that will need to change before we deploy it, anyways
[12:24:01] <_joe_> sadly we still don't have implementation guidelines
[12:24:17] <_joe_> (the whole pipeline is pretty new :()
[12:24:21] yep; and right now we aren't using service-runner as far as I can tell
[12:24:34] although just looking at the docs I think perhaps we should be
[12:24:53] * tarrow just started working on this on Monday, lots is still pretty unclear :)
[12:25:23] <_joe_> tarrow: that's ok, I'm happy we're talking before you're ready to deploy tomorrow :P
[12:25:36] hehe
[12:27:55] <_joe_> it's seriously appreciated. But circling back, a typical dashboard for a service running on kubernetes right now might look like this: https://grafana.wikimedia.org/d/000000187/mathoid?refresh=1m&orgId=1
[12:28:16] <_joe_> if we can maintain some consistency in what and how metrics are reported, it would be great
[12:28:34] tarrow: definitely look at service-runner. It provides quite some stuff out of the box so you don't have to go around and implement metrics or logging yourself. In the kubernetes infra we just make sure to configure it correctly and collect the metrics (and we'll soon have logging directly into logstash as well)
[12:28:36] <_joe_> but there's more to say than can be discussed on irc I guess.
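For context, the service-runner configuration being discussed looks roughly like the sketch below, following the service-template-node conventions; the field values and the service name are illustrative, not a recommended production config:

```yaml
# config.yaml (service-runner) -- illustrative values only
num_workers: 1            # one worker per pod, as mentioned above
logging:
  level: warn
  streams:
    - type: stdout        # or a logstash/gelf stream, once that is wired up
metrics:
  type: statsd            # service-runner emits request/latency metrics here
  host: localhost
  port: 8125
services:
  - name: my-service      # hypothetical service name
    module: ./app.js
    conf:
      port: 8888
```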
[12:29:02] the template is here https://github.com/wikimedia/service-template-node
[12:29:09] <_joe_> so yes, that was why I was suggesting using service-node
[12:29:37] Cool, so most of those metrics on the dashboard come from service-runner out of the box?
[12:29:46] almost all
[12:30:00] the only ones not from service-runner are CPU and network
[12:30:44] <_joe_> yeah, tarrow was asking if service-runner will automatically send out those metrics or not
[12:30:55] <_joe_> and AIUI most of them, yes
[12:31:14] <_joe_> but Alex spent the last week in grafana land
[12:31:19] <_joe_> :P
[12:31:26] <_joe_> he knows more and better
[12:31:46] great, I'll be sure to take a good look :)
[12:34:31] <_joe_> you can ask the fine people in #-services too
[12:34:37] <_joe_> but they're all asleep right now
[12:37:54] Silly question 2: is there somewhere I can read about traffic from php -> services? How does that work with the whole NodePort thing? I'm particularly wondering if we need to think about TLS certificates etc
[12:40:05] <_joe_> you mean making requests from your service to php, or vice-versa?
[12:40:38] <_joe_> the vice-versa will work like with any other service described in mediawiki-config/wmf-config/ProductionService.php
[12:40:52] from php to our service
[12:41:07] <_joe_> so we'll pick a port for production, create an LVS ip/service name
[12:41:34] <_joe_> and add it to mediawiki's config
[12:41:48] <_joe_> no, tls will be handled by the infrastructure
[12:41:58] <_joe_> you won't need to manage it
[12:42:12] brilliant!
[12:42:13] <_joe_> it will be managed... as in, in the future :D
[12:42:24] :D
[12:42:32] <_joe_> for now, mw -> service works unencrypted
[12:43:26] <_joe_> I'm going afk for now, want to be back when the US west coast wakes up, but please ask
[12:43:32] <_joe_> we will answer eventually
[12:43:50] Thanks so much! The fog of confusion is already lifting :)
[13:52:30] _joe_: [for when you're back]: WRT a "liveness test" endpoint for Kask, did you have something more specific in mind, something you'd want us to pattern after? Somewhere in the various discussions I heard mention of `/healthz` (a la the Google convention), and of service-checker (https://github.com/wikimedia/operations-software-service-checker). I understand the spirit of such an endpoint, just trying to establish if there is something more concrete here.
[14:21:37] <_joe_> urandom: I wouldn't use a service-checker endpoint for a liveness test
[14:22:03] <_joe_> liveness means "the application is up"
[14:22:48] <_joe_> "Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations."
[14:23:42] <_joe_> so it should be a url served by the application, and one that does not depend on external factors
[14:24:12] <_joe_> so if the worker cannot accept new requests, it will correctly mark the pod as dead
[14:24:30] <_joe_> while it won't kill a pod just because cassandra is slow
[14:25:25] <_joe_> so, ideally a /healthz url where you expose some stats about the state of the application, returning 200 OK
[14:25:35] <_joe_> without needing to connect to cassandra
[14:25:58] <_joe_> as for the format of the data at that endpoint, *heh*, good question
[14:26:16] <_joe_> akosiaris: any ideas about ^^ ?
[14:27:23] Does the Prometheus exporter not satisfy this?
[14:27:47] liveness is going to be just a tcp socket probe
[14:28:13] the readiness probe (the pod is able to serve traffic) should be an HTTP endpoint that says "hey, I can receive traffic"
[14:28:42] don't confuse the 2 notions (and don't reuse the same probe for both) or else you are in for the surprises I met
[14:30:00] right now we use /_info as a readiness probe, and just connecting to the tcp socket as a liveness probe, for service-runner apps
[14:30:47] akosiaris: and what does /_info output?
[14:30:48] but I would like us to move to something more explicit, like a /healthz that returns 200 OK, for the readiness probe
[14:31:10] Just 200 OK? Content-length: 0?
[14:31:39] deploy1001:~$ curl http://mathoid.svc.eqiad.wmnet:10042/_info
[14:31:39] {"name":"mathoid","version":"0.7.1","description":"Render TeX to SVG and MathML using MathJax. Based on svgtex.","home":"https://github.com/wikimedia/mathoid"}
[14:32:30] <_joe_> akosiaris: I was sure it worked like you described, but the kubernetes docs confused me
[14:32:47] urandom: it's configurable btw
[14:32:49] see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request
[14:33:06] so you can return 200 OK with some custom header we agree on
[14:33:11] <_joe_> akosiaris: should /healthz's response contain any useful data?
[14:34:04] I'd say yes, at least for our own sanity. If the endpoint is able to answer the question "can I receive traffic?" it can probably also tell why it can't
[14:34:17] <_joe_> I think it should, possibly basic things like uptime / n requests served / items in queue?
[14:34:17] something like "Can't establish connection to cassandra"
[14:34:35] <_joe_> uhm, not sure that should be part of a readiness probe tbh
[14:34:45] <_joe_> else you'll end up with no pod able to connect
[14:35:01] <_joe_> and all being killed
[14:35:47] killed?
[14:35:55] what does killing have to do with the readiness probe?
[14:36:10] <_joe_> sorry, you're right
[14:36:16] <_joe_> they will all be marked as not ready
[14:36:19] express btw has https://www.npmjs.com/package/express-healthz-endpoint
[14:36:21] <_joe_> what happens then?
[14:36:25] which from what I see is really simple
[14:36:28] just 200 OK
[14:37:07] _joe_: no traffic reaches those pods
[14:37:38] if all pods have a problem, all will be depooled AFAIK and you will end up with no endpoints for the service
[14:37:39] <_joe_> ok, if cassandra is down, then all pods are not ready. What happens to whoever calls the service?
[14:37:52] connection refused
[14:38:00] akosiaris: the source repo for that is missing...
[14:38:14] urandom: ah, so it's not just me. OK, I was trying to find it
[14:38:20] damn...
[14:38:45] the node world is a chaotic place
[14:38:51] _joe_: which makes sense if all pods have that error. Cassandra is down, no point in accepting requests you can't serve
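In pod-spec terms, the liveness/readiness split described above maps roughly onto the following container fragment; the port, path and timings are illustrative, not the actual values used for service-runner apps:

```yaml
# container spec fragment -- illustrative port/path
livenessProbe:            # "the process is up": plain TCP connect, no HTTP
  tcpSocket:
    port: 8888
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:           # "I can receive traffic": HTTP 200 from the app
  httpGet:
    path: /_info          # or /healthz, as discussed
    port: 8888
  initialDelaySeconds: 5
  periodSeconds: 10
```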
[14:39:06] <_joe_> akosiaris: you can respond according to your api
[14:39:14] <_joe_> instead of "connection refused"
[14:39:24] <_joe_> seriously, it's a bad idea :)
[14:39:58] <_joe_> the only reason a pod should be not ready is: 1 - the application is not fully initialized, 2 - the server is not able to handle a request right now
[14:40:12] in this case it's 2
[14:40:16] and it's for all the pods
[14:40:16] <_joe_> anything that depends on other factors should be managed by the application
[14:40:18] <_joe_> nope
[14:40:33] assume btw all endpoints require a cassandra connection
[14:40:33] <_joe_> the application can respond 'service unavailable' to its clients
[14:40:44] <_joe_> which the clients will know how to handle
[14:40:59] <_joe_> or whatever else is its contract
[14:41:40] <_joe_> truncating access to a service because some of its backends are unavailable is wrong, imho
[14:41:50] some?
[14:41:53] I said all
[14:42:00] <_joe_> even in that case
[14:42:19] <_joe_> why can't the service respond to its clients "I can't connect to the db"?
[14:42:28] I guess it depends on whether or not that contract is "connection refused" :)
[14:42:29] in that case it's effectively a diff between spewing 500s and connection refused
[14:42:53] <_joe_> akosiaris: yes, and spewing 500s is something an rpc system can manage
[14:42:54] but it's not necessarily a good /healthz endpoint btw
[14:43:10] connection refused as well
[14:43:18] :)
[14:43:36] <_joe_> urandom: we can start with an empty /healthz response for now, and elaborate on that I guess
[14:43:43] anyway, we are arguing semantics here over a pathological case where we should have already been alerted by our monitoring
[14:44:57] OK, so to summarize, we need a `/healthz` endpoint (and I guess we're saying we want it to be called `/healthz`) that indicates readiness, which in this case *should* (for example) include connectivity to Cassandra
[14:45:09] <_joe_> urandom: nope :P
[14:45:13] lol
[14:45:13] No.
[14:45:15] OK
[14:45:21] <_joe_> not the connectivity part
[14:45:24] <_joe_> at least for now
[14:45:30] * volans grabbing popcorns
[14:45:47] so it indicates... that the server is capable of marshaling an HTTP response?
[14:46:31] at least for starters
[14:46:34] <_joe_> yes, which might not be the case if e.g. you cannot spawn new goroutines because you've reached the concurrency limit
[14:46:58] <_joe_> which can be caused by excessive gc, or cassandra being slow, etc
[14:47:21] or just too many requests
[14:47:27] <_joe_> or that, yes
[14:48:26] <_joe_> so the basic job of what the probe needs to do - avoid sending new requests to pods that are over capacity - will be fulfilled
[14:48:40] OK
[14:48:45] that is easy
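A minimal sketch of the "empty /healthz response for now" idea the discussion lands on, written in Go since Kask is a Go service; this is not Kask's actual code, and the port is a placeholder:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// /healthz: readiness endpoint. It only proves the process can accept a
	// connection and marshal an HTTP response; it deliberately does not
	// touch Cassandra, so a slow or unavailable backend never marks every
	// pod as not-ready on its own.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// ... the real storage handlers would be registered here ...

	log.Fatal(http.ListenAndServe(":8081", nil))
}
```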
[14:49:07] but then I guess we still need something to fulfill what https://github.com/wikimedia/operations-software-service-checker does
[14:49:14] something to use w/ icinga
[14:50:21] <_joe_> urandom: well, it depends on what you need, kask's interface is so simple that a simple fetch of a test key should be enough?
[14:50:55] _joe_: yeah, clarakosi and I were talking about this yesterday... it means "reserving" a key, I guess
[14:51:34] <_joe_> if so, you can either create a super-small swagger spec just dedicated to this, or you choose the format of your liking and implement it in service-checker :D
[14:52:06] a super-small swagger spec to house the `x-amples`
[14:52:29] <_joe_> urandom: as I said, I'm open to other formats
[14:52:57] <_joe_> if they come with an implementation for service-checker
[14:53:01] <_joe_> ;)
[14:54:38] ya.
[15:31:47] _joe_, akosiaris: fwiw: https://gerrit.wikimedia.org/r/#/c/mediawiki/services/kask/+/493249/
[15:32:04] serviceops, Operations, Prod-Kubernetes, Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (fgiunchedi) >>! In T214289#4984959, @fsero wrote: > It seems there are some issues on the swift side regarding container-real...
[15:32:24] and we'll come up with some basic alternative service test discovery for service-checker
[15:32:29] clarakosi: ^^ :)
[15:34:07] 👍
[15:41:01] _joe_: so the security-fenced k8s deploy idea, was that tentative, or is it on like donkey kong?
[15:41:46] * urandom is wondering whether he should circle in others in CPT
[15:49:23] <_joe_> urandom: I'm sorry, I should've written down the ideas that came out of the service operations meeting
[15:49:40] <_joe_> I'll try to get there
[15:58:26] #wikimedia-tech
[15:58:38] sorry, I am a klutz
[16:34:02] yoo hooo
[16:34:08] akosiaris: fsero how goes?
[16:37:19] ottomata: I think fsero is not going to be around for the next few (or more) days
[16:37:38] ottomata: but we are working on it with jijiki
[16:37:47] keeping detailed notes as we go
[16:39:21] k danke
[16:40:36] <_joe_> what are you working on?
[16:41:16] a better procedure for https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
[16:41:45] <_joe_> heh
[16:42:07] <_joe_> there is a lot of historical duplication to reduce there too
[16:44:12] akosiaris, jijiki: is the disabled puppet with reason 'enabling LVS for eventgate-analytics' yours?
[16:44:21] yes
[16:44:24] please don't enable
[16:45:00] didn't see any log, and I was disabling/enabling to test an icinga change; I got this on icinga1001 (passive)
[16:48:14] <_joe_> lol
[16:48:24] <_joe_> remember to check which icinga server is active :D
[16:48:43] what do you mean?
[16:48:50] that's for us
[16:48:56] ahhh :D
[16:49:23] use the "official" script for known hosts and you can use icinga.wikimedia.org ;)
[16:54:43] also please don't re-enable puppet on icinga2001, as it got your message now and not mine anymore
[17:13:42] volans: we are done with icinga
[17:14:16] jijiki: ack, thanks; actually it should be fine to re-enable. I'm currently in a meeting, so if icinga needs an update soon feel free to re-enable
[17:14:28] cool, thanks!
[17:14:29] ok, we are enabling, thx
[17:14:43] from my tests my patch should work fine
[17:14:49] worst case scenario, revert
[17:26:23] ottomata: done. eventgate-analytics.discovery.wmnet works fine
[17:26:50] we have some preliminary notes already. patch was pretty good to start with, so thanks!
[17:54:14] OH NOOOO
[17:54:15] meeting
[17:54:27] oops wrong chat
[18:44:46] serviceops, Operations, Performance-Team (Radar), User-Elukey, User-jijiki: Test different growth factors for memcached (prep step for upgrade to newer versions) - https://phabricator.wikimedia.org/T217020 (elukey) I had a very interesting chat with upstream on IRC (thanks dormando!) and it w...
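Circling back to the service-checker discussion earlier: a "super-small swagger spec to house the x-amples" might look roughly like the sketch below. The path and key are hypothetical, not Kask's actual API, and the x-amples/x-monitor layout is recalled from the service-template-node convention rather than taken from this conversation:

```yaml
# hypothetical spec fragment for service-checker -- not Kask's real API
paths:
  /sessions/v1/{key}:
    get:
      x-monitor: true
      x-amples:
        - title: retrieve a previously stored test key
          request:
            params:
              key: service-checker-test
          response:
            status: 200
```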
[19:10:15] serviceops, Operations, Performance-Team (Radar), User-Elukey, User-jijiki: Test different growth factors for memcached (prep step for upgrade to newer versions) - https://phabricator.wikimedia.org/T217020 (Joe) FWIW I would like to decrease the max key size instead of increasing it.
[20:31:09] serviceops, Release-Engineering-Team, Scap: Deploy scap 3.9.1-1 - https://phabricator.wikimedia.org/T217287 (thcipriani)
[20:31:47] serviceops, Scap, Release-Engineering-Team (Watching / External): Deploy scap 3.9.1-1 - https://phabricator.wikimedia.org/T217287 (thcipriani)
[23:21:14] [01:52:25] <_joe_> hashar: is there a way to have CI for docker-pkg fail if coverage has reduced? <-- probably yes. Just haven't looked into the python stuff yet