[06:17:19] serviceops, TechCom-RFC: RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (kchapman) @greg here in phab
[06:42:22] serviceops, Analytics, EventBus, Operations, and 2 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (akosiaris) Do we have logs of this happening?
[06:44:14] ottomata: yes you can talk directly to the pod. Something like curl http://<pod-ip>:<port>/_info
[07:33:58] service-checker-swagger kubernetes1001.eqiad.wmnet http://cxserver.svc.eqiad.wmnet:8080 All endpoints are healthy
[07:34:02] ok that's good finally
[08:37:29] serviceops, Operations, Performance-Team (Radar), User-Elukey, User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (elukey) >>! In T213089#4940510, @elukey wrote: > EDIT: after a chat with upstream it was suggested to me to follow up with Debi...
[08:44:04] serviceops, Operations, Performance-Team (Radar), User-Elukey, User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (elukey) Leaving here also a reference of https://github.com/memcached/memcached/issues/359: > Regression in systemd-based sandb...
[08:54:21] serviceops, CX-cxserver, Release Pipeline, Patch-For-Review, and 3 others: Migrate cxserver to kubernetes - https://phabricator.wikimedia.org/T213195 (jijiki)
[09:41:17] serviceops, Operations, Performance-Team (Radar), User-Elukey, User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (MoritzMuehlenhoff) >>! In T213089#5023411, @elukey wrote: > Leaving here also a reference of https://github.com/memcached/memcac...
[13:01:20] serviceops, CX-cxserver, Release Pipeline, Patch-For-Review, and 3 others: Migrate cxserver to kubernetes - https://phabricator.wikimedia.org/T213195 (jijiki) K8s is currently serving ~8% of total traffic, we will ramp it up to 50% tomorrow, please ping us if there are any issues
[13:03:30] serviceops, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Kanban (Doing), and 3 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (jijiki)
[13:38:05] serviceops, Analytics, EventBus, Operations, and 2 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (Ottomata) https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.03.13/eventgate?id=AWl4sxguNBo9dX1kfcii&_g=(re...
[13:43:07] serviceops, Operations, Core Platform Team Backlog (Later), Patch-For-Review, Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Ottomata)
[14:05:25] akosiaris: o/
[14:05:43] is there an easy way to cause a full service restart of all pods?
[14:05:50] i could do a fake 'upgrade' somehow
[14:06:06] i want to see if i can get one of the pods in the bad state we encountered yesterday without any load
[14:06:27] sure
[14:06:31] kubectl delete pods --all
[14:06:38] ok great! :)
[14:06:46] be prepared for a page :P
[14:07:09] shouldn't it do it one at a time?
[14:07:29] you can also just delete a pod enough times
[14:07:30] like it does during an upgrade?
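A minimal sketch of the direct-to-pod check suggested at 06:44, assuming an eventgate-analytics namespace and the service port mentioned later in the log (both are assumptions to adapt):

```
# Find the pod IPs (namespace is an assumption; adjust to the real one)
kubectl get pods -n eventgate-analytics -o wide

# Hit the service-runner info endpoint on a pod directly
curl http://<pod-ip>:8192/_info
```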
[14:07:43] delete pods --all will just delete all pods
[14:07:46] it will bring the service down
[14:08:00] that's on the docs page :-) well only for mathoid
[14:08:02] right, but if i delete a single pod, does it spawn a new one before really deleting the old one?
[14:08:03] if you want it to happen like a rolling upgrade you need to do an update
[14:08:06] but eventually it will be on some central page
[14:08:09] yeah want to do that
[14:08:19] ooohhh
[14:08:23] * apergos follows along
[14:08:25] but if you delete a single pod, it will spawn a new one
[14:08:46] maybe if I --set some unused value?
[14:08:51] with an upgrade?
[14:09:04] no, it needs to have the deployment object changed
[14:09:07] aye
[14:09:15] let me get this straight
[14:09:25] sometimes, a pod is not able to do something
[14:09:33] something == networking? reaching out to kafka?
[14:09:48] and we are trying to pin down what exactly
[14:10:29] i'm not 100% sure yet. as I thought about it this morning, it might be due to local produce load. with enough produce requests waiting for ACK from kafka, the local producer queues might fill up, and might cause kafka produce requests to time out. so that could have been it.
[14:10:30] but
[14:10:41] petr and I also had problems before we turned on group1 wikis
[14:10:45] with a single pod
[14:10:49] after an upgrade
[14:11:02] group0 wikis are very low volume
[14:11:08] ~10 msgs / second
[14:11:46] i'm not 100% sure that it was the same problem though. there were a couple of issues. one of them was that service-runner wasn't restarting its worker process properly if it died during startup.
[14:11:56] akosiaris: follow-up to the session storage change https://gerrit.wikimedia.org/r/c/operations/puppet/+/496445 we'll need it to make Icinga happy
[14:12:02] petr fixed that in service-runner, and the deployed version of eventgate uses that fix now.
[14:12:29] also.
[14:12:36] akosiaris: yesterday when a pod had this problem
[14:12:41] it wouldn't respond on /_info
[14:12:46] but, it was still pooled.
[14:12:54] so mw kept sending requests to it (and timing out)
[14:13:15] sorry, s/this problem/one of these problems/
[14:13:25] just a bit though, kubernetes would have depooled it after a bit
[14:13:36] they aren't all 100% differentiated.
[14:13:42] it would have?
[14:13:49] yup
[14:13:56] it would not have killed it, but it would have depooled it
[14:14:47] oh interesting,
[14:14:51] so readiness will depool if it fails
[14:14:57] liveness will kill if it fails
[14:15:00] yup
[14:15:02] hm.
[14:15:21] think of liveness like what pybal does
[14:15:27] ehm
[14:15:28] aye
[14:15:28] dammit
[14:15:30] readiness
[14:15:36] readiness is what pybal does if it helps
[14:15:45] liveness does not exist in our current infrastructure
[14:15:54] Interesting.
[14:16:18] so. i wonder
[14:16:29] if the problem yesterday was overloaded local kafka producers
[14:16:35] if that could cause a cascading problem.
[14:16:48] maybe there's some threshold we reached where there were too many msgs / sec on a single instance
[14:17:00] we were pushing 4K / second yesterday, and we have 20 pod replicas
[14:17:31] if a single replica has too much, it would possibly be depooled, which would cause other replicas to get more traffic, possibly pushing them over some threshold?
[14:17:41] just an idea, i'm still not sure if the producer is the problem.
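Since the deployment object has to change to get a rolling restart, one common trick is to patch a throwaway annotation into the pod template; a sketch with plain kubectl rather than the scap-helm workflow actually used here (the deployment name, namespace and annotation are illustrative):

```
# Deletes every pod at once and takes the service down:
kubectl delete pods --all -n eventgate-analytics

# Changing the pod template instead makes Kubernetes replace pods gradually,
# like an upgrade would:
kubectl patch deployment eventgate-analytics-production -n eventgate-analytics \
  -p '{"spec":{"template":{"metadata":{"annotations":{"restarted-at":"2019-03-14T14:09"}}}}}'
```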
[14:17:54] i'm going to see if i can reproduce on an unloaded cluster now
[14:18:06] and also, i'm going to see if we can get some monitoring stats out of the rdkafka client
[14:18:12] i know we can, but i think we have to code it in there.
[14:18:14] if the total load was below the capacity you would have seen flapping events
[14:18:46] pods receiving traffic, then not receiving traffic as they are depooled and continue to process their queued events, then again receiving traffic
[14:18:48] right, where would we have seen that? where can we see what is pooled/depooled?
[14:18:52] per pod?
[14:19:09] in scap-helm status there is a ready column and a number
[14:19:19] for eventgate a pooled pod would be 2/2
[14:19:24] Ah
[14:19:26] for a depooled one it would be 1/2
[14:19:26] cool
[14:19:29] got it
[14:19:39] i don't remember seeing anything but 2/2 but i wasn't looking for that.
[14:19:56] i'm going to downgrade the image and rolling restart that way
[14:20:00] 2 being the number of containers in the pod btw (the actual app + the statsd exporter)
[14:20:05] that'll also put us back on the buggy service-runner
[14:20:09] ah
[14:20:31] Pchelolo: FYI ^
[14:20:40] reading
[14:22:11] mutante: thanks! merged
[14:22:39] akosiaris: thanks, i saw. i got the rest on icinga2001
[14:28:05] the ferm service does not start yet, because:
[14:28:07] DNS query for 'sessionstore1001-a.eqiad.wmnet' failed: NXDOMAIN
[14:28:46] the icinga part is happy again though, and that's why it notices the unrelated thing now
[14:29:05] so fyi Pchelolo I can reproduce the broker transport failure thing, with the old service-runner version the worker is not restarted.
[14:29:15] akosiaris: so
[14:29:19] in the state this thing is currently in.
[14:29:31] akosiaris: does k8s export any metrics about # of pods ready vs # of total pods for each deployment?
[14:29:33] pod eventgate-analytics-production-5d866bc9dd-5mk7c with IP 10.64.64.11
[14:29:37] ottomata: ok, is it restarted with the new one? if yes - at least one bug was fixed :)
[14:29:37] is not responding to http
[14:29:42] mutante: yeah, fixing now in DNS
[14:29:48] but, it seems it is not depooled at all
[14:30:07] Pchelolo: it is currently using the old service-runner with the bug
[14:30:17] i downgraded to force a rolling restart to see if i could repro this part of the problem.
[14:30:19] cdanis: hm, I'll have to look it up
[14:30:45] * akosiaris looking at eventgate-analytics-production-5d866bc9dd-5mk7c
[14:31:03] i also noticed on the k8s pods dashboard in grafana that all container network traffic seems to get charged to the calico-policy-controller containers, which makes some sense, but it's also unfortunate there's no easy way there to do the association
[14:31:29] cdanis: that probably is a bug in the dashboard
[14:34:47] ya the liveness tcp check i think should also fail
[14:34:51] i can't telnet to that port
[14:35:16] i can get a shell on that node tho!
[14:35:48] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Cmjohnson) Should've checked this but thumbor1004 is out of warranty.
[14:35:56] it does look like it's running
[14:36:15] ya the master service-runner process is running.
[14:36:21] but, the worker is not (i think)
[14:36:39] no ps on the pod so I don't really know how to confirm that
[14:36:52] * akosiaris having a closer look at that pod
[14:37:03] ottomata: heh, ls /proc maybe?
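The per-pod pooled/depooled state described above is just the READY column, so it can also be checked straight from kubectl; a sketch with illustrative output (the namespace and the second pod name are made up):

```
kubectl get pods -n eventgate-analytics
# NAME                                              READY   STATUS    RESTARTS   AGE
# eventgate-analytics-production-5d866bc9dd-5mk7c   1/2     Running   0          21m   <- readiness failing, depooled
# eventgate-analytics-production-5d866bc9dd-x0000   2/2     Running   0          21m   <- both containers ready, pooled
```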
[14:37:16] heh doh
[14:37:18] if there's no ls you can echo /proc/* 🙃
[14:37:40] i believe the worker was pid 140 (according to logs) and that pid is not there
[14:37:42] but that one died
[14:39:13] {"name":"eventgate","hostname":"eventgate-analytics-production-5d866bc9dd-5mk7c","pid":1,"level":40,"message":"first worker died during startup, continue startup","worker_pid":140,"exit_code":1,"startup_attempt":1,"levelPath":"warn/service-runner/master","msg":"first worker died during startup, continue startup","time":"2019-03-14T14:22:35.089Z","v":0}
[14:39:24] so the master runs but the worker doesn't?
[14:39:35] and the master service-runner process does not restart the worker?
[14:39:41] is that the bug fixed yesterday?
[14:40:41] yes that's what petr fixed yesterday
[14:40:49] i dunno why this happened in the first place though
[14:40:54] why did the broker transport fail?
[14:41:14] it seems like it originally was able to connect to the broker
[14:41:31] hmm maybe the connect code doesn't die on failure there? will check...
[14:42:32] so the service-runner bug with worker startups was fixed
[14:42:45] get pods eventgate-analytics-production-5d866bc9dd-5mk7c -o yaml says no liveness/readiness probes
[14:42:52] hm!
[14:43:07] something wrong in the deployment.yaml of the chart?
[14:43:20] that explains why it wasn't killed or even depooled
[14:43:24] yeah looking
[14:45:09] oh yes
[14:45:11] akosiaris: i see it
[14:45:17] liveness_probe vs livenessProbe
[14:45:26] ah indeed
[14:45:28] because I removed it from the values and it got a bad copy-paste
[14:45:29] fixing.
[14:45:32] ok that is good to find!
[14:46:49] serviceops, Operations, ops-eqiad: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (Cmjohnson) The error didn't appear again (yet) but I created a task with Dell. Worst case they push back... best case they send a DIMM. We're less than 30 days from end of warranty.
[14:47:01] hmm, I missed that in review as well
[14:47:18] * akosiaris notes to self to not allow people to deviate much from the scaffolding
[14:47:26] akosiaris: one more thing I didn't like yesterday is 1 worker per pod
[14:47:39] oh Pchelolo right, should I change that?
[14:47:43] yeah that's a big discussion we need to jumpstart at some point
[14:47:56] ottomata: what would you set it to though?
[14:48:02] 0 ?
[14:48:07] 2
[14:48:08] that's actually worse
[14:48:09] 2!?
[14:48:20] there was a situation when a worker would get overloaded, the master would kill it, and for the duration of the restart the master would be left dangling, bombarded with requests
[14:48:22] why 2 and not 3 ?
[14:48:27] not knowing what to do with the requests
[14:48:41] hm
[14:48:41] well, 2 workers would also be overloaded
[14:48:42] aye
[14:48:45] 3 as well
[14:48:53] do I need to keep counting? ;-)
[14:49:04] akosiaris: I get your point
[14:49:06] but
[14:49:09] (probe fix: https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/496454/ )
[14:49:37] with 2 workers at least we decrease the probability of both being dead together
[14:50:26] sure, which is why I said it's a big discussion we need to have at some point. Decide on a sensible pod size. We've had this discussion as well with cdanis recently
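For reference, a sketch of what the rendered container spec should contain once the liveness_probe/livenessProbe mix-up is fixed; the port, path and the choice of tcpSocket vs httpGet are assumptions rather than the actual chart contents, and the timing fields shown are just the Kubernetes defaults discussed further down:

```yaml
# Kubernetes only recognises the camelCase field names in the pod spec.
livenessProbe:
  tcpSocket:
    port: 8192           # service port (assumption; take it from the chart values)
  periodSeconds: 10      # default: probe every 10s
  timeoutSeconds: 1      # default: each probe times out after 1s
  failureThreshold: 3    # default: kill/restart the container after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /_info         # service-runner info endpoint (assumption)
    port: 8192
  # same defaults; failing readiness only removes the pod from the Service endpoints (depool)
```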
[14:50:50] there are many factors at play, one of those being the extra reliability given to us by service-runner
[14:51:13] there is also the issue of memory usage, but also the issue of a sensible pod size for local development
[14:51:36] I can tell you why I chose 1 and not 0 and I can tell you why I did not choose ncpu for that setting
[14:51:52] but 1 vs 2 vs 3 etc? not so easily
[14:52:29] ok, will leave at 1 til we figure that ouy :)
[14:52:31] out*
[14:52:53] yeah we need to run some numbers on this
[14:53:07] anyway +1'ing the change
[14:53:10] one thing I do not really know is what the master is doing when it has no workers. I'll do some reading on that
[14:53:34] looks like it sends back connection refused?
[14:53:36] which is weird
[14:53:53] we are in this state right now for eventgate-analytics-production-5d866bc9dd-5mk7c
[14:54:00] and it is indeed returning connection refused
[14:54:38] I should spend some time reading the service-runner code heh
[14:54:48] what causes the worker process to die? what does service-runner attempt to do when that happens?
[14:54:57] kinda makes sense... The alternative for it would be to buffer connections until a worker shows up
[14:55:14] cdanis: I can give you a quick course on the service-runner stuff
[14:55:32] fwiw with that approach and a proper liveness probe the pod would have been killed
[14:55:48] with the former (piling up connections) it would have been depooled
[14:56:10] I am wondering why the master did not restart the worker btw
[14:56:21] but I'll read up on the fix in the service-runner code
[14:56:24] akosiaris: we can't really control what the master would do AFAIK, it's done by the node cluster module
[14:56:37] akosiaris: the restarting was a stupid bug
[14:56:56] how often is the liveness probe executed?
[14:57:50] it's configurable
[14:58:00] default is 10 secs
[14:58:10] the default timeout is 1s
[14:58:37] ah and the default threshold is 3
[14:59:06] so if a pod does not answer the probe in under 1 sec for 3 attempts (one every 10s) it will be killed/depooled
[14:59:25] but we can tune that
[14:59:56] so if a worker is dead then by default we will have a pod that for 30 seconds just refuses connections?
[15:00:13] dead and not coming back for some reason
[15:00:22] yup
[15:00:58] but we probably don't want to be to aggressive in our tuning because the worker can come back
[15:01:01] too*
[15:04:09] we can also tune the mem constraints in service-runner
[15:04:23] lower them and increase the number of workers to 2 or 3
[15:04:26] * mobrovac prefers 2
[15:04:42] ottomata: /me still looking into the pod, please don't kill it :-)
[15:05:06] ottomata: one more thing - you create a single producer per worker right?
[15:05:18] Flags [R.], seq 0, ack 2114374707, win 0, length 0
[15:05:26] heh, yeah the pod is sending back a TCP RST
[15:05:36] so it never accept()s
[15:05:43] interesting that the master does that
[15:05:54] Pchelolo: there are actually 2, one guaranteed, one 'hasty'
[15:05:59] akosiaris: k
[15:10:15] hm.. akosiaris that's indeed weird... the default scheduling policy is round robin. According to node docs: "round-robin approach, where the master process listens on a port, accepts new connections and distributes them across the workers in a round-robin fashion"
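Stepping back to the worker-count discussion above: the knob in question lives in the service-runner config, roughly like the sketch below (the values are illustrative, not the actual eventgate settings):

```yaml
# service-runner master/worker settings (illustrative values)
num_workers: 1             # the setting being debated: 1 today, maybe 2 or 3 later
worker_heap_limit_mb: 300  # the "mem constraints" mentioned above: the master recycles a worker that grows past this
```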
[15:10:23] so it should at least 'accept'
[15:10:45] maybe there's some 'optimization' where it doesn't even accept if it knows it has no workers
[15:14:02] funny, stracing it gives me nothing
[15:14:08] it's stuck in an epoll_wait
[15:14:37] it has a crapton of child threads in futex()
[15:15:38] warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
[15:15:43] great.. gdb is not going to be of any help
[15:17:08] the master does not do the accept() afaik
[15:17:15] s/afaik/iirc/
[15:17:19] heh
[15:17:27] it does not even listen() anymore
[15:17:30] lsof says
[15:17:37] nodejs 3481 900 16u IPv4 340464272 0t0 TCP 127.0.0.1:9229 (LISTEN)
[15:17:38] ok that's weird
[15:17:40] nodejs 3481 900 11u IPv4 340262059 0t0 UDP *:35342
[15:18:01] and the service port is on 8192
[15:18:16] so somehow when the worker died it took the listener with it
[15:18:22] or at least that's my current theory
[15:18:30] ah
[15:18:32] yes
[15:18:47] that's how it works
[15:19:02] setting up listen() is done in the worker, but delegated to the master
[15:19:14] wait what?
[15:19:15] so when the worker goes away it's only correct for the master to close the socket
[15:19:16] I don't follow
[15:19:32] delegated? how?
[15:19:32] ok so
[15:19:39] the master starts up
[15:19:41] no listeners
[15:20:00] now, the worker starts up, does listen(), but that is actually not executed in the worker, but in the master
[15:20:06] hence the "delegation" part
[15:20:34] because the call to listen() originated in the worker, it is only logical that the master would abandon the listener once the (last) worker dies
[15:21:14] I don't follow how the worker actually calls listen() but it is not executed by the worker, but rather by the master
[15:21:26] what am I missing?
[15:22:40] anyway I guess I'll have to take a peek at the node cluster module
[15:22:54] ottomata: I think my debugging session is done, feel free to kill the thing
[15:26:14] akosiaris: "As long as there are some workers still alive, the server will continue to accept connections. If no workers are alive, existing connections will be dropped and new connections will be refused."
[15:26:33] from https://nodejs.org/docs/latest-v10.x/api/cluster.html#cluster_how_it_works
[15:29:25] mobrovac: thanks. that will be useful in understanding what is going on there
[15:31:28] akosiaris: ok cool will do
[15:31:37] i'm working on adding kafka queue metrics to eventgate...
[15:32:49] so I think we've addressed 2 different bugs already (service-runner and liveness+readiness probes). The one left is the number of workers we want, but it's best we file a task and start pouring some numbers in there to figure out what the best value is
[15:34:47] you have just volunteered to do that i think :P
[15:35:19] I can create the task, but don't expect much input into it for the next couple of weeks at least
[15:35:56] thnx
[15:44:37] do we have any way of performing eventgate load testing btw?
[15:45:28] you can shove events to it?
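The behaviour dissected above can be reproduced with the plain node cluster module, independent of service-runner; a small standalone sketch (not the real eventgate code) of what happens when the only worker dies and nothing restarts it:

```javascript
// Sketch of the node cluster behaviour discussed above (plain Node, not service-runner).
const cluster = require('cluster');
const http = require('http');

if (cluster.isMaster) {
    // The master never listens itself; it just forks workers and, behind the
    // scenes, owns the listening socket on their behalf.
    cluster.fork();
    cluster.on('exit', (worker, code) => {
        // With no restart logic here (the bug the old service-runner hit), the
        // master keeps running, but once the last worker is gone it drops the
        // listener, so clients get RST / connection refused rather than timeouts.
        console.log(`worker ${worker.process.pid} exited with ${code}, not restarting`);
    });
} else {
    // listen() is called in the worker but delegated to the master, which
    // accepts connections and hands them to workers round-robin.
    http.createServer((req, res) => res.end('ok\n')).listen(8192);
    // Simulate the worker dying (e.g. after a failed kafka connect).
    setTimeout(() => process.exit(1), 10000);
}
```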
[15:47:11] cdanis: there's a draft idea in https://wikitech.wikimedia.org/wiki/User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps
[15:47:22] AFAIK otto.mata used it in the first drafts of the charts
[15:47:32] a few other people from releng as well
[15:49:26] serviceops, Operations, ops-codfw, User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (jijiki) We are still having errors, I am depooling. @Papaul ` [Thu Mar 14 11:56:00 2019] perf: interrupt took too long (4960 > 4946), lowering kernel.perf_event_max_samp...
[15:52:33] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki) We are still having errors ` [Thu Mar 14 14:42:19 2019] mce: [Hardware Error]: Machine check events logged [Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: HAN...
[16:07:28] hello
[16:07:58] some CI builds fail due to a segfault in HHVM. Some traces point at libc, so we could use a rebuild of docker-registry.wikimedia.org/wikimedia-stretch
[16:08:05] to get the latest libc (which mentions a fix)
[16:08:22] but I have no idea where that image is defined or how to propose a change.
[16:08:27] see https://phabricator.wikimedia.org/T216384#5024505
[16:25:06] akosiaris: should I find node service metrics from the staging cluster in prometheus eqiad/ops?
[16:26:46] eqiad prometheus/k8s-staging
[16:27:08] for the kubernetes stuff
[16:27:17] and of course the hosts are under eqiad/ops as well
[16:27:27] https://grafana.wikimedia.org/d/000000472/kubernetes-staging-kubelets?orgId=1 already exists btw
[16:28:30] cool found it!
[16:28:39] want to look in prometheus for the new kafka metrics i'm emitting
[16:38:14] statsd?
[16:38:32] if it's statsd it's probably already making it to prometheus
[16:39:27] akosiaris: by any chance would you be able to rebuild the docker-registry.wikimedia.org/wikimedia-stretch image please? I have CI builds failing due to a libc bug :/
[16:39:49] or maybe someone else from serviceops in a better timezone would be able to do so
[16:40:01] I could not find where or how the image is built :-\
[16:42:46] serviceops, Operations, Core Platform Team (Session Management Service (CDP2)), Core Platform Team Kanban (Doing), and 4 others: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (akosiaris) sessionstore hosts setup ` akosiaris@sessionstore1001:~$ nodetoo...
[16:44:22] akosiaris: is the eventgate-analytics grafana dash generated automatically?
[16:44:29] can I edit it and add charts?
[16:44:39] ottomata: yup you can edit it
[16:46:26] coo
[16:47:22] is there any plan for moving service-runner away from statsd_exporter and onto exporting prometheus-native /metrics btw?
[16:52:05] if service-runner supports doing that, it might be possible
[16:52:14] metrics.type: prometheus
[16:52:47] not that I know of
[16:52:50] but it would be nice
[16:53:14] it's not supported right now
[16:53:31] there are some fairly good packages in node for prometheus
[16:53:55] hashar: yeah I can, what is the bug btw?
[16:54:24] but we a) didn't have time to do it and b) aren't sure how to do it without needing to rewrite all the apps
[16:54:54] yeah makes sense :)
[16:57:38] akosiaris: https://phabricator.wikimedia.org/T216384#5024505 "Integrate Stretch 9.8 point update"
[16:57:46] I could use a rebuild today
[16:57:58] akosiaris: do you know if there is an easy way to export/import charts from another dash and then edit them?
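On the statsd vs prometheus-native question above: at this point service-runner only knows how to emit statsd, which the statsd_exporter sidecar (the second container in the 2/2 pods) translates for prometheus. A sketch of the service-runner side of that, with an assumed host/port for the sidecar:

```yaml
# service-runner metrics config (sketch; the sidecar address is an assumption)
metrics:
  type: statsd        # a native "prometheus" type does not exist at this point
  host: localhost     # statsd_exporter listening in the same pod
  port: 9125
```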
[16:58:01] so I can then rebuild the fleet of affected CI containers tonight and call it done before it's friday
[16:58:14] (also that blocks some teams :()
[16:58:17] i want to take the rdkafka-related charts from https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1 and put them in eventgate and then edit to select the right metrics
[17:00:25] ottomata: you can dump the json of the dashboard IIRC
[17:00:31] yaa, hm
[17:00:36] but prob not import, but maybe i can edit....
[17:00:53] oh i guess i could dump both, edit, and then import a new dash.
[17:01:07] oh! i can edit! cool.
[17:01:16] ok i got it, thanks akosiaris will figure it out
[17:01:28] yw
[17:08:39] hashar: done
[17:08:39] serviceops, Operations, Thumbor, ops-eqiad, User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (RobH) a: Cmjohnson→RobH So this has a memory error and is out of warranty. This means we should look at decommissioning this host and ordering a replacement....
[17:18:41] akosiaris: amazing. thank you
[17:18:44] kids & dinner time
[17:18:51] will rebuild the ci ones tonight
[17:46:56] serviceops, Scap: Scap2 to use etcd for target servers - https://phabricator.wikimedia.org/T218328 (jijiki) p: Triage→Normal
[17:47:35] serviceops, Scap, Release-Engineering-Team (Watching / External), User-jijiki: Allow scap sync to deploy gradually - https://phabricator.wikimedia.org/T212147 (jijiki)
[17:47:36] serviceops, Scap: Scap2 to use etcd for target servers - https://phabricator.wikimedia.org/T218328 (jijiki)
[19:01:01] serviceops, Release-Engineering-Team: Our docker base images lack tags - https://phabricator.wikimedia.org/T218342 (hashar)
[20:13:09] serviceops, Operations, RESTBase, RESTBase-API, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (Pchelolo) a: holger.knust Verified that we can work with swagger-ui 3+ once we make the spec standard-compliant. Let's begi...
[20:20:59] anybody still around that understands the statsd prometheus exporter?
[20:33:08] ottomata: post up a patch and I'll have a look tomorrow
[20:33:29] ok, i'm testing in minikube and not having much luck
[20:33:45] am able to curl /metrics, but not getting anything matched.
[20:34:13] lol. there should be the non-matched metric then
[20:34:18] in the logs?
[20:34:21] nope
[20:34:25] in /metrics
[20:34:32] the one derived by s/./_/
[20:34:34] oh yes, i have the fully flattened autogenerated ones
[20:34:39] e.g. eventgate_analytics_rdkafka_producer_guaranteed_eventgate_analytics_brokers_192_168_99_100_30092_1_rx
[20:34:51] trying to match them to extract better names and labels
[20:35:01] serviceops, Operations, Operations-Software-Development, User-Joe, and 2 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (crusnov) a: crusnov
[20:35:02] yeah the autogenerated one will disappear when the statsd metric is matched correctly
[20:35:21] right, i wish i could get some logs out of it, i increased statsd_exporter logs to debug
[20:35:30] but, then i looked at the code, and there are barely any log statements in the codebase :p
[20:35:32] well "disappear"... it will not increase any more
[20:35:54] but it will disappear after a restart of the exporter (or just a pod restart)
[20:36:32] ya, i'm deleting and installing new releases each time i change something.
[20:40:44] AH I MATCHED ONE!
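The matching being fought with here is statsd_exporter's YAML mapping file; a sketch of the kind of rule involved, assuming the dotted statsd name behind the flattened metric above looks roughly like `<service>.rdkafka.producer.<type>.<client>.brokers.<broker>.rx` (the exact hierarchy is a guess):

```yaml
mappings:
  # Glob match: each * is captured as $1, $2, ... for use in name/labels.
  # The statsd name layout is inferred from the flattened metric above and may
  # not match the real hierarchy exactly.
  - match: "*.rdkafka.producer.*.*.brokers.*.rx"
    name: "rdkafka_producer_broker_rx"
    labels:
      service: "$1"
      producer: "$2"
      client: "$3"
      broker: "$4"
```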
[20:40:47] COOOOOOL
[21:55:52] serviceops, Release-Engineering-Team (Watching / External): Our docker base images lack tags - https://phabricator.wikimedia.org/T218342 (greg)