[07:50:13] 10serviceops, 10Operations, 10Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff)
[09:17:40] <_joe_> tarrow: sorry I was off yesterday (bank holiday across southern europe)
[09:17:46] <_joe_> yes it would be ok for sure
[09:18:09] <_joe_> urandom: did you find out what the problem was?
[12:15:34] Which problem? The chart?
[13:00:23] 10serviceops, 10MediaWiki-Maintenance-scripts, 10Operations: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 (10CDanis)
[13:00:31] 10serviceops, 10MediaWiki-Maintenance-scripts, 10Operations: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 (10CDanis) p:05Triage→03Normal
[13:18:43] _joe_: you mean the chart?
[13:19:28] _joe_: actually, it doesn't really matter what problem you mean, I have many and haven't solved anyone of them, so, no. :)
[13:19:37] s/anyone/any one/
[13:27:02] <_joe_> urandom: yeah the chart not updating
[13:27:34] <_joe_> (I'm also alone both today and next week, so my support capabilities are quite limited)
[13:28:02] _joe_: no joy; I ended up creating a hacked version of the image that ignored TLS configuration, and updated the chart to use HTTP for liveness
[13:28:48] <_joe_> that's pretty strange, and I'd like to try to debug this later
[13:29:19] _joe_: the larger problem (for me) is the aberrant latency running in k8s
[13:29:41] <_joe_> what is that?
[13:30:03] I tried running some tests outside of a container in the restbase-dev environment, and couldn't replicate it there either (same Cassandra cluster staging is using too, fwiw)
[13:30:30] <_joe_> oh you mean the occasional slow request you were talking about at our last meeting?
[13:30:44] _joe_: yeah, about 1 in 2
[13:30:53] _joe_: yup, same issue
[13:30:58] <_joe_> 1 out of 2 requests?
[13:31:02] half, yes
[13:31:08] <_joe_> uhm
[13:31:19] half is ~5ms, half is ~50ms
[13:31:24] <_joe_> you have it both in prod and in staging?
[13:31:27] <_joe_> or just in staging?
[13:31:28] yes
[13:31:34] prod and staging
[13:31:56] _joe_: https://grafana.wikimedia.org/d/000001590/sessionstore?panelId=47&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=sessionstore&from=1565958710359&to=1565962310360
[13:32:52] _joe_: it's like something blocks for ~40ms every other request or something
[13:33:23] I've collected some profile information too (https://phabricator.wikimedia.org/T229697), but nothing stands out
[13:34:33] <_joe_> I think you need to know which requests are slow
[13:34:36] <_joe_> and which are not
[13:34:46] <_joe_> also which pod caused it
[13:35:17] <_joe_> you can repro this in staging as well, where there is no cross-dc replica or anything justifying any latency
[13:35:22] <_joe_> this is *read* latency?
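
A minimal sketch of one way to capture per-request timings and see the bimodal ~5ms/~50ms split discussed above, in the spirit of _joe_'s suggestion to identify which requests are slow. It assumes a local kubectl port-forward to the staging service; the service name, local port, and request path here are illustrative assumptions, not the actual wrk setup described later in the log.

    # Forward the staging Service locally (service name and port are assumed).
    kubectl port-forward service/kask-staging 8081:8081 &
    # Issue 200 identical requests and record the wall-clock time of each one;
    # a bimodal latency pattern shows up as two clusters in the sorted output.
    for i in $(seq 1 200); do
      curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8081/healthz
    done | sort -n
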
[13:43:12] _joe_: sorry, was finishing up in a meeting
[13:43:26] _joe_: it's not a particular kind of request
[13:44:06] I should maybe post some other profile and test information to establish that for posterity's sake, but it's reads, writes, mixed; it doesn't seem to matter
[13:44:41] obviously the latency is a little different for reads vs. writes, but the wonky distribution is still there
[13:45:21] _joe_: and yes, it's staging and production, and cross-dc replication doesn't seem to factor in
[13:45:30] _joe_: in fact, the latency isn't coming from Cassandra
[13:46:06] <_joe_> urandom: it would be interesting to scale the deployment down to 1 pod
[13:46:08] in the sense that there is no latency with Cassandra that corresponds to this anyway
[13:46:13] <_joe_> and see if it happens with a single pod
[13:46:21] isn't staging already one pod?
[13:46:28] <_joe_> not necessarily
[13:46:54] hrmm... wouldn't I see all of the pods in a `status`?
[13:47:07] (there is only ever one that I can see)
[13:47:29] <_joe_> yeah it's just one
[13:47:39] <_joe_> so the same pod
[13:47:43] <_joe_> this is insane
[13:47:44] <_joe_> :P
[13:47:57] <_joe_> I have one further question
[13:48:04] sorry, everything seems insane these days... which part is insane?
[13:48:06] <_joe_> how do you measure that latency?
[13:48:17] I'm using `wrk`
[13:48:30] <_joe_> how do you connect to sessionstore?
[13:48:34] I also have a Lua script and a flat file of JSON requests
[13:48:44] <_joe_> I mean what IP:port you use
[13:48:50] port forward with kubectl
[13:48:57] <_joe_> uhm
[13:49:29] <_joe_> can you try to connect to stagingip:port directly?
[13:49:53] OK, so this circles us back to our meeting when I asked about that
[13:50:20] there is an instance on kubernetes1001.eqiad.wmnet:8081 (staging uses 8081, production 8080)
[13:50:21] <_joe_> I'm trying to root out possible perturbations
[13:50:33] <_joe_> that's a prod node though
[13:50:38] but... when I started deploying new images, that one did not update
[13:50:41] OK
[13:50:51] I have no idea where to find it then
[13:51:32] <_joe_> kubestage1001.eqiad.wmnet
[13:51:53] _joe_: most everything I know here has been handed down to me as stories around a campfire
[13:52:20] <_joe_> well there is a lot of data on wikitech nowadays
[13:52:31] <_joe_> anyhow, I have a meeting in 5 minutes, I'll be around later
[13:53:17] _joe_: links to that data would be helpful
[13:53:49] _joe_: and `curl -D - -X GET http://kubestage1001.eqiad.wmnet:8081/healthz` does not work
[13:56:32] <_joe_> try 10.64.0.247 (the IPv4 IP)
[13:56:59] <_joe_> oh wait
[13:57:06] <_joe_> it's only exposing on port 8080
[13:57:17] !?
[13:57:51] what is the difference between Port, TargetPort, and NodePort?
[13:57:58] just looking at `KUBECONFIG=/etc/kubernetes/sessionstore-staging.config kubectl describe service kask-staging`
[13:58:27] <_joe_> there is a difference, but I see nodePort is set to 8081
[13:58:35] <_joe_> which means it should listen on port 8081
[13:58:40] <_joe_> sorry I really gtg
[15:40:56] <_joe_> urandom: I'll try to take a better look on Monday
[15:45:26] _joe_: cool, thanks
[15:58:10] <_joe_> sorry, I'm alone, have less familiarity with what changed in the last 6 months in our kube cluster, and have a ton of interviews and meetings. So my support is really limited :(
[17:00:02] _joe_: I know, I understand
[17:01:05] _joe_: I appreciate the help you are able to give :)
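
For the Port/TargetPort/NodePort question raised above, a hedged sketch of how the three fields of a Kubernetes Service relate, using the describe command quoted in the log. The port values shown are illustrative, based on the 8080/8081 numbers mentioned in the discussion, and are not confirmed against the actual cluster.

    # Inspect the staging Service (command as quoted in the log above).
    KUBECONFIG=/etc/kubernetes/sessionstore-staging.config \
      kubectl describe service kask-staging
    #
    # Illustrative reading of the three fields (values assumed, not confirmed):
    #   Port:        8081/TCP  -> the Service's cluster-internal port (ClusterIP:8081)
    #   TargetPort:  8080/TCP  -> the container port the Service forwards traffic to
    #   NodePort:    8081/TCP  -> the port opened on each node's IP for this Service
    #
    # Under that reading, a request to kubestage1001.eqiad.wmnet:8081 (the NodePort)
    # would be forwarded to the pod's port 8080 (the TargetPort), which could explain
    # why the container appeared to expose only 8080 while nodePort was set to 8081.
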