[08:58:31] We're going to do our second round of termbox load testing now; give us a shout if that's an issue or you want us to stop :)
[09:35:40] <_joe_> tarrow: ack!
[09:36:43] seems to be going fine right now. We've discovered a couple of very big entities are timing out, but we'd probably expect that
[09:37:29] <_joe_> do you need to change memory/cpu limits? as in, do you see resource starvation from grafana dashboards when you try to render those?
[09:38:26] <_joe_> (also probably worth adding such a large entity to any test suite you have, in order to be able to measure if you can finish the processing part in time)
[09:48:24] I believe the processing is fine; it's getting it from mediawiki that times out
[09:49:49] what amount of cpu saturation would be "too much"?
[09:58:53] <_joe_> 100%
[09:58:55] <_joe_> :P
[09:59:14] <_joe_> but yeah, if it's fetching from mediawiki that times out
[09:59:26] <_joe_> that's, uhm, sad :P
[09:59:38] <_joe_> have you tried the same request to the API using curl?
[09:59:45] <_joe_> I can do it for you in case
[10:13:39] <_joe_> urandom: I realized that for the latency bimodal distribution issue, we could try to record a perf map of kask running to see if that gives us any further information
[11:09:09] _joe_: yeah, we looked at the requests that it was timing out for and they were particularly large entities. we had 4 cases in some ten thousand requests, so it's probably not a huge issue
[11:09:56] but we might look into ways to optimize that, e.g. requesting only the data that we really need as opposed to always requesting the whole entity
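A minimal sketch of that "request only the data we really need" idea, assuming the standard Wikibase wbgetentities module with its props and languages parameters; the entity ID, languages, and endpoint are example values, and this is not necessarily how the termbox service itself fetches entities.

```python
#!/usr/bin/env python3
"""Sketch: fetch only the term data (labels/descriptions/aliases) for one
entity instead of the whole thing, via the MediaWiki Action API's
wbgetentities module. Entity ID, languages and endpoint are example values."""
import requests

API = "https://www.wikidata.org/w/api.php"  # example endpoint

def fetch_terms(entity_id, languages):
    """Return just the term data for one entity, skipping claims/sitelinks."""
    resp = requests.get(API, params={
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "labels|descriptions|aliases",  # omit claims and sitelinks
        "languages": "|".join(languages),
        "format": "json",
    }, timeout=10)
    resp.raise_for_status()
    return resp.json()["entities"][entity_id]

if __name__ == "__main__":
    terms = fetch_terms("Q42", ["en", "de"])
    print(terms["labels"]["en"]["value"])
```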
[14:40:41] _joe_: yup; we need something that'll show hotspots from among the syscalls
[14:40:55] _joe_: I'm just not sure how to approach that in k8s
[14:41:00] or in a container in general
[14:41:16] <_joe_> urandom: a container is just a process running in a few namespaces
[14:41:18] <_joe_> more or less
[14:41:31] <_joe_> so I can just strace -c the process from the host
[14:41:39] <_joe_> or run perf record
[14:42:04] _joe_: sure, maybe what I should have said was: I'm just not sure how *I* would approach that for a service running in k8s
[14:42:06] :)
[14:42:26] <_joe_> urandom: we can coordinate tests I guess, but not before next week
[14:42:50] <_joe_> urandom: I suspect there is something going on with networking
[14:42:54] yeah
[14:43:01] <_joe_> but that's just a wild wild guess
[14:43:03] I suspect that as well
[14:43:45] something is blocking, and given the timing, networking seems the likeliest culprit
[14:44:05] IOW, I don't think it's that wild of a guess :)
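A sketch of what "strace -c or perf record the process from the host" can look like for a container on a k8s node, assuming a Docker container runtime and perf/strace installed on the host; the container name filter "kask" is only an example.

```python
#!/usr/bin/env python3
"""Sketch: profile a containerized service (e.g. kask) from the k8s node.
A container is just a process in a few namespaces, so once we know its host
PID we can point perf record (or strace -c) at it directly.
Assumes a Docker container runtime and perf installed on the node."""
import json
import subprocess

def container_pid(name_filter):
    """Host PID of the first container whose name matches name_filter."""
    cid = subprocess.check_output(
        ["docker", "ps", "-q", "--filter", "name=" + name_filter],
        text=True).split()[0]
    inspect = json.loads(
        subprocess.check_output(["docker", "inspect", cid], text=True))
    return inspect[0]["State"]["Pid"]

def perf_record(pid, seconds=30):
    """Record a call-graph profile of the process for a short window."""
    subprocess.run(
        ["perf", "record", "-g", "-p", str(pid), "--", "sleep", str(seconds)],
        check=True)
    # For a syscall-time summary instead, something like:
    #   strace -c -f -p <pid>   (stop it with Ctrl-C after a while)

if __name__ == "__main__":
    pid = container_pid("kask")  # example container name filter
    print("profiling host PID", pid)
    perf_record(pid)
    # inspect the results afterwards with: perf report
```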
[15:30:18] serviceops, Operations, WMF-Legal: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (BBlack) Open→Stalled Just stalling this so that anyone following it doesn't try to pick this up or move with it yet. There's an ongoing ema...
[15:36:48] _joe_: how close are we to being able to run one-off jobs we want scraped by Prometheus on Kubernetes?
[15:37:43] <_joe_> cdanis: you mean running scheduled tasks on k8s?
[15:37:57] this would be long-running
[15:38:32] <_joe_> cdanis: what do you want to do?
[15:38:34] <_joe_> :P
[15:38:35] there's no obviously-correct host on which to run a "conftool drain state exporter", so I half-jokingly suggested running such a thing on k8s
[15:38:58] <_joe_> oh I see
[15:39:24] (but if it is going to involve anything too involved, e.g. setting up a new LVS service, I'd rather not)
[15:39:24] <_joe_> well it can be run on k8s for sure, I'm not sure how well it adapts to our pipeline
[15:39:46] <_joe_> no, you just need, basically
[15:39:48] <_joe_> a docker image
[15:39:52] <_joe_> and a deployment chart
[15:40:02] <_joe_> which I guess is going to be pretty simple, right
[15:40:05] I'm tempted to do so even if just for my own edification
[15:40:09] it should be something like 30 lines of Python
[15:40:17] <_joe_> yeah
[15:40:21] <_joe_> otoh
[15:40:24] <_joe_> I was wondering
[15:40:55] is python even allowed in k8s? :-P
[15:40:56] <_joe_> it would make sense to run this on all the etcd nodes, and progressively add new APIs to this :P
[15:41:03] <_joe_> volans: sure, why not?
[15:41:05] I thought you had to write it all in Go by policy :D
[15:41:28] <_joe_> I mean we run our services that are mostly nodejs right
[15:41:29] _joe_: or just the cluster-management nodes, one per main DC is totally fine
[15:42:00] the thing is that for these metrics, which are centralized, ideally we'd need only one target for prometheus, and that target should be HA
[15:42:11] so a k8s endpoint actually makes sense
[15:42:13] in this case
[15:42:15] ehhh having multiple is fine
[15:42:43] <_joe_> volans: I'm thinking of my old plan to supplant etcd with a conftool api :P
[15:42:46] cdanis: what if they differ? for a maintenance on the passive etcd cluster?
[15:43:18] volans: that's in general a hard problem, sure
[15:44:28] I guess that's an argument for putting it on the etcd hosts actually volans, easier to coordinate maintenances then ;)
[15:44:40] but harder to use it, no?
[15:45:23] my view is basically that the etcd data is offered from an API that is HA and consistent (more or less :-) ), and as such there should be only one set of metrics in prometheus derived from its data
[15:45:33] coming always from the active data
[15:45:54] without the need to add additional logic when querying this data based on pooled/depooled state and what not
[15:45:58] this is funny because I'm used to a world where such jobs can take minutes to reschedule, so having only one replica on k8s isn't "HA" ;)
[15:46:21] why must there be only one?
[15:46:36] sorry, I thought you said only one
[15:46:38] the endpoint for prometheus must be 1; if behind that there are N instances
[15:46:41] who cares
[15:46:43] they are clients
[15:47:00] <_joe_> then you need lvs basically
[15:47:02] if we do that in the VM/physical world I would do that with an LB in front
[15:47:10] ehhh
[15:47:17] <_joe_> if you want the outside world to talk to it
[15:47:20] I think we're overengineering this now
[15:47:30] it's a service metric, not a host metric ;)
[15:47:35] <_joe_> I thought you planned to use the prometheus scraping of k8s to get it
[15:48:04] that's what I was thinking _joe_, and possibly the service itself can detect if it is talking to a stale etcd (or talk to all of them)
[15:49:55] <_joe_> anyways, I'm trying to figure out why I can't remove a specific dir on deploy1001
[15:50:12] <_joe_> I can't deal with your high level fancy designs now
[15:50:21] sudo
[15:50:22] lol
[15:50:33] looks like Urbanecm is having some deploy1001 perms troubles as well
[15:50:36] judging by #-operations
[16:04:10] serviceops, Scap, Release-Engineering-Team-TODO (201908): `scap clean --delete 1.34.0-wmf.13` fails with `Permission denied` - https://phabricator.wikimedia.org/T230802 (Jdforrester-WMF) Open→Resolved a: Joe Fixed by @Joe.
[16:15:24] serviceops, Scap, Release-Engineering-Team-TODO (201908): `scap clean --delete 1.34.0-wmf.13` fails with `Permission denied` - https://phabricator.wikimedia.org/T230802 (zeljkofilipin) Thanks!
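Returning to the "conftool drain state exporter" discussed above between 15:36 and 15:48: a rough sketch of what those ~30 lines of Python could look like. prometheus_client is used as documented; the confctl invocation, its assumed output format, the metric name, and the port are guesses that would need checking against the real conftool setup.

```python
#!/usr/bin/env python3
"""Sketch of a "conftool drain state exporter": count depooled objects per
cluster/service and expose them for Prometheus. The confctl command and its
assumed output format (one JSON object per line, with a "tags" field and one
host entry carrying a "pooled" value) are unverified assumptions."""
import json
import subprocess
import time

from prometheus_client import Gauge, start_http_server

DEPOOLED = Gauge(
    "conftool_depooled_objects",
    "Objects currently depooled according to conftool",
    ["cluster", "service"],
)

def scrape():
    """Refresh the per-(cluster, service) depooled-object counts."""
    # Hypothetical call: dump every node object, one JSON document per line.
    out = subprocess.check_output(
        ["confctl", "select", "name=.*", "get"], text=True)
    counts = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        tags = dict(kv.split("=", 1)
                    for kv in obj.get("tags", "").split(",") if "=" in kv)
        node = next((v for k, v in obj.items() if k != "tags"), {})
        if node.get("pooled") != "yes":
            key = (tags.get("cluster", "unknown"), tags.get("service", "unknown"))
            counts[key] = counts.get(key, 0) + 1
    for (cluster, service), n in counts.items():
        DEPOOLED.labels(cluster=cluster, service=service).set(n)

if __name__ == "__main__":
    start_http_server(9126)  # arbitrary example port for Prometheus to scrape
    while True:
        scrape()
        time.sleep(30)
```

Packaging this for k8s would then just be, as noted above, a Docker image plus a simple deployment chart, with Prometheus picking it up via its normal k8s service discovery.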