[08:58:31] We're going to do our second round of termbox load testing now; give us a shout if that's an issue or you want us to stop :)
[09:35:40] <_joe_> tarrow: ack!
[09:36:43] seems to be going fine right now. We've discovered a couple of very big entities are timing out, but we'd probably expect that
[09:37:29] <_joe_> do you need to change memory/cpu limits? as in, do you see resource starvation from grafana dashboards when you try to render those?
[09:38:26] <_joe_> (also probably worth adding such a large entity to any test suite you have, in order to be able to measure if you can finish the processing part in time)
[09:48:24] I believe the processing is fine; it's getting it from mediawiki that times out
[09:49:49] what amount of cpu saturation would be "too much"?
[09:58:53] <_joe_> 100%
[09:58:55] <_joe_> :P
[09:59:14] <_joe_> but yeah, if it's fetching from mediawiki that times out
[09:59:26] <_joe_> that's, uhm, sad :P
[09:59:38] <_joe_> have you tried the same request to the API using curl?
[09:59:45] <_joe_> I can do it for you in case
[10:13:39] <_joe_> urandom: I realized that for the latency bimodal distribution issue, we could try to record a perf map of kask running to see if that gives us any further information
[11:09:09] _joe_: yeah, we looked at the requests that it was timing out for and they were particularly large entities. we had 4 cases in some ten thousand requests, so it's probably not a huge issue
[11:09:56] but we might look into ways to optimize that, e.g. requesting only the data that we really need as opposed to always requesting the whole entity
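A minimal sketch of that "request only the data we really need" idea, assuming the standard Wikibase wbgetentities module with its props and languages parameters; the entity ID, languages, and endpoint are example values, and this is not necessarily how the termbox service itself fetches entities.

```python
#!/usr/bin/env python3
"""Sketch: fetch only the term data (labels/descriptions/aliases) for one
entity instead of the whole thing, via the MediaWiki Action API's
wbgetentities module. Entity ID, languages and endpoint are example values."""
import requests

API = "https://www.wikidata.org/w/api.php"  # example endpoint

def fetch_terms(entity_id, languages):
    """Return just the term data for one entity, skipping claims/sitelinks."""
    resp = requests.get(API, params={
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "labels|descriptions|aliases",  # omit claims and sitelinks
        "languages": "|".join(languages),
        "format": "json",
    }, timeout=10)
    resp.raise_for_status()
    return resp.json()["entities"][entity_id]

if __name__ == "__main__":
    terms = fetch_terms("Q42", ["en", "de"])
    print(terms["labels"]["en"]["value"])
```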
[14:40:41] _joe_: yup; we need something that'll show hotspots from among the syscalls
[14:40:55] _joe_: I'm just not sure how to approach that in k8s
[14:41:00] or in a container in general
[14:41:16] <_joe_> urandom: a container is just a process running in a few namespaces
[14:41:18] <_joe_> more or less
[14:41:31] <_joe_> so I can just strace -c the process from the host
[14:41:39] <_joe_> or run perf record
[14:42:04] _joe_: sure, maybe what I should have said was: I'm just not sure how *I* would approach that for a service running in k8s
[14:42:06] :)
[14:42:26] <_joe_> urandom: we can coordinate tests I guess, but not before next week
[14:42:50] <_joe_> urandom: I suspect there is something going on with networking
[14:42:54] yeah
[14:43:01] <_joe_> but that's just a wild wild guess
[14:43:03] I suspect that as well
[14:43:45] something is blocking, and given the timing, networking seems the likeliest culprit
[14:44:05] IOW, I don't think it's that wild of a guess :)
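A sketch of what "strace -c or perf record the process from the host" can look like for a container on a k8s node, assuming a Docker container runtime and perf/strace installed on the host; the container name filter "kask" is only an example.

```python
#!/usr/bin/env python3
"""Sketch: profile a containerized service (e.g. kask) from the k8s node.
A container is just a process in a few namespaces, so once we know its host
PID we can point perf record (or strace -c) at it directly.
Assumes a Docker container runtime and perf installed on the node."""
import json
import subprocess

def container_pid(name_filter):
    """Host PID of the first container whose name matches name_filter."""
    cid = subprocess.check_output(
        ["docker", "ps", "-q", "--filter", "name=" + name_filter],
        text=True).split()[0]
    inspect = json.loads(
        subprocess.check_output(["docker", "inspect", cid], text=True))
    return inspect[0]["State"]["Pid"]

def perf_record(pid, seconds=30):
    """Record a call-graph profile of the process for a short window."""
    subprocess.run(
        ["perf", "record", "-g", "-p", str(pid), "--", "sleep", str(seconds)],
        check=True)
    # For a syscall-time summary instead, something like:
    #   strace -c -f -p <pid>   (stop it with Ctrl-C after a while)

if __name__ == "__main__":
    pid = container_pid("kask")  # example container name filter
    print("profiling host PID", pid)
    perf_record(pid)
    # inspect the results afterwards with: perf report
```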
[15:30:18] serviceops, Operations, WMF-Legal: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (BBlack) Open→Stalled Just stalling this so that anyone following it doesn't try to pick this up or move with it yet. There's an ongoing ema...
[15:36:48] _joe_: how close are we to being able to run one-off jobs we want scraped by Prometheus on Kubernetes?
[15:37:43] <_joe_> cdanis: you mean running scheduled tasks on k8s?
[15:37:57] this would be long-running
[15:38:32] <_joe_> cdanis: what do you want to do?
[15:38:34] <_joe_> :P
[15:38:35] there's no obviously-correct host on which to run a "conftool drain state exporter", so I half-jokingly suggested running such a thing on k8s
[15:38:58] <_joe_> oh I see
[15:39:24] (but if it is going to involve anything too involved, e.g. setting up a new LVS service, I'd rather not)
[15:39:24] <_joe_> well it can be run on k8s for sure, I'm not sure how well it adapts to our pipeline
[15:39:46] <_joe_> no, you just need, basically
[15:39:48] <_joe_> a docker image
[15:39:52] <_joe_> and a deployment chart
[15:40:02] <_joe_> which I guess is going to be pretty simple, right
[15:40:05] I'm tempted to do so even if just for my own edification
[15:40:09] it should be something like 30 lines of Python
[15:40:17] <_joe_> yeah
[15:40:21] <_joe_> otoh
[15:40:24] <_joe_> I was wondering
[15:40:55] is python even allowed in k8s? :-P
[15:40:56] <_joe_> it would make sense to run this on all the etcd nodes, and progressively add new APIs to this :P
[15:41:03] <_joe_> volans: sure, why not?
[15:41:05] I thought you had to write it all in Go by policy :D
[15:41:28] <_joe_> I mean we run our services that are mostly nodejs right
[15:41:29] _joe_: or just the cluster-management nodes, one per main DC is totally fine
[15:42:00] the thing is that for these metrics, which are centralized, ideally we'd need only one target for prometheus, and that target should be HA
[15:42:11] so a k8s endpoint actually makes sense
[15:42:13] in this case
[15:42:15] ehhh having multiple is fine
[15:42:43] <_joe_> volans: I'm thinking of my old plan to supplant etcd with a conftool api :P
[15:42:46] cdanis: what if they differ? for a maintenance on the passive etcd cluster?
[15:43:18] volans: that's in general a hard problem, sure
[15:44:28] I guess that's an argument for putting it on the etcd hosts actually volans, easier to coordinate maintenances then ;)
[15:44:40] but harder to use it, no?
[15:45:23] my view is basically that the etcd data is offered from an API that is HA and consistent (more or less :-) ), and as such there should be only one set of metrics in prometheus derived from its data
[15:45:33] coming always from the active data
[15:45:54] without the need to add additional logic when querying this data based on pooled/depooled state and what not
[15:45:58] this is funny because I'm used to a world where such jobs can take minutes to reschedule, so having only one replica on k8s isn't "HA" ;)
[15:46:21] why must there be only one?
[15:46:36] sorry, I thought you said only one
[15:46:38] the endpoint for prometheus must be 1; if behind that there are N instances
[15:46:41] who cares
[15:46:43] they are clients
[15:47:00] <_joe_> then you need lvs basically
[15:47:02] if we do that in the VM/physical world I would do that with an LB in front
[15:47:10] ehhh
[15:47:17] <_joe_> if you want the outside world to talk to it
[15:47:20] I think we're overengineering this now
[15:47:30] it's a service metric, not a host metric ;)
[15:47:35] <_joe_> I thought you planned to use the prometheus scraping of k8s to get it
[15:48:04] that's what I was thinking _joe_, and possibly the service itself can detect if it is talking to a stale etcd (or talk to all of them)
[15:49:55] <_joe_> anyways, I'm trying to figure out why I can't remove a specific dir on deploy1001
[15:50:12] <_joe_> I can't deal with your high level fancy designs now
[15:50:21] sudo
[15:50:22] lol
[15:50:33] looks like Urbanecm is having some deploy1001 perms troubles as well
[15:50:36] judging by #-operations
[16:04:10] serviceops, Scap, Release-Engineering-Team-TODO (201908): `scap clean --delete 1.34.0-wmf.13` fails with `Permission denied` - https://phabricator.wikimedia.org/T230802 (Jdforrester-WMF) Open→Resolved a: Joe Fixed by @Joe.
[16:15:24] serviceops, Scap, Release-Engineering-Team-TODO (201908): `scap clean --delete 1.34.0-wmf.13` fails with `Permission denied` - https://phabricator.wikimedia.org/T230802 (zeljkofilipin) Thanks!
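Returning to the "conftool drain state exporter" discussed above between 15:36 and 15:48: a rough sketch of what those ~30 lines of Python could look like. prometheus_client is used as documented; the confctl invocation, its assumed output format, the metric name, and the port are guesses that would need checking against the real conftool setup.

```python
#!/usr/bin/env python3
"""Sketch of a "conftool drain state exporter": count depooled objects per
cluster/service and expose them for Prometheus. The confctl command and its
assumed output format (one JSON object per line, with a "tags" field and one
host entry carrying a "pooled" value) are unverified assumptions."""
import json
import subprocess
import time

from prometheus_client import Gauge, start_http_server

DEPOOLED = Gauge(
    "conftool_depooled_objects",
    "Objects currently depooled according to conftool",
    ["cluster", "service"],
)

def scrape():
    """Refresh the per-(cluster, service) depooled-object counts."""
    # Hypothetical call: dump every node object, one JSON document per line.
    out = subprocess.check_output(
        ["confctl", "select", "name=.*", "get"], text=True)
    counts = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        tags = dict(kv.split("=", 1)
                    for kv in obj.get("tags", "").split(",") if "=" in kv)
        node = next((v for k, v in obj.items() if k != "tags"), {})
        if node.get("pooled") != "yes":
            key = (tags.get("cluster", "unknown"), tags.get("service", "unknown"))
            counts[key] = counts.get(key, 0) + 1
    for (cluster, service), n in counts.items():
        DEPOOLED.labels(cluster=cluster, service=service).set(n)

if __name__ == "__main__":
    start_http_server(9126)  # arbitrary example port for Prometheus to scrape
    while True:
        scrape()
        time.sleep(30)
```

Packaging this for k8s would then just be, as noted above, a Docker image plus a simple deployment chart, with Prometheus picking it up via its normal k8s service discovery.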