[10:01:35] 10serviceops, 10Performance-Team (Radar): Reconsider memcached connection method for MW in PHP7 world - https://phabricator.wikimedia.org/T235216 (10Joe) Mcrouter can't be configured to listen both on a unix socket and on a TCP port. This means, apart from how cumbersome the change is going to be if we want to...
[10:52:41] 10serviceops, 10Operations, 10Puppet, 10User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. - https://phabricator.wikimedia.org/T237362 (10Joe) Good news is we only need to do a rolling restart in eqiad, not in codfw, where we still don't use the ca for peer connections
[11:19:54] <_joe_> akosiaris: do we really need proxyfetch in pybal for services on k8s?
[11:19:58] <_joe_> I think it's wrong
[11:20:39] <_joe_> we just had a slowdown on one of the eventgates, and all kube hosts were seen as down at the same time
[11:21:28] how would that differ if it wasn't in k8s?
[11:23:01] you can ask if we need ProxyFetch overall for some services, but I don't see specifically how kubernetes changes this.
[11:23:19] what would you use instead?
[11:23:31] just IdleConnection?
[11:23:47] ugh
[11:23:54] is there no internal url or something pybal can check?
[11:24:01] or an always-up service ;p
[11:24:04] just doing 200 OK
[11:24:04] tbh, overall I would just let the health check of kubernetes do the needful
[11:24:18] and not rely on pybal at all
[11:24:35] kube-proxy already knows if a pod is down or not
[11:24:42] and will not route requests to it
[11:24:54] so what remains is for pybal to know if a node is down or not
[11:24:56] what is pybal load balancing here even? multiple kube proxies?
[11:25:08] indeed
[11:25:19] essentially yes, it's doing things for no reason
[11:25:34] well it's monitoring the wrong thing
[11:25:45] monitor the uptime of the proxy, not the service behind it :)
[11:26:02] s/proxy/node/ but yeah that's the point
[11:26:07] the proxy is just DNAT rules ;-)
[11:26:11] ok
[11:26:19] but how do we change that in pybal?
[11:26:35] I mean how do we tell pybal to query a different port than the one the service is on?
[11:26:43] you can't
[11:26:50] exactly :D
[11:27:07] well you can use RunCommand, but...
[11:28:46] funnily enough both the kube-proxy and kubelet components are ready
[11:28:54] curl http://kubernetes1001.eqiad.wmnet:10255/healthz
[11:28:54] ok
[11:28:55] why do we have pybal in the path anyway?
[11:29:02] and curl http://kubernetes1001.eqiad.wmnet:10249/healthz
[11:29:02] ok
[11:29:04] can't varnish just contact and monitor those nodes directly?
[11:29:24] or ATS
[11:29:35] ah, that's a good question. Mostly to maintain the status quo of having HA services behind pybal
[11:29:46] but we are at a point now where it's probably a good time to question that
[11:29:48] seems like there's really no point to that anymore
[11:30:00] and a significant obstacle towards self-service?
[11:30:32] yes, but there is something that pybal does that we don't currently have
[11:30:41] and that's to advertise the VIP to the routers
[11:31:00] that wouldn't be hard to substitute
[11:31:08] if a node can just advertise a service ip whenever it's up...
[11:31:19] doesn't calico do bgp already anyway? ;)
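A note on the "calico already does bgp" point: calico-node peers with the network over BGP for pod routes, and newer Calico releases can also announce Kubernetes service IPs over those same sessions via the BGPConfiguration resource. The sketch below is illustrative only and assumes Calico v3.x with BGP peering to the routers; the ASN and the /24 service range are placeholders (the "special range" idea comes up again at ~11:49):

```yaml
# Minimal sketch, not the production config: have every calico-node
# advertise the whole service CIDR over its existing BGP sessions, so the
# routers can ECMP traffic for the service VIPs across the k8s nodes.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: false   # assumption: nodes peer with the routers, not a full mesh
  asNumber: 64601                # placeholder private ASN
  serviceClusterIPs:
  - cidr: 10.66.0.0/24           # hypothetical "special range" for k8s service IPs
```

With every node announcing the same prefix, the routers see equal-cost next hops, which is essentially the anycast/ECMP point made later in the log (~12:27).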
[11:31:31] yeah, the hard part is to add extra config to it
[11:31:46] but we can try that
[11:31:47] i suppose you could use any other bgp daemon but then they may conflict ;)
[11:31:49] we have a task about that
[11:32:02] yeah multiple bgp daemons won't really work
[11:32:23] 179/tcp must be used across both nodes and I don't even know if on junipers that's configurable
[11:32:34] it could be passive
[11:32:40] so the node contacting the router and not vice versa
[11:32:49] ah, so not two-way? hmmm
[11:32:49] source port doesn't need to be 179
[11:32:56] but still
[11:33:02] the router would get confused, because same ip
[11:33:07] we have a task about all this btw
[11:33:43] what are you saying, 'please go ahead?' ;)
[11:34:10] that we need to write down more info about that
[11:34:20] it's still in the draft stage
[11:34:25] happy to help
[11:36:13] so the bgp announcement would really only be for other services internally wanting to talk to the k8s service
[11:36:20] because varnish/ATS shouldn't need that
[11:37:49] 10serviceops, 10Operations, 10Pybal, 10SRE-tools, 10Traffic: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10akosiaris) >>! In T239392#5701649, @akosiaris wrote: > `need to be able to understand the...
[11:38:51] mark: https://phabricator.wikimedia.org/T238909
[11:40:32] it's more high-level than pybal health checks, because at the end of the day, maybe hiding all the services behind an ingress solves most of our problems as well
[11:43:24] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10mark) I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balanc...
[11:43:57] lmk if I got any of that wrong
[11:48:58] i see people doing it by simply changing the bird config template
[11:49:21] and if we make it so that k8s service ips are in a special range, a /24 or whatever
[11:49:26] then we don't have to do it per service either
[11:49:33] although that could have benefits I suppose
[11:49:49] because otherwise by definition one of the nodes is responsible for all traffic and the others are standby
[11:50:06] * mark goes back to his real job
[11:59:42] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) >>! In T238909#5727644, @mark wrote: > I agree - it seems that PyBal adds no real value here, because it's es...
[12:08:18] 10serviceops, 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) >>! In T238909#5727698, @mark wrote: >>>! In T238909#5727693, @akosiaris wrote: > >> True. We could investig...
[12:27:24] akosiaris: yeah, anycast & ECMP are basically the same thing here
[12:37:48] indeed
[14:55:02] 10serviceops, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10jlinehan) >>! In T236386#5725166, @Ottomata wrote: > @jlinehan thoughts? I'm considering moving forward with intake-{analyt...
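The "same checks are done by k8s" rationale that _joe_ spells out just below (~15:17) lives in each chart's pod spec as readiness/liveness probes: kube-proxy only DNATs traffic to pods whose readiness probe currently passes. A minimal, hypothetical pod-spec fragment; the image, port and probe path are made-up stand-ins, not the real eventgate chart values:

```yaml
# Sketch of the per-pod health checking k8s already does. A pod that fails
# its readinessProbe is dropped from the Service endpoints, so kube-proxy
# stops routing to it without any involvement from pybal.
containers:
- name: eventgate
  image: docker-registry.wikimedia.org/eventgate:example   # placeholder tag
  ports:
  - containerPort: 8192                                     # assumed service port
  readinessProbe:
    httpGet:
      path: /_info            # assumption: service-runner style info endpoint
      port: 8192
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:
    tcpSocket:
      port: 8192
    periodSeconds: 30
```

Pybal's ProxyFetch against the service port duplicates this check one layer up, which is why a slow eventgate makes every kubernetes node look down at once.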
[15:14:03] 10serviceops, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) > Changing the URL is easy is not really that easy :/ Possible though. Ooook.
[15:16:36] <_joe_> sorry mark, akosiaris I was afk working with reuven
[15:16:51] <_joe_> so my rationale for not using functional checks from pybal is
[15:17:07] <_joe_> the same checks are done by k8s
[15:17:43] <_joe_> and what we check with pybal is just if the "service" in k8s is up
[15:19:53] <_joe_> the health status of the pods is actually checked by k8s, so what pod gets traffic is not determined by pybal at all
[15:20:33] <_joe_> so as expected when eventgate slows down, all pybal backends show the same behaviour
[15:28:13] _joe_: see the discussion above. pybal assumes (and was written with that in mind) that it checks the service state, but in reality we want to check the node state. It's way more involved than just removing ProxyFetch if we want to solve this correctly
[15:29:14] and as you had already concluded on the task, there's really little point in having pybal in the loop
[15:29:21] and it's a big obstacle towards self-service infra
[15:29:24] so let's get rid of it?
[15:30:24] how does that differ from being an "ingress" btw?
[15:33:08] <_joe_> an ingress gathers all the services behind an L7 proxy
[15:33:21] <_joe_> but you still need to get your traffic to the k8s workers
[15:33:30] yeah, single LVS IP pointing to the ingress
[15:33:38] ingress can be made aware of the pods
[15:33:53] depends on the ingress implementation IIRC
[15:34:12] ingress is just a set of nginx/haproxy/ats instances that do the L7 load balancing
[15:34:45] * _joe_ throws istio to akosiaris
[15:34:56] yeah... ehm no
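For the ingress option discussed at the end, the rough shape would be a single L7 entry point (nginx/haproxy/envoy/whatever the controller uses) holding the one externally reachable VIP, with per-service routing expressed as Ingress objects. A minimal sketch against the upstream Ingress API; the hostname, namespace and port are hypothetical:

```yaml
# Hypothetical routing rule for one service behind a shared ingress VIP.
# Only the ingress controller needs the VIP; pod health is still handled
# by kubernetes readiness checks feeding the Service endpoints.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: eventgate-main
  namespace: eventgate-main
spec:
  rules:
  - host: eventgate-main.svc.eqiad.wmnet      # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: eventgate-main
            port:
              number: 8192                    # assumed service port
```

Whether the controller balances across pods directly or just forwards to the Service depends on the implementation, which is the "depends on the ingress implementation" caveat above.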