[05:16:21] <_joe_> uhm yeah that needs fixing
[06:21:42] <_joe_> mutante: uhm I'm not sure what you did, but now there is no mediawiki::webserver in role::parsoid::testing
[09:08:22] omg, I like helmfile so much right now
[09:08:43] thanks to fser.o it created the namespace and all associated stanzas in like 10s
[09:09:12] and if I did not have a typo in my yaml it would have literally taken that long
[09:10:22] let's see if I can deploy restrouter for the first time in staging for fun
[09:15:01] <_joe_> akosiaris: what did you do to create the ns?
[09:19:29] root@deploy1001:/srv/deployment-charts/helmfile.d/admin/staging# KUBECONFIG=/etc/kubernetes/admin-staging.config HELM_HOME=/etc/helm helmfile -e restrouter sync
[09:19:31] _joe_: ^
[09:19:56] <_joe_> sigh @ the env vars :P
[09:20:10] after merging this https://gerrit.wikimedia.org/r/526719
[09:20:14] <_joe_> and what is in helmfile right now for restrouter?
[09:20:25] <_joe_> lemme go see
[09:20:31] it's split in 2 parts
[09:20:46] the admin stuff (namespace creation, quotas, podsecuritypolicies etc)
[09:20:50] and the actual service stuff
[09:21:01] for the actual service stuff we already hid the env vars somewhat better
[09:21:20] but we are stuck on a bug in the helm diff plugin before we can fully code them into helmfile itself
[09:21:28] we should follow up on it
[09:22:56] <_joe_> ok I found it, restrouter's values file is under
[09:23:00] <_joe_> /srv/deployment-charts/helmfile.d/admin/staging/namespace/values
[09:23:44] <_joe_> neat
[09:23:57] <_joe_> and it also creates the resource quotas and the limitranges
[09:24:01] <_joe_> for that namespace
[09:24:04] yeah, way better than what we had up to now
[09:24:14] <_joe_> and I guess it can also create the calico rules?
[09:24:20] <_joe_> or are those still separated
[09:24:36] still separated
[09:24:39] <_joe_> uhm also
[09:24:40] under admin/calico
[09:24:45] <_joe_> since we're rebuilding the clusters
[09:24:55] but at least now enforced instead of being just informational
[09:25:02] <_joe_> shall I reinstall etcd1004-1006 and 2004-2006 with etcd 3
[09:25:16] <_joe_> and we tell k8s to use etcd3?
[09:25:49] bundle the 2 migrations?
[09:25:55] <_joe_> :P
[09:25:57] I'd rather not. That being said, the exact same process is the one we should follow anyway
[09:26:05] <_joe_> ok, makes sense
[09:26:15] <_joe_> let's try to do it when you're back from vacation then
[09:26:27] if recreating one of the clusters ends up being pain free
[09:26:37] I am all for it being redone for etcd3
[09:26:46] s/for/with/
[09:29:28] yeah, those are private ASes
[09:29:30] it's internal ones
[09:29:42] 7d?
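A rough sketch of how one might double-check what that admin-level sync created, assuming the same admin kubeconfig shown above and that the new namespace is literally named restrouter (both are assumptions, not confirmed in the log; the diff step needs the helm-diff plugin):

    # Preview the admin helmfile's pending changes before applying them
    KUBECONFIG=/etc/kubernetes/admin-staging.config HELM_HOME=/etc/helm \
      helmfile -e restrouter diff

    # Inspect the objects the sync should have created
    export KUBECONFIG=/etc/kubernetes/admin-staging.config
    kubectl get namespace restrouter
    kubectl describe resourcequota -n restrouter
    kubectl get limitrange -n restrouter -o yaml
    kubectl get podsecuritypolicy   # PSPs are cluster-scoped, so no -n here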
[09:29:54] 7d is a lot of time to have those around
[09:30:09] ah, no, scratch that
[09:30:12] my misunderstanding
[09:34:52] <_joe_> akosiaris: btw https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/527487
[09:36:18] <_joe_> in the meantime, we got bad news on the php front
[09:36:33] <_joe_> mw1270 yesterday evening exploded and started responding to requests very slowly
[09:36:46] <_joe_> it was moved to be a full-php appserver the preceding morning
[09:37:09] <_joe_> what happened, AFAICT, is that it filled up its apc capacity and then was trapped in apc GC cycles
[09:37:18] <_joe_> that's ugly though
[09:55:36] _joe_: I'll have a look at the cookbook later
[09:56:43] btw I've been trying to figure out if https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/conftool-data/service/services.yaml#11 is being used anywhere
[09:56:58] and I mean the "port" in every stanza
[09:57:33] can't find anything accessing it specifically, but I may be failing somewhere
[09:57:45] <_joe_> no it's not
[09:57:48] <_joe_> I planned to
[09:58:04] should we remove it?
[09:58:07] <_joe_> but never got around to using that in pybal
[09:58:16] it's mildly confusing
[09:58:16] <_joe_> we might want to remove the whole "services" object
[09:58:28] that was my next question
[09:58:31] aside from the defaults
[09:58:32] <_joe_> sooner or later, I'll get around to it
[09:58:38] what else is used from it?
[09:58:43] <_joe_> it will simplify a few things in conftool a lot
[09:59:05] <_joe_> it's used to determine default values for nodes right now
[09:59:56] <_joe_> I might even give it a shot today
[12:16:04] <_joe_> cdanis: see the discussion above
[12:17:47] I'll have to refresh myself on this part of conftool
[12:18:01] <_joe_> the tldr is I want to remove the default "service" entity, which is useless as of now
[12:18:05] yeah
[12:18:09] <_joe_> it will simplify some code :)
[12:24:55] _joe_: Do you remember the timeout thing I was complaining about yesterday? It was causing some flapping alerts again yesterday evening; do you think we need to do "something" about it or can we keep ignoring them?
[12:25:22] <_joe_> tarrow: still only codfw?
[12:25:27] yes
[12:25:40] but I guess the healthchecks trigger alerts on both
[12:25:50] <_joe_> tarrow: I don't think they should be ignored, but I don't have time to dedicate to them for now
[12:26:52] Of course :). I was wondering if there is a way we can silence the service checks just for one DC or something, to save disturbing people
[12:27:26] tarrow: maybe just the timeouts need to be increased?
[12:27:44] cdanis: maybe, but we want them that low in the "on" data center
[12:30:04] 3s is actually probably a bit high for the service to be useful and we'd hoped to be able to drop it. I guess it's just that high due to the cross-datacenter requests
[12:31:41] cdanis: is such a thing possible with services? have the config swapped when one dc is active vs not?
[12:32:57] I am not sure, but I think you'd be better served here by not trying to reuse the rudimentary blackbox probing that Icinga does as also a latency alert; the latter is better measured IMO with metrics that aggregate over all traffic
[12:33:47] cdanis: it's not really a latency alert; it's just that the service returns a 500 if the request fails
[13:31:20] Thinking about this another way: would we rather not have our service reading from ...discovery..wmnet? would it make more sense for each k8s cluster to read from its local DC?
[13:45:36] doubtful. That would mean that when we do DC failovers we would have to read the service specially, which just means it will be forgotten
[13:45:45] s/read/treat/
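For the discovery-vs-local-DC question above, a small illustrative check of the difference between the two kinds of records; the hostnames below (termbox.discovery.wmnet and the per-DC svc names) are guesses used for illustration only, not names confirmed by the log:

    # A *.discovery.wmnet record is expected to follow whichever DC(s) are active,
    # while the per-DC svc records always answer with their own datacenter.
    dig +short termbox.discovery.wmnet   # where clients are steered right now
    dig +short termbox.svc.eqiad.wmnet   # always eqiad
    dig +short termbox.svc.codfw.wmnet   # always codfw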
[13:46:15] remind me what the issue was again
[13:46:33] termbox failing to read from appservers.discovery.wmnet within the timeout threshold?
[13:50:28] akosiaris: yeah. Would we need to do anything at failover time though? just always point the eqiad release (terminology?) at the eqiad api servers and the codfw one at the codfw ones
[13:52:04] the point of the switchover is to move everything to the secondary DC and have the primary do as little as possible (ideally nothing). That's the only way to make sure that, when needed, it will actually work.
[13:53:12] that is, both DCs' traffic should be served by 1 of them
[13:53:38] otherwise we run the risk of the secondary being over capacity when it is most needed
[14:04:50] akosiaris: Sorry, I'm being slow. Wouldn't that be exactly what would happen if termbox always used its local DC? e.g. MW traffic goes to codfw, termbox.svc.discovery... now points at codfw, and the termbox service requests from api-ro.codfw.
[14:05:55] Or to ask an even stupider question: is the termbox service on codfw already currently "active" and getting traffic from termbox.svc.discovery..?
[14:08:06] because if so I then understand the problem
[14:14:27] termbox on codfw is already active and receiving traffic
[14:14:52] or at least per our configuration able to receive traffic
[14:24:39] <_joe_> akosiaris: but its only client is mediawiki in eqiad
[14:24:47] <_joe_> which then goes to termbox in eqiad
[14:24:51] yep
[14:25:21] <_joe_> the issue is that the service in codfw calls mediawiki in eqiad cross-dc and there seem to be timeouts in doing so
[14:25:33] <_joe_> rather rare timeouts, but timeouts nonetheless
[14:26:06] <_joe_> so if we had only termbox in codfw pooled, the traffic flow would be
[14:26:27] <_joe_> varnish -> mw eqiad -> tb codfw -> mw eqiad and all the way back
[14:26:30] <_joe_> twice across dcs
[14:27:03] I wasn't too worried about the timeouts until I realised that they make the healthcheck fail, which means we needlessly trigger alerts and desensitize ourselves
[14:27:44] <_joe_> I think you need to debug the issue at the application level
[14:27:57] but now that I realise we are actually getting real traffic there, I think this is a bigger problem since we cross the country twice for half the requests
[14:27:59] <_joe_> understand where it's timing out, for instance
[14:28:15] <_joe_> tarrow: there is no real traffic right now to termbox in codfw
[14:28:23] <_joe_> what made you think that's the case?
[14:28:28] _joe_: I'm not sure how to debug more. We log the exact requests that time out
[14:28:42] <_joe_> tarrow: that wasn't clear in the task
[14:28:51] _joe_: Oh? I thought akosiaris just said "termbox on codfw is already active and receiving traffic"
[14:29:03] <_joe_> akosiaris> or at least per our configuration able to receive traffic
[14:29:17] <_joe_> if mediawiki reaches it via .discovery.wmnet (and it should)
[14:29:32] <_joe_> then mediawiki will reach the nearest termbox
[14:29:46] <_joe_> in eqiad, where all of mw's requests are served, the nearest termbox is eqiad
[14:29:58] right
[14:30:43] that is what I thought was how it worked; so if that's the case I'm now back to being less worried
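On the "understand where it's timing out" suggestion, one low-effort sketch is curl's write-out timers run from a codfw host; the URL is a placeholder (not the real endpoint termbox calls), and a Host header may also be needed depending on how the appservers are actually reached:

    # Break one request down into DNS / TCP connect / TLS / first-byte / total time,
    # with the same 3s budget the healthcheck uses.
    curl -s -o /dev/null --max-time 3 \
      -w 'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
      https://api-ro.discovery.wmnet/w/api.php   # placeholder endpoint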
[15:58:05] _joe_: thanks for the change.. i wanted to ask you if we need php_restarts on scandium and also to tell you where i found the LVS comes from .. i see you already fixed it :)
[15:59:06] oh.. and the DSH thing as well.. so it's not conftool and not files, but the .yaml
[19:31:22] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (10Mholloway) Patching up a few crashers (captured in T229521 and T229630) looks to ha...
[21:03:53] gerrit2001 is now a working slave for the first time ever, since we got the needed dbproxy from the DBAs, which unblocked it
[21:04:42] after a few more fixes the gerrit service, its httpd and its sshd are now up as 'gerrit-slave.wikimedia.org'
[21:05:05] (404 expected, though the plugin URLs seem to work && cloning too)
[21:05:09] so f.e. https://gerrit-slave.wikimedia.org/r/plugins/gitiles/operations/puppet/ or ssh to it on 29418
[21:05:31] if something were to happen to the master we should now be able to promote the slave to a new master
[21:06:41] lunch &
[23:55:10] re scandium: but i do see "include ::profile::mediawiki::webserver" in parsoid::testing
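Re the gerrit-slave bring-up above, a quick sanity-check sketch of the two read paths mentioned (HTTPS cloning and ssh on 29418); the repository path is only an example and the slave is read-only:

    # Clone over HTTPS from the slave (example repo path)
    git clone https://gerrit-slave.wikimedia.org/r/operations/puppet

    # Confirm Gerrit's sshd answers on 29418 and reports its version
    ssh -p 29418 gerrit-slave.wikimedia.org gerrit version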