[09:43:55] We're thinking of smoke testing the termbox service on eqiad later today as pretty much our last step before going live. We'll be staying under 5 requests/s so we expect if anything were to fall over it will just be our service. The ticket (and patches with the test code) are https://phabricator.wikimedia.org/T229907. I'll also mention something in #-operations just before we start
[09:52:57] <_joe_> tarrow: before doing so give us a voice?
[09:53:11] <_joe_> we're in the process of rebuilding the eqiad k8s cluster today :)
[09:53:17] sure!
[09:53:31] any times we should aim for / avoid?
[09:53:36] <_joe_> fsero: about that...
[09:53:46] <_joe_> fsero: let's start in ~ 30 minutes?
[09:53:53] sure
[09:54:02] <_joe_> tarrow: just ask, in general we should be done pretty quickly, but one never knows
[09:54:27] ok! We're probably not going to be ready ourselves until after lunch anyway
[09:54:44] <_joe_> fsero: in eqiad, I'd depool services from varnish immediately, but I'd depool the individual services just before recreating them
[09:54:55] <_joe_> if we're going to tear down/up a namespace
[09:55:00] k
[09:55:02] <_joe_> I mean namespace by namespace
[09:55:10] yeah i get you :)
[09:55:28] lets start with zotero
[09:55:33] it always brings luck
[09:57:03] <_joe_> sure
[09:57:11] <_joe_> so we start with zotero at 12:30
[09:57:25] <_joe_> do we need to do admin parts first?
[10:01:49] the only problematic one is calico
[10:01:54] so lets do it at 12:30
[10:02:25] <_joe_> that needs a full depool I guess
[10:02:52] not really, as long as you dont reboot a node the configuration will be kept in memory
[10:02:58] and recreation is fast
[10:03:16] i didnt see any error while recreating it
[10:03:17] <_joe_> yeah but if something is wrong, we create a service interruption
[10:03:44] yep but it is the same config applied in codfw and staging
[10:03:53] anyhow
[10:03:59] we can depool everything to be safe
[10:04:05] lets do the varnish part first
[10:04:11] <_joe_> yes
[10:05:51] _joe_: should I wait for you guys to finish up before pushing more php7 traffic
[10:06:00] or can I go ahead?
[10:06:53] <_joe_> uhm
[10:07:04] <_joe_> you should probably either do it *right now* or wait
[10:07:13] ok I will wait
[10:07:18] too many variables at the same time
[10:07:36] I will break for lunch earlier then
[10:12:13] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/529927 review?
[10:12:28] <_joe_> fsero: just gave +1
[10:13:11] <_joe_> fsero: do you have a depool "script" for depooling all services in a k8s cluster?
[10:14:36] sudo confctl --object-type=discovery select 'dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad' set/pooled=no
[10:14:46] thats my "script"
[10:15:16] ideally kubernetes discovery entries should have a cluster variable
[10:15:38] so we can do k8s-cluster=eqiad and depool all services
[10:15:50] but this is not there now
[10:16:19] <_joe_> yeah
[10:17:07] so.. shall i merge?
[10:17:41] <_joe_> sure, go on
[10:19:06] lets leave cxserver and blubber to be the last ones
[10:19:15] that way i dont need to run puppet on the cache nodes
[10:19:36] <_joe_> ok!
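For reference, fsero's confctl one-liner above can be wrapped into a small helper so the same selector is reused for depooling and repooling. This is only a sketch based on the command quoted above; the script itself and its yes/no argument are illustrative, not an existing tool.

    #!/bin/bash
    # Sketch of a pool/depool helper for the k8s-hosted services in eqiad.
    # Usage: ./k8s-eqiad-pool.sh no   (depool)   |   ./k8s-eqiad-pool.sh yes   (repool)
    set -euo pipefail
    POOLED="${1:?usage: $0 yes|no}"
    # Service list copied from the session above; there is no k8s-cluster selector
    # in conftool yet, so the services have to be enumerated by hand.
    SERVICES='sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero'
    sudo confctl --object-type=discovery \
        select "dnsdisc=${SERVICES},name=eqiad" "set/pooled=${POOLED}"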
[10:19:49] <_joe_> let's keep an eye on icinga, both of us
[10:20:26] _joe_: if you want to do it, go to deploy1001 and cd into /srv/deployment-charts/helmfile.d/admin/eqiad
[10:20:30] and source .hfenv
[10:20:53] then we should depool all services in the eqiad cluster, and after that recreate calico
[10:21:01] <_joe_> fsero: I have a tmux open under my user
[10:21:06] <_joe_> if you want to join :P
[10:21:11] ok
[10:22:27] lets depool services
[10:22:30] doing it now
[10:22:36] <_joe_> ack
[10:23:46] lets downtime LVS..
[10:24:44] <_joe_> nah :P
[10:25:23] <_joe_> we'll do that once it's time for each service
[10:25:31] <_joe_> we might flood people with alerts now, but it's fine
[10:26:12] i dont want to hear brandon complain again :P
[10:26:24] <_joe_> he's still not getting paged :P
[10:26:24] ok, so delete the calico deploy and configmap
[10:26:40] <_joe_> via kubectl manually?
[10:27:05] first time yep
[10:27:22] https://www.irccloud.com/pastebin/F8Hc2soV/
[10:28:16] kubectl delete deploy calico-policy-controller
[10:28:26] and later kubectl delete cm wmf-default-policy
[10:28:32] with the -n kube-system of course
[10:28:36] <_joe_> yeah, that, the config map, then helmfile apply?
[10:29:48] yep, given calico is a template it's easier with ./apply-calico-policy.sh diff and ./apply-calico-policy.sh apply
[10:30:02] <_joe_> ok
[10:30:09] <_joe_> should I first run the diff command?
[10:30:16] <_joe_> just to be sure?
[10:30:22] running diff is always recommended
[10:30:27] it's a sanity check :)
[10:30:47] <_joe_> ok running now
[10:31:00] remember that if something is not managed via helm it will appear as not present
[10:31:02] <_joe_> heh ofc
[10:31:05] <_joe_> yeah
[10:31:10] but then when helm tries to apply it, it will fail because it's there
[10:31:14] <_joe_> I was about to say
[10:31:27] <_joe_> ok, do the removal then
[10:33:28] <_joe_> all looks ok, but let's wait for confirmation from icinga and co
[10:34:33] looks ok from my end
[10:35:02] <_joe_> ok
[10:35:14] <_joe_> I'm not sure what to test at this point
[10:35:36] if you want to test calico, you can create a test deployment with wmfdebug and do a curl
[10:35:53] but given that the pod is running and healthy i would say everything should be working
[10:36:30] lets move to rbac, podsecpolicies and coredns
[10:36:48] <_joe_> ok
[10:37:22] you need to delete the existing deploy clusterrole
[10:37:26] <_joe_> rbac, should I just try a helmfile diff?
[10:37:33] because roles are not updateable
[10:37:38] <_joe_> oh sigh
[10:37:41] sure, you can always do a diff
[10:37:45] <_joe_> what will that mean
[10:38:31] kubectl delete clusterrole deploy and then helmfile apply
[10:38:41] <_joe_> sure I wanted to make sure of the name
[10:40:41] podsecpolicies are not present
[10:40:45] so you can just do a diff and apply
[10:40:53] also kubectl get psp :)
[10:40:54] <_joe_> oh lol
[10:40:56] <_joe_> ahahahah
[10:41:05] <_joe_> yeah just got that from google
[10:41:35] that EOF error is because our etcd doesnt like list operations so much
[10:41:39] it's worrisome
[10:41:45] but if you retry it will work
[10:41:48] <_joe_> what do you mean?
[10:42:12] <_joe_> ok now I see
[10:42:14] <_joe_> lol
[10:42:21] i mean that sometimes when doing a helm list operation, tiller will take more than 5 seconds and the helm cli will timeout
[10:42:22] <_joe_> we need to upgrade to etcd3
[10:42:26] and the timeout message is EOF
[10:42:31] which is not convenient
[10:42:48] <_joe_> is coredns active?
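Putting the calico steps above together, the sequence looks roughly like this. It is only a sketch assembled from the commands quoted in the conversation, run from the admin helmfile directory on deploy1001 after sourcing .hfenv; it is not an official runbook.

    # From /srv/deployment-charts/helmfile.d/admin/eqiad on deploy1001, after `source .hfenv`.
    # 1. Remove the objects that were created manually and are not yet managed by helm:
    kubectl -n kube-system delete deploy calico-policy-controller
    kubectl -n kube-system delete cm wmf-default-policy
    # 2. Sanity-check what would change, then recreate the calico policy from its template:
    ./apply-calico-policy.sh diff
    ./apply-calico-policy.sh apply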
[10:42:51] nope
[10:43:03] <_joe_> so I just have to install it I guess
[10:43:04] it's deployed in codfw and will be deployed now in eqiad by you
[10:43:15] but even when deployed it will only be active after a puppet change
[10:43:34] we need to pass a flag to the kubelets to use it
[10:44:08] <_joe_> so that they try to resolve a specific domain via coredns?
[10:44:40] <_joe_> oh there is an error in the coredns files
[10:44:41] mmm crap i forgot to update the range
[10:44:43] yep
[10:44:45] <_joe_> yeah :P
[10:44:45] lemme fix it
[10:44:48] <_joe_> ok
[10:45:50] <_joe_> it's a good thing k8s detects it and rejects the resource :)
[10:47:17] please retry
[10:47:24] you might find a common error
[10:48:03] Error: UPGRADE FAILED: "foo" has no deployed releases
[10:48:13] if it outputs that, it's expected
[10:48:15] yeah
[10:48:20] run a helm list
[10:48:28] you will see a failed coredns release
[10:48:42] you need to purge it before retrying to install
[10:48:58] you need to purge it because it is the first revision; in a release with more than one revision you can also rollback
[10:49:03] so do a helm del --purge coredns
[10:49:04] <_joe_> helm delete coredns right?
[10:49:10] <_joe_> oh --purge
[10:49:18] purge removes any remnant kubernetes objects
[10:49:27] now retry
[10:49:45] <_joe_> ok ofc now it works
[10:50:04] <_joe_> this only happens when you bootstrap a helm release right?
[10:50:06] yup
[10:50:25] <_joe_> because of the lack of a rollback, makes "sense"
[10:50:46] it could also happen between upgrades: if you perform an upgrade and it fails and it goes unnoticed (because everything keeps working), the next upgrade will also fail
[10:51:05] however in that scenario you either purge or, more probably, rollback to the last working revision
[10:51:31] now it's time to start the namespace dance
[10:51:34] <_joe_> mskrd drndr
[10:51:45] <_joe_> *makes sense
[10:51:50] for NS in $(echo "termbox zotero"); do source .hfenv && kubectl delete ns $NS && helmfile -e $NS apply && pushd ../../services/eqiad/$NS && source .hfenv && sleep 30 && helmfile apply && popd; done
[10:52:06] i did this to delete the namespace, recreate it and also reapply the application
[10:52:42] <_joe_> from the admin directory I guess
[10:52:45] you can do it step by step if you want: first delete the namespace, then create it via helmfile, and after that apply the service helmfile
[10:52:45] <_joe_> makes sense
[10:52:47] yep
[10:53:02] <_joe_> when you delete the namespace
[10:53:07] <_joe_> what happens to all resources?
[10:53:13] <_joe_> they get destroyed as well?
[10:53:13] all resources are deleted
[10:53:30] when you delete a namespace you will see the namespace in terminating state
[10:53:31] <_joe_> why the "sleep 30"?
[10:53:50] <_joe_> to optimistically wait enough time for a ns to be removed?
[10:53:57] i didnt want to overload etcd
[10:54:06] <_joe_> makes sense, yes
[10:54:07] it gets stressed with too many helm actions
[10:55:29] <_joe_> ok so lemme start with zotero
[10:57:02] <_joe_> so the downtime cookbook doesn't work, great
[10:57:07] <_joe_> for logical hosts
[11:00:19] <_joe_> fsero: you did downtime zotero already
[11:01:10] yep
[11:01:18] the ns deletion will take some time..
[11:01:21] <_joe_> yeah
[11:01:23] <_joe_> I imagined
[11:03:12] <_joe_> ok it is down now
[11:03:37] <_joe_> hopefully k8s will be done soon as well
[11:07:17] <_joe_> "thank you for installing zotero."
[11:08:08] looks good :)
[11:08:11] zotero is working
[11:08:36] <_joe_> how did you test it?
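fsero's namespace one-liner above, laid out and commented for readability. These are the same commands in the same order, just reformatted; the directory layout is as described in the chat.

    # Run from /srv/deployment-charts/helmfile.d/admin/eqiad on the deployment host.
    for NS in termbox zotero; do
        source .hfenv
        kubectl delete ns "$NS"        # drops the namespace and every resource inside it
        helmfile -e "$NS" apply        # recreate the namespace via the admin helmfile
        pushd "../../services/eqiad/$NS"
        source .hfenv
        sleep 30                       # pause so etcd/tiller aren't hit by back-to-back helm actions
        helmfile apply                 # redeploy the service release into the fresh namespace
        popd
    done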
[11:08:42] <_joe_> I'm waiting for icinga to agree
[11:09:22] all pods running, service is present
[11:09:27] it will recover
[11:10:16] <_joe_> yeah I was looking at pybal primarily
[11:10:21] <_joe_> it's now ok AFAICT
[11:11:08] there is the recovery
[11:11:09] <_joe_> fsero: I just realized I have an interview at 3 pm
[11:11:13] <_joe_> so yeah :)
[11:11:15] i can continue
[11:11:21] <_joe_> so, I need to go to lunch now
[11:11:21] i think you got the gist of it
[11:11:23] <_joe_> yeah, thanks
[11:11:27] <_joe_> I did, yes
[11:11:35] <_joe_> I'll try to write down most of what I got today
[11:12:28] great
[11:14:13] _joe_: what's the issue?
[11:14:18] (re cookbook)
[11:14:36] <_joe_> volans: "zotero.svc.eqiad.wmnet" is a perfectly valid host for icinga
[11:14:50] <_joe_> yet the cookbook seems not to like it
[11:14:53] <_joe_> but, laters!
[11:15:09] it runs icinga-downtime -h "{hostname}" -d {duration} -r {reason}
[11:15:11] on the icinga host
[11:15:15] it's a script that is in puppet
[11:21:33] we can move the logic into spicerack and use just plain icinga commands
[11:39:03] <_joe_> volans: yeah there must be something wrong in that script
[11:47:02] effie: apergos: could you decline meetings for thursday assuming you'll be off?
[11:47:16] oh sure
[11:48:19] done, did monday too (was going to do those this evening but thanks for the reminder)
[11:48:35] if you make an "out of office" calendar event for vacations etc
[11:48:40] then it will do that automatically for you
[11:52:34] duly noted
[11:59:59] <_joe_> jijiki: mw1227 is in bad shape
[12:00:07] <_joe_> php7-only api with low specs
[12:00:57] I will have a look
[12:01:55] let it be for now
[12:01:58] so I can look
[12:02:17] <_joe_> jijiki: more interestingly
[12:02:21] <_joe_> all the servers running php7
[12:02:29] <_joe_> have a way higher cpu usage since 10:25
[12:03:03] I just saw that
[12:03:05] https://grafana.wikimedia.org/d/5E7tdiGWz/xxxx-effie?panelId=17&fullscreen&orgId=1
[12:03:45] ok go to lunch and I will start looking
[12:04:28] <_joe_> I'm back from lunch
[12:04:34] already?
[12:04:44] :D
[12:04:46] ok
[12:05:05] I will start checking for changes
[12:05:09] <_joe_> it started at 10:27
[12:05:09] around that time
[12:05:15] <_joe_> I have no idea what could've happened
[12:05:38] me neither, I was having lunch
[12:05:41] :p
[12:05:42] <_joe_> I'll restart mw1221 to understand if this is a transient problem
[12:05:47] sure
[12:06:06] I want to look a bit at SAL
[12:06:18] <_joe_> there is fabian depooling eqiad around that time
[12:07:14] <_joe_> jijiki: the spike is only on php-only servers, correct?
[12:07:16] <_joe_> https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All
[12:07:22] <_joe_> "cpu per host"
[12:08:16] yeap
[12:08:29] the other thing around that time is john deploying gs
[12:08:58] did you restart mw1227?
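Based on volans' description of what the cookbook wraps, downtiming a service hostname such as zotero's directly would look roughly like the following. The flags come from the quoted template; the duration value (assumed here to be in seconds) and the reason string are illustrative, not the ones used in this session.

    # Run on the icinga host; the downtime cookbook is just a wrapper around this script.
    icinga-downtime -h "zotero.svc.eqiad.wmnet" -d 3600 -r "eqiad k8s cluster rebuild"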
[12:09:02] or mw1221
[12:09:24] <_joe_> 21
[12:09:42] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1221 shows a server that is more busy but otherwise healthy
[12:10:17] I am doing some log checks on mw1348
[12:10:24] to see if there is anything there
[12:10:32] before we start looking deeper
[12:12:08] there are some
[12:12:10] [Tue Aug 13 10:22:15 2019] Task in /mediawiki/job/21926 killed as a result of limit of /mediawiki/job/21926
[12:12:22] <_joe_> those are absolutely normal
[12:12:33] <_joe_> now my only hope is that I'm right in my hypothesis
[12:12:50] <_joe_> fsero: where are you on the service recreations?
[12:13:42] _joe_: which one?
[12:14:02] <_joe_> jijiki: that somehow the depool of services is causing this
[12:15:56] *any* services?
[12:16:04] <_joe_> jijiki: so
[12:16:11] we can try one more thing
[12:16:11] <_joe_> flamegraph for apis in the last hour
[12:16:14] <_joe_> https://performance.wikimedia.org/arclamp/svgs/hourly/2019-08-13_12.excimer.api.svgz
[12:16:25] <_joe_> before the issue
[12:16:27] <_joe_> https://performance.wikimedia.org/arclamp/svgs/hourly/2019-08-13_09.excimer.api.svgz
[12:16:41] <_joe_> it's because php-fpm does zero connection pooling
[12:16:57] <_joe_> that's why we need a proxy to do that work for it
[12:17:13] _joe_: all services are recreated, i need to do some cleanup commits afterwards
[12:17:23] but you can repool them
[12:17:27] <_joe_> fsero: can you repool now?
[12:17:32] <_joe_> I'm pretty sure that will solve the issue
[12:17:45] I am afraid that I won't like what will happen when we pool back
[12:17:49] <_joe_> the flamegraph samples can sometimes be very useful in debugging an issue
[12:17:53] don
[12:18:01] <_joe_> jijiki: the flamegraph is pretty clear
[12:18:12] <_joe_> the apis are spending most of their time in curls
[12:18:16] need to be afk for at least half an hour
[12:18:26] <_joe_> fsero: go go!
[12:18:55] pff yeah
[12:19:16] so we will need to do what we did for cirrussearch
[12:19:37] for all other services that mw directly talks to?
[12:21:30] <_joe_> yes
[12:21:45] <_joe_> lucky for us, I am writing down a plan to that end :P
[12:21:57] <_joe_> btw
[12:22:00] <_joe_> it's not recovering
[12:22:05] I know
[12:22:15] I am giving it a little time
[12:22:18] <_joe_> or better
[12:22:29] <_joe_> the load is going down a bit, but slowly
[12:22:51] <_joe_> jijiki: maybe there is some parameter of the curl extension we can tune?
[12:22:59] <_joe_> can you take a look into it?
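Repooling is presumably just the inverse of the depool one-liner from earlier in the session; a minimal sketch, assuming pooled=yes is the value used to repool (the actual repool command is not shown in this log).

    # Repool the k8s-hosted services in eqiad after the namespaces have been recreated.
    sudo confctl --object-type=discovery \
        select 'dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad' \
        set/pooled=yes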
[12:23:14] I can check how this is working
[12:23:21] and dig
[12:23:23] k
[12:23:26] <_joe_> this seems like good old resource contention to me
[12:24:05] <_joe_> on mw1348 the load is going down fast
[12:24:48] <_joe_> ok I can also say
[12:24:52] <_joe_> given the flamegraph
[12:25:00] <_joe_> this was happening in post-send
[12:25:07] <_joe_> hence the response times not being affected
[12:25:24] <_joe_> and I'm ready to bet the issue was eventgate not playing well with depooling
[12:25:28] <_joe_> one datacenter
[12:25:50] we can verify that easily
[12:26:26] <_joe_> eventgate-main is declared active_active true
[12:26:31] we can rolling-restart the php7 servers
[12:26:37] <_joe_> so I didn't think it would cause disruption
[12:26:43] <_joe_> jijiki: nah no need
[12:26:51] <_joe_> but I've also seen wdqs having issues
[12:27:00] <_joe_> so it seems eventgate in codfw is doing something wrong
[12:27:09] <_joe_> elukey: ^^
[12:27:18] I am not following
[12:27:30] <_joe_> https://performance.wikimedia.org/arclamp/svgs/hourly/2019-08-13_12.excimer.api.svgz
[12:27:43] <_joe_> this shows what happened was inside EventBus::send
[12:27:58] <_joe_> the eventbus extension uses eventgate-main
[12:28:14] <_joe_> we did depool eventgate-main
[12:28:38] <_joe_> uhm did we also depool eventgate-analytics?
[12:28:54] ok let's stop here and verify or debunk
[12:29:12] : set/pooled=false; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad
[12:29:15] yes
[12:29:21] <_joe_> ok so
[12:29:25] I have an idea
[12:29:38] what if we briefly depool what we depooled before
[12:29:40] minus eventgate
[12:30:08] <_joe_> why do you need to verify that when you have data proving it was the problem?
[12:30:09] see how it goes
[12:30:20] <_joe_> https://performance.wikimedia.org/arclamp/svgs/hourly/2019-08-13_12.excimer.api.svgz is clear data
[12:31:16] <_joe_> now I have to leave soon, but you should maybe look at eventgate's logs from codfw, or mediawiki errors related to eventbus
[12:31:17] (I am very ignorant about eventgate-*)
[12:31:25] we could gather some data while this happens
[12:31:29] <_joe_> elukey: is otto working today?
[12:31:37] and figure out what is up
[12:31:43] I think otto is on holiday
[12:31:49] he is on holidays without connectivity up to the 26th
[12:33:30] <_joe_> ok interesting, so who knows things about eventgate?
[12:34:21] I think petr as well
[12:34:27] Petr is probably the best contact
[12:34:31] they have been doing something with otto
[12:34:42] they have worked together on eventgate
[12:34:44] I'll talk to petr
[13:25:57] jijiki: how is the rolling restart going?
[13:26:00] is it complete?
[13:29:33] just did
[13:30:07] looks ok
[13:31:24] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1565699477846&to=1565703077846&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[13:31:41] interestingly it seems to have affected PHP7 more than hhvm in latency terms
[13:32:12] but the HHVM error spike is bigger
[13:32:30] maybe HHVM is stricter about erroring when latency increases for a given request?
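To check the eventgate hypothesis it helps to confirm what is actually pooled before repooling. A sketch using confctl's get action; this assumes the discovery objects can be read the same way they are written, which is not shown in this log.

    # Show the current pooled state of the eventgate discovery records in eqiad.
    sudo confctl --object-type=discovery \
        select 'dnsdisc=eventgate-main|eventgate-analytics,name=eqiad' get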
[13:32:59] but yep, looks better now
[13:37:28] FYI please take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/529923 when you get a chance, re: high cpu vs latency alerts for appservers
[13:47:21] I would love to know what triggered that behaviour
[13:50:32] https://logstash.wikimedia.org/goto/b57089a8739d83b80884ac1e6df75843
[13:51:58] ok there is something actually worse
[13:52:34] https://logstash.wikimedia.org/goto/e0049f30637508935d7ccf5710182487 <- all php7-only servers
[13:56:53] fsero: is it ok if jakob_WMDE and I start smoke testing?
[13:59:46] i think so
[13:59:51] tarrow: go ahead
[14:01:02] fsero: cool! we'll be about an hour. I'll be hitting the eqiad service from the maintenance server on eqiad
[14:02:40] ping me if you see any issues
[14:02:52] 👍
[14:04:51] 10serviceops, 10Operations: Migrate pool counters to Buster - https://phabricator.wikimedia.org/T224572 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin2001 for hosts: `poolcounter2001.codfw.wmnet` - poolcounter2001.codfw.wmnet - Removed from Puppet master and PuppetDB - Do...
[14:19:41] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (10jijiki) @Mholloway We are getting a number of alerts where scb* servers return 504...
[14:24:33] fsero: Have you run smoke tests on services before? If so which machine did you run them from? We were planning to use mwmaint1002 after someone suggested it but we just discovered we're lacking the ability to have a python venv for it
[14:27:26] no sorry, i didnt perform a smoke test at wmf
[14:38:27] tarrow: maybe a little overkill, but what about creating a new docker image based on python and including all your deps, and then you can create a stress test release in helmfile that could be deployed in staging for instance
[14:39:26] fsero: That could be a really neat way to do it :) are you ok with us abusing our staging release just for that?
[14:39:55] Or would I make a totally new release?
[14:39:58] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (10Mholloway) The plan is to relieve pressure on mobileapps by pushing forward on the...
[14:40:07] i dont know what akosiaris and _joe_ think about it, but for me this kind of use case is legit as long as you are not doing crazy stress tests abusing network or cpu
[14:40:12] and that is clearly not the case
[14:40:25] it's basically a smoke test
[14:41:09] <_joe_> tarrow: well does your staging release call test.wikidata?
[14:41:17] <_joe_> that's still a production wiki
[14:41:40] <_joe_> so, within the limits of not killing the mw api, go on and do what you need to do
[14:42:05] <_joe_> tarrow: for smoke testing k8s I'd use the deployment servers
[14:43:01] _joe_: we were planning to use locust; am I ok using a python venv there and pulling in the deps?
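Since deploy1001 has virtualenv but no system pip, a locust environment could be assembled offline along the lines of _joe_'s suggestion just below (cloning a repo of pre-built wheels). This is only a sketch: the wheel repo is a placeholder, and it assumes virtualenv seeds pip inside the new environment.

    # On deploy1001: build an isolated environment without touching system packages.
    git clone <repo-with-prebuilt-wheels> wheels    # placeholder: a repo containing the needed .whl files
    virtualenv -p python3 venv                      # virtualenv bundles pip into the new venv
    ./venv/bin/pip install --no-index --find-links=wheels/ locust   # install only from the local wheels
    ./venv/bin/locust --version                     # quick check that the tool is usable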
[14:43:56] We know what code we want to run, and we want to point it at the real termbox eqiad, we're just not sure where to run the code [14:48:05] <_joe_> so it would be great if we could engineer the whole thing a bit [14:48:15] <_joe_> but yes, deploy1001 has "virtualenv" [14:53:04] right, but (unsurprisingly) no pip [15:21:40] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (10Mholloway) [15:26:35] <_joe_> yeah no pip [15:26:47] <_joe_> the easiest way is to clone a repo containing the wheels you need [15:33:12] _joe_: ok! so I can just checkin my venv? [15:33:25] if that's fine that's great [15:53:31] _joe_: I really think we need some proper way of doing this, it's come up before. a somewhat-generic load generator chart seems reasonable to me [15:54:51] id bet on vegeta rather than locust :P [15:55:02] https://github.com/tsenart/vegeta [16:02:07] <_joe_> locust has some advantages imho if you want to do realistic load testing, but in general cdanis yes, I agree [16:02:28] <_joe_> now we just need to find someone who has the time to work on a properly engineered solution :P [16:02:34] 👀 [22:59:41] 10serviceops, 10Scap, 10PHP 7.2 support, 10Patch-For-Review, and 3 others: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (10thcipriani) >>! In T224857#5413024, @gerritbot wrote: > Change 530014 had a related patch set uploaded (by Thcipriani; owner: Thcipri...