[07:11:56] good morning :)
[07:12:21] my patch for the prometheus mcrouter exporter has been merged by upstream (to enable per server/shard metrics)
[07:12:52] I created the new deb package and tested it on deployment-mediawiki-07, looks good
[07:13:36] if everybody is ok I'll upload the new package, roll it out, and slowly enable the new option (that needs a puppet change)
[07:15:24] +1
[07:25:02] testing it on mw1315 first
[08:02:30] first results only from mw1315 https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1
[08:02:35] I created 3 rows
[08:03:07] and grouped by memcached server in the last two rows
[08:04:49] thank you elukey !
[08:04:57] the set of memcached's results can be expanded with future versions of the exporter
[08:05:04] :)
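A minimal sketch of the single-host canary described above; the package name, systemd unit, metrics port and metric prefix below are assumptions, not taken from the log:

```
# Hypothetical canary of the rebuilt exporter on one host (e.g. mw1315).
sudo apt-get update
sudo apt-get install prometheus-mcrouter-exporter     # assumed package name
sudo systemctl restart prometheus-mcrouter-exporter   # assumed unit name
# Confirm the new per-server metrics are exposed before rolling out further;
# the port and the mcrouter_ prefix are guesses, check the puppet config.
curl -s "http://localhost:9151/metrics" | grep -c '^mcrouter_'
```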
[08:11:58] 10serviceops, 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10hashar) >>! In T226236#5296968, @MoritzMuehlenhoff wrote: > Well, if there's a security update for Docker we'll want it...
[08:39:21] 10serviceops, 10Wikidata-Termbox-Iteration-19: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10Tarrow)
[08:39:36] 10serviceops, 10Wikidata-Termbox-Iteration-19: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10Tarrow) a:03Tarrow
[09:02:13] 10serviceops, 10Wikidata-Termbox-Iteration-19: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10Tarrow)
[09:02:21] 10serviceops, 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Tarrow)
[09:47:35] 10serviceops, 10Continuous-Integration-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) 05Open→03Resolved Perfect thank you @thcipriani , those tests were exactly...
[09:51:39] 10serviceops: docker registry swift replication is not replication content between DCs - https://phabricator.wikimedia.org/T227570 (10fsero)
[09:52:16] 10serviceops: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fsero)
[10:17:02] Hi! I was looking at the infrastructure needed for the termbox SSR on test and started taking a crack at the needed patches. In the process I think I found this: https://gerrit.wikimedia.org/r/c/operations/dns/+/521457 if anyone could take a look?
[10:19:11] tarrow: I will let alex +1 it, and I can merge it for you
[10:19:45] jijiki: Thanks! :)
[10:19:54] :)
[11:04:15] 10serviceops, 10Wikidata-Termbox-Iteration-19, 10Patch-For-Review: Create termbox release for test.wikidata.org - https://phabricator.wikimedia.org/T226814 (10Tarrow)
[11:07:29] 10serviceops, 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Tarrow) @akosiaris Happy to say that we now have code sending out request metrics from master. In our investigation about going...
[11:20:00] 10serviceops: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fsero)
[11:23:08] 10serviceops: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570 (10fgiunchedi) What I could find so far from swift logs: Four PUTs last friday, one in codfw and three in eqiad. As expected the one in codfw is from the registry and the rest are f...
[11:52:19] hi all, is someone able to tell me the best way to restart mcrouter?
[13:03:51] elukey: ^^^
[13:19:16] jbond42: tell me
[13:19:30] what do you want to do
[13:19:35] (sorry was in a meeting)
[13:19:52] * elukey just seen the ping, but it is already handled :)
[13:20:09] need to restart mcrouter on a few mw servers and not sure if they need any special treatment
[13:20:46] but it can wait if you are still in a meeting
[13:21:53] jbond42: if you depool them, it should be just fine
[13:21:59] meeting is done :)
[13:22:15] ok cool so just depool and restart, thanks
[13:31:36] yep, ping me if something goes funny
[13:31:40] but it shouldn't
[13:32:07] thanks, it looks like it's gone fine
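A sketch of the depool → restart → repool sequence agreed on above, run one host at a time; depool/pool are the conftool wrapper scripts on the MediaWiki app servers, and the mcrouter unit name is assumed:

```
# Take the host out of the serving pools, restart mcrouter, check it, repool.
sudo depool
sudo systemctl restart mcrouter.service
sudo systemctl is-active mcrouter.service   # make sure it came back before repooling
sudo pool
```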
[14:18:36] akosiaris: If you have a moment in the near future could you take a look at: https://phabricator.wikimedia.org/T226814 and its patches? I think that should be all we need to start using the service on test
[14:24:44] 10serviceops, 10Operations, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10Pchelolo)
[14:28:35] tarrow: sure, I am not sure we should be adding a new LVS service though
[14:29:01] we've recently had some concerns about the scalability of it, I'll reach out to traffic to make sure
[14:29:15] akosiaris: thanks!
[14:30:36] akosiaris: if other people are pinging you, I will do too :)
[14:30:56] about restrouter k8s deployment and the load testing
[14:31:17] so, Marko said you guys wanted to test every endpoint. I think it's a major major overkill
[14:32:24] I've done some and the results are pretty much the same for all, do you think you'd be ok with this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/512923/6/charts/restrouter/values.yaml#23 ?
[14:35:19] Pchelolo: well, the danger is that if some endpoint does end up not meeting your guesstimated values, the pod will end up sized wrongly and when for whatever reason that endpoint ends up being exercised more than you expect it will more or less crumble pretty quickly (and not very visibly)
[14:35:30] and by visibly I mean it might not be very easy to debug
[14:35:56] how many endpoints are we talking about btw?
[14:36:15] akosiaris: all endpoints in RB - that's over 50
[14:36:37] the thing is that in RESTBase/RESTRouter most of them do exactly the same thing
[14:37:19] they either proxy through to the backend service or proxy to storage
[14:38:05] these 2 cases are quite different, but within the 2 cases it doesn't really matter which backend service you're proxying to.
[14:39:08] well, you can group those under 1 big group I guess, test 1 and just do the math based on current access patterns. What about the other endpoints though? I am guessing we want some perf tests against those?
[14:39:32] akosiaris: I've got this grouping https://phabricator.wikimedia.org/T226538#5314652
[14:40:03] this should pretty much cover different things RB does imho
[14:40:27] also, I'm wondering why are we even doing that and not just taking current data from production?
[14:41:02] cause production doesn't have the notion of pods ?
[14:41:13] but you can take traffic stats of course
[14:41:34] and use them to extrapolate expected pod usage
[14:42:09] akosiaris: so, we have for example https://grafana.wikimedia.org/d/000000068/restbase?panelId=4&fullscreen&orgId=1 - here we can figure out the request/limit for memory
[14:43:10] Pchelolo: remind me, that's per process (aka service-runner worker) right?
[14:43:23] akosiaris: ye, should be per process
[14:43:31] as, per worker
[14:43:50] for limits we know that service-runner currently kills a worker if the heap grows > 700 megs
[14:44:13] is doing something like this `KUBECONFIG="/etc/kubernetes/termbox-staging.config" kubectl logs foo bar` the recommended way of seeing the logs? Or is there some kubectl wrapper like scap-helm?
[14:44:14] so, we can have a limit of 750 megs per worker, 2 workers per pod = 1.5gig per pod
[14:44:26] tarrow: logstash?
[14:44:44] that should be easier than fetching logs manually
[14:44:55] but that will work
[14:45:38] tarrow: kubectl logs -l INSERT_LABEL here works slightly better since it groups all the logs from all pods
[14:46:00] however the recommended way should be logstash
[14:46:27] Pchelolo: we are going to go with 2 workers per pod? I guess it should give us some hints whether that's a better approach than 1 worker per pod.
[14:46:28] fsero: but I think I then default to localhost: `The connection to the server localhost:8080 was refused - did you specify the right host or port?`
[14:46:46] akosiaris: ye, RB takes a REALLY long time to start up
[14:46:53] you need KUBECONFIG tarrow
[14:47:04] if not you are pointing to the default api server localhost:8080
[14:47:09] so we really don't want to have dangling masters if a single worker dies
[14:47:36] ok
[14:47:59] 10serviceops, 10Operations, 10observability, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10elukey) Makes sense, I am now wondering if we should create a generic and configurable alarm or not :)
[14:48:02] fsero: Right! but there isn't a snazzy kubectl wrapper that adds it?
[14:48:19] not now
[14:48:25] maybe soon :)
[14:48:31] Pchelolo: I see mean RSS around ~450MB so that should be the requests section for memory
[14:48:41] limits sounds fine to me to have 1.5G
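Putting the memory arithmetic above together (requests from the ~450MB mean RSS per worker, limits from two workers times the ~750MB service-runner kill threshold, plus the raised CPU limit that comes up just below), a purely illustrative set of overrides; the --set key names are assumptions, not the restrouter chart's actual values schema:

```
# Illustrative only: the sizing from the discussion above as helm overrides.
# The key names (main_app.*) are assumptions about the chart's values schema.
#   requests.memory ~ mean RSS per worker (~450MB) plus a little headroom
#   limits.memory   ~ 2 workers x ~750MB service-runner heap kill threshold
#   limits.cpu      ~ raised ceiling for the next benchmarking round
helm upgrade restrouter charts/restrouter \
  --set main_app.requests.memory=500Mi \
  --set main_app.limits.memory=1500Mi \
  --set main_app.limits.cpu=2
```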
[14:49:00] CPU wise though you will still need the benchmarking process, right ?
[14:49:22] akosiaris: as you write in the doc... it's tricky :)
[14:49:28] given that the boxes also host cassandra you can't really separate restbase from cassandra
[14:49:34] quite tricky indeed
[14:49:40] fsero :D cool. I was just going to stick something on wikitech about how to run kubectl and wanted it to be right
[14:49:40] akosiaris: top -u restbase :)
[14:50:13] but with cpu I've had 2 very different results..
[14:50:28] when we go directly to storage, I can max out CPU quite easily
[14:50:42] tarrow: premium customer preview
[14:50:48] https://www.irccloud.com/pastebin/FdtedvwA/
[14:50:57] when we proxy to a backend service, no matter what I do I'd get really low CPU usage - which is all understandable
[14:52:07] the problem is that it's very hard to calculate the ratio between these 2 very different modes of operation
[14:53:05] well, we do have the numbers in grafana don't we? It's by endpoint isn't it ?
[14:53:28] so we could come up with the ratio we currently see in production, right ?
[14:53:58] with a reasonable error margin, yes
[14:54:00] well, an approximation ofc. Being too precise here won't pay off. Past experiences don't guarantee future ones and all
[14:55:29] Pchelolo: looking at https://phabricator.wikimedia.org/T226538#5314652 . I'd say increase the limit to 2k+ (or even more) millicores and see if you can hit the ceiling again. 1K seems to be artificially capping it right now
[14:56:03] 10serviceops, 10Operations, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki)
[14:56:07] But this looks pretty cool already.
[14:56:12] akosiaris: will try.
[14:56:43] last kinda dumb question - how do I get the tgz and the index.yaml entry? is there a script?
[14:57:00] it's not like we are going to etch the pod size in stone. We just want good guidelines to inform the scheduler
[14:57:07] fsero: looks awesome!
[14:57:29] Pchelolo: helm package restrouter; helm repo index . ; git add restrouter-X.Y.Z.tgz index.yaml
[14:57:39] we have a TODO task to automate that last part
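The chart release steps just outlined, spelled out with comments; they assume a checkout of the deployment-charts repository with the restrouter chart in the current directory, and X.Y.Z stays whatever the chart's Chart.yaml says:

```
# Package the chart and refresh the repo index, then commit both artifacts.
helm package restrouter      # writes restrouter-X.Y.Z.tgz into the current directory
helm repo index .            # regenerates index.yaml to include the new tarball
git add restrouter-X.Y.Z.tgz index.yaml
git commit -m "restrouter: new chart release"
```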
[14:58:11] ok. Great. Thank you. I will come back to you tomorrow with some more mathematics and, I guess, the final version of the helm chart
[14:58:14] power outage
[14:58:52] I am on UPS power, not sure how long it's going to last, I might disappear suddenly
[14:59:10] power outage on one of the hottest days of the year. How nice
[14:59:28] I am impressed that the grid lasted this much really
[17:43:39] the PhD student who studies Puppet is back again. "Previously, you have helped me with survey participation, and I really appreciate it. This time I am working on types of defects that happen for Puppet code. " heh
[17:44:09] last time i answered some questions and won his 50$ Amazon gift card
[18:43:26] 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Workboards (Team 2): k8s liveness check(?) generating session storage log noise - https://phabricator.wikimedia.org/T227514 (10WDoranWMF)
[18:48:55] 10serviceops, 10Operations, 10Release Pipeline, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Pchelolo) > Regarding the deployment plan, the main pain point is that we will need to...
[20:29:05] 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, 10Services (watching): Undeploy electron service from WMF production - https://phabricator.wikimedia.org/T226675 (10Pchelolo) Seems like only 2 items are remaining: [] Remove from DNS (SRE) [] Manually stop and disable s...
[20:40:46] 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, 10Services (watching): Undeploy electron service from WMF production - https://phabricator.wikimedia.org/T226675 (10Dzahn) >>! In T226675#5312750, @jijiki wrote: > We are going to break this into steps: Going through the...
[20:44:58] stopped pdfrender service on scb* for good
[20:45:08] merged change to delete remaining classes
[20:45:19] uploaded one more change to remove remaining DNS records
[20:45:42] we probably also still want to delete files on scb* (find -name pdfrender* -delete ?)
[20:45:54] updated ticket with checkbox
[21:13:57] 10serviceops, 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, and 2 others: Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not i guess?) then we deco...
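On the leftover-files question above ("find -name pdfrender* -delete ?"), a cautious sketch that lists first and deletes only after review; the search paths are guesses at likely locations, not a known inventory of what pdfrender left behind on the scb hosts:

```
# List pdfrender leftovers on an scb host before removing anything.
sudo find /srv /etc /var/log -maxdepth 3 -name 'pdfrender*' -print
# After reviewing the output, the same expression with -delete removes it:
# sudo find /srv /etc /var/log -maxdepth 3 -name 'pdfrender*' -delete
```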