[00:29:40] 10serviceops, 10Cloud-VPS, 10Operations, 10Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10BBlack) It's a confusing set of things going on here, and it's going to need fixups on both the `network/data/data.yaml` side a...
[06:45:55] 10serviceops, 10Cloud-VPS, 10Operations, 10Traffic: Difficulties to create offline version of Wikipedia because of HTTP 429 response - https://phabricator.wikimedia.org/T213475 (10Kelson)
[08:12:09] <_joe_> ok so, writing here a braindump about https://phabricator.wikimedia.org/T210717
[08:13:23] <_joe_> the tl;dr is we need a proxy between php and the services
[08:13:56] <_joe_> now to solve the immediate problem, I can just add a small nginx vhost that points to search.discovery.wmnet
[08:14:49] <_joe_> but frankly I'd like the whole thing to be a bit more refined. Specifically I'd like to do as follows:
[08:15:33] <_joe_> - mediawiki points to something like `search.local:80`, which resolves to localhost
[08:16:06] <_joe_> - nginx knows which clusters are pooled or not, and directs the request to the closest pooled cluster
[08:17:02] <_joe_> and more ideally, I'd use envoy instead of nginx, because envoy has much more refined logic for acting as a "forward" proxy
[08:18:20] <_joe_> now, since this is a blocker for the deployment of php7, I'll go with the simplest implementation possible for now, but tbh we need to work on envoy for k8s anyways, so we could generalize the work
[08:19:09] <_joe_> so I'd probably start working on that solution for the mid term instead of putting too much effort into an nginx-based solution
[08:22:09] <_joe_> one of the annoying parts of using nginx is that some of the functionalities needed for such a thing to work, like setting a maximum number of connections to an upstream, only work in the commercial version
[09:19:31] <_joe_> gehel: context is, I'm trying to use nginx to do the connection pooling that was done by hhvm
[09:19:56] <_joe_> and I need to benchmark it against the unpooled option for php7 and the current hhvm setup, basically
[09:20:11] the part that I find surprising so far is the cost of SSL inside a DC
[09:20:40] <_joe_> part of it is parsing our whole cert chain every time
[09:20:44] <_joe_> as dcausse pointed out
[09:20:46] the 10ms measured by David looks huge to me (but what do I know)
[09:21:47] yep, but even for that, 10ms seems a lot
[09:22:17] <_joe_> anyways, that was for a single curl request
[09:22:33] <_joe_> what I need is to do a series of requests in a row and see the total time it takes
[09:22:36] <_joe_> basically
[09:23:30] yep
[09:24:15] and do it from php
[09:24:54] _joe_: how can I help?
[09:24:58] <_joe_> so 1) what url should I benchmark?
[09:25:06] <_joe_> 2) do you already have a script doing that?
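A local forward-proxy vhost along the lines _joe_ describes might look roughly like the sketch below. Hostnames, port and upstream are illustrative assumptions, not the actual puppet-managed configuration; the upstream `keepalive` directive is what provides the connection re-use.

# Hypothetical sketch of the "small nginx vhost" idea above; names and ports are assumptions.
upstream search_backend {
    server search.discovery.wmnet:9243;
    keepalive 100;                    # re-use TLS connections to the upstream instead of opening one per request
}

server {
    listen 127.0.0.1:80;
    server_name search.local;

    location / {
        proxy_pass https://search_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # required for upstream keepalive to take effect
    }
}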
[09:25:30] 1) dcausse was proposing just https://search.svc.eqiad.wmnet:9243/
[09:26:18] it is served by elastic and is fast enough that most of the time will be the SSL overhead
[09:26:27] the script is here https://phabricator.wikimedia.org/T130219#2160570
[09:27:27] <_joe_> oh nice and simple
[09:27:37] <_joe_> ok lemme try from mwdebug2001
[09:29:02] we could test the completion suggester itself, but performance on the elastic side will be similar (compsuggest is usually 5-10ms)
[09:29:18] <_joe_> yeah better to avoid adding variability
[09:37:16] <_joe_> ok so going via nginx takes 80 ms vs 40 ms of the curl_init_pooled, meh
[09:37:19] <_joe_> pretty shitty
[09:37:39] there is always a way to make things worse!
[09:40:04] <_joe_> I'm not sure this benchmark is great, but I surely need to refine the nginx configuration at the very least
[09:40:48] <_joe_> ok interestingly this new test showed better responses
[09:42:15] 10serviceops, 10Operations: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (10jijiki)
[09:42:27] <_joe_> meh I made a rookie mistake before
[09:44:43] <_joe_> ok, so curl_init_pooled: ~39 ms; curl_init: 171 ms (!!!); via nginx: 43 ms
[09:44:47] <_joe_> this is via HHVM
[09:44:51] <_joe_> lemme try with php7
[09:45:24] <_joe_> 4 ms is a 10% penalty but I guess we can live with that?
[09:47:02] <_joe_> interestingly php7 does a bit better with the simple curl (161 ms) and with nginx it goes to... 39 ms
[09:47:06] <_joe_> like hhvm
[09:47:20] <_joe_> using curl_init_pooled
[09:47:45] 39ms already seems high, we had ~7ms for compsuggest in november (https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?panelId=51&fullscreen&orgId=1&from=now-90d&to=now)
[09:47:59] unless that is measuring something entirely different (dcausse: opinion?)
[09:47:59] <_joe_> gehel: this is cross-dc
[09:48:06] Oh, makes sense then!
[09:48:15] <_joe_> I'm explicitly testing cross-dc to get the worst-case scenario
[09:48:25] <_joe_> now I'll test within the same dc
[09:48:43] <_joe_> "tagline" : "You Know, for Search" ahahahahah
[09:49:06] they do have a sense of humour!
[09:53:24] <_joe_> so, locally
[09:54:10] <_joe_> with curl_pooled: 2 ms, non-pooled: 16 ms, via nginx: 6 ms (hhvm)
[09:54:32] <_joe_> non-pooled: 12 ms, nginx: 1.5 ms (php7.2)
[09:54:53] <_joe_> so it seems php 7.2 outperforms hhvm thoroughly on simple http requests
[09:55:10] <_joe_> and using nginx will give us a definite advantage when we switch
[09:55:22] <_joe_> I'll report all this on the ticket
[09:55:48] nice!
[09:56:21] <_joe_> yeah this is definitely good news
[09:56:36] <_joe_> keep in mind such a benchmark is really quite flawed
[09:56:52] if you want a clinical case, btw, I just tested and https://phabricator.wikimedia.org/T176370#4789096 is still true
[09:56:57] <_joe_> the right way to benchmark this would be using ab against differently configured servers
[09:57:46] <_joe_> jynus: oh we've had plenty of revisions that render only in HHVM or only in php
[09:58:00] <_joe_> during the last migration; I don't expect that to be different this time
[09:58:58] <_joe_> although we seem to keep getting examples of pages that work in php7.x and not in hhvm, but I suspect a selection effect is happening: I'm basically the only person that navigates all of our sites in php7 :P
[10:00:21] _joe_: thanks!
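The actual benchmark script lives behind the phabricator link above and is not reproduced in the log. A minimal PHP sketch of that kind of loop could look like the following; the URL, iteration count and timing format are illustrative assumptions, and `curl_init_pooled` (the HHVM-only pooled variant being measured above) is not shown since it does not exist in PHP 7.

<?php
// Rough sketch of the kind of timing loop discussed above (not the script
// from T130219#2160570). Compares a fresh curl handle per request with a
// single re-used handle. TLS verification options are omitted for brevity.
$url = 'https://search.svc.eqiad.wmnet:9243/';   // illustrative target
$n   = 100;                                      // illustrative iteration count

// Fresh handle each time: pays the TLS handshake / cert-chain cost per request.
$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}
printf("unpooled: %.1f ms/req\n", (microtime(true) - $start) * 1000 / $n);

// Re-used handle: the connection (and TLS session) stays alive across requests.
$start = microtime(true);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
for ($i = 0; $i < $n; $i++) {
    curl_exec($ch);
}
curl_close($ch);
printf("reused handle: %.1f ms/req\n", (microtime(true) - $start) * 1000 / $n);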
[10:08:38] 10serviceops, 10Operations, 10Patch-For-Review, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki)
[11:15:44] 10serviceops, 10CirrusSearch, 10Discovery-Search, 10Operations: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10Joe) I did some *very* lame benchmarking of the response of the banner url for elasticsearch (`/`), with the following code: `
https://github.com/etcd-io/etcd/blob/688043a7c2ac8bbb0e73ca5694c7815275865e24/Documentation/upgrades/upgrading-etcd.md i'll leave this around
[13:37:42] discussing with the etcd developers, they recommend going to 2.3, then 3.0, then through every minor (3.0 -> 3.1 -> 3.2)
[14:08:42] <_joe_> fsero: that's not really sensible imho, we're going to replicate the data, we won't upgrade the cluster in place
[14:09:00] <_joe_> there was even a guide pointing to that method
[14:09:07] <_joe_> for kubernetes
[14:09:16] i wasn't proposing to upgrade in place
[14:09:19] i never do that
[14:09:28] i have fought with several people about that
[14:09:38] my intent was about how to migrate data properly
[14:09:44] <_joe_> so if you're not doing that, I'm pretty sure which version of etcd3 you move to doesn't mean much
[14:09:57] <_joe_> but I might be wrong, lemme find that old document
[14:09:58] you need to migrate data at least to etcdv3 format
[14:10:07] i know which one
[14:10:21] https://gravitational.com/blog/kubernetes-and-offline-etcd-upgrades/
[14:10:23] this one?
[14:14:05] so my point was specifically about the data migration; if we can do the data migration directly to the latest version, then better
[14:14:06] :)
[14:19:45] <_joe_> I think it's possible, yes, and we can just try, we have a spare etcd cluster :P
[14:19:57] <_joe_> that we just need to reinstall with stretch
[15:42:34] akosiaris: we don't have kube part 3 today right?
[15:44:14] <_joe_> akosiaris: slacker
[15:46:28] o/
[15:46:44] just making sure I don't miss it
[15:47:12] current q (from mw-sec): can I infer the deployed DC in a chart somehow?
[15:47:25] or, i suppose that just needs to be set during chart install/deployment?
[15:47:36] scap-helm install ... --set datacenter=eqiad ?
[15:47:46] fsero: ^ ?
[15:48:32] <_joe_> so that's one possibility, but you're really asking to have different config by datacenter
[15:48:38] <_joe_> why?
[15:48:50] <_joe_> need to connect to different hosts?
[15:49:50] no, need to prefix topic names
[15:49:52] based on DC
[15:50:20] I mean, yes, we need to connect to different hosts too
[15:50:22] :)
[15:50:54] actually, i was thinking about that, maybe we should set up an LVS for Kafka. it would only be used for discovery (as kafka itself will return the brokers to actually use after bootstrap)
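One way the `--set datacenter=eqiad` idea raised above could surface inside a chart is sketched below; the value name, the `wmf.releasename` helper usage and the topic-prefix key are illustrative assumptions, not the real eventgate chart.

# values.yaml (assumed default, overridden per deploy with e.g. --set datacenter=codfw)
datacenter: eqiad

# templates/configmap.yaml (fragment)
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "wmf.releasename" . }}-config
data:
  # prefix topic names with the deploying datacenter, e.g. "eqiad.some-topic"
  topic_prefix: "{{ .Values.datacenter }}."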
[15:51:02] <_joe_> that latter part should be solved with discovery records
[15:51:05] aye
[15:51:08] that would make sense
[15:51:12] i'll make a task for that
[15:51:19] that'll simplify a lot of kafka config stuff
[15:51:21] <_joe_> or even some SRV records
[15:51:22] in other places too
[15:51:33] <_joe_> as far as integrating shit from puppet into your charts
[15:51:42] aye
[15:51:52] <_joe_> I would say the best thing I can come up with is you fill a default values file
[15:52:00] <_joe_> from puppet
[15:52:03] <_joe_> but that's lame
[15:52:19] <_joe_> anyways, it's a good question, one we have to solve
[15:52:33] <_joe_> not /in the charts/, mind you
[15:52:50] <_joe_> but in how we can interpolate values in deployments that come from puppet
[15:53:30] <_joe_> ottomata: if we were insane, we'd develop a puppet native resource for managing helm deployments
[15:53:51] <_joe_> I'm sure fsero is already crying just because I named the possibility
[15:54:04] _joe_: a generic puppet values file is not a bad idea. one with just simple defaults (not app specific ones)
[15:54:09] like $::site, etc.
[15:54:27] oh, but will that work? how does scap-helm work in prod? I'd assume it deploys to both DCs?
[15:55:15] <_joe_> it does, but IIRC we have dc-specific values?
[15:55:49] <_joe_> ottomata: it would be helpful if you could try to generalize your problems a bit and formulate tasks
[16:03:47] ottomata: passing the dc as a helm value is one option, and then you can configure the topic
[16:04:52] another option would be to set a kubernetes label on the deployment and then use the downward api https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#the-downward-api
[16:07:35] _joe_: thanks for the nginx proxy! I'm having a look at the patches
[16:08:05] so the current implementation is routing based on port only, no name based routing
[16:08:43] this is fine for elastic, but I remember a number of conversations around how we want to route in case of multiple instances
[16:09:00] name based routing did pop up a number of times
[16:09:34] with this proxy, we're putting another nail into that name based routing coffin...
[16:09:51] I'm not complaining, just wanting to make sure it's not something we'll regret at some later point
[16:10:05] re the puppet option to manage the values file _joe_, yes you made me cry, and for that you deserve this https://forge.puppet.com/puppetlabs/helm - why would we want to write one when there is one available
[16:10:47] ottomata: imo a datacenter value would be the easiest option, at least for now
[16:17:22] yeah, that will be easiest to start with, but i'm not sure how that would work with scap-helm if it deploys to both DCs; we'd have to do two deploys with --limit (?)
[16:17:29] and two different --set datacenter=
[16:17:37] _joe_: ok i'll make a task
[16:19:22] presumably passing '--set datacenter=' to each helm invocation could be part of scap-helm's contract, right?
[16:24:32] fsero: downward API, ok i think i get it.
[16:24:51] is that something the pods need to be created with (the labels?) or are they assigned from the nodes they run on?
[16:25:03] e.g. if a node is in eqiad, it could/should have a datacenter=eqiad label?
[16:33:12] <_joe_> the problem is that if annotations need to be added to the deployment, they need to come from a values file, right? and we need them changing by dc.
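For the downward API option fsero links above, a deployment could carry a datacenter label and expose it to the container roughly as sketched below. The label name, value and image are illustrative; the label has to be set on the pod template at deploy time (for example from a helm value), it is not inherited from the node the pod lands on.

# Hypothetical Deployment; names, label and image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
        datacenter: eqiad            # set per-DC at deploy time
    spec:
      containers:
      - name: app
        image: example/app:latest    # placeholder image
        env:
        - name: DATACENTER
          valueFrom:
            fieldRef:
              # downward API: expose the pod's own label as an env var
              fieldPath: metadata.labels['datacenter']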
[16:38:35] _joe_: another option would be to populate a generic configmap, as a contract, in each namespace, called wmf-config, including the datacenter and other generic values
[16:38:48] and how we create that is another discussion in itself
[16:39:18] but it would let users get the datacenter they are in, among other things like kafkas, redises, etc.
[16:39:45] cdanis: probably we should drop scap-helm (it is a thin wrapper on top of helm) or at least make several amendments to it
[16:39:53] hm
[16:41:18] i think we have a task around for setting up how to do helm deployments and keep them in code
[16:41:32] i'll share the task, or if not create it, so you can be kept in the loop cdanis
[16:41:45] need to run now sorry :)
[16:41:53] have a good weekend :)
[16:52:36] 10serviceops, 10Analytics, 10EventBus, 10Services (watching): Datacenter aware configs for EventGate topic prefixes - https://phabricator.wikimedia.org/T213564 (10Ottomata) p:05Triage→03Normal
[16:53:15] added ^
[16:58:57] <_joe_> ottomata: thanks
[17:00:48] 10serviceops, 10Analytics, 10EventBus, 10Services (watching): Datacenter aware configs for EventGate topic prefixes - https://phabricator.wikimedia.org/T213564 (10Pchelolo) > can render the service-runner config.yaml template with values provided by it. We can also include it as an env variable into the c...
[17:18:00] q: what's the difference between the wmf.chartname template function and the helm .Release.Name?
[17:18:13] or wmf.releasename
[17:18:15] i guess
[17:18:41] oh, truncated.
[17:18:42] ok
[18:04:27] hashar says this works: `ServerName https://doc.wikimedia.org`. I say that's invalid syntax ... but he actually tested :o
[18:04:48] context is a DirectorySlash redirect issue https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483775/
[18:05:09] it immediately jumped out at me as "that can't be right" .. but ...
[18:05:27] protocol in ServerName, that is, ofc
[19:46:22] 10serviceops, 10Operations, 10Thumbor, 10Patch-For-Review, 10User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (10jijiki)
[19:48:22] 10serviceops, 10Developer-Advocacy, 10Gerrit, 10Operations: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611 (10herron) p:05Triage→03Normal
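On the ServerName question above: hashar is right, Apache's ServerName directive accepts an optional scheme (and port), and that is how self-referential redirects such as the DirectorySlash one can be made to use https when TLS is terminated in front of the backend. A minimal sketch, with vhost details that are illustrative rather than the real doc.wikimedia.org config:

# Scheme in ServerName is valid syntax: ServerName [scheme://]domain-name[:port]
<VirtualHost *:80>
    ServerName https://doc.wikimedia.org
    UseCanonicalName On            # build self-referential URLs from ServerName
    DocumentRoot /srv/doc          # placeholder path
</VirtualHost>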