[09:28:58] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10akosiaris) Copying from the last comment of https://gerrit.wikimedia.org/r/...
[13:10:27] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Ottomata) @akosiaris Hm, yes, let's try! We are going to have issues with...
[13:28:31] akosiaris: o/ shall we move some more eventstreams clients to k8s?
[13:29:48] ottomata: o/
[13:29:59] sure
[13:30:41] we are at ~2% right now
[13:30:53] which isn't much. https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s?orgId=1&refresh=1m&from=now-7d&to=now at least has almost nothing
[13:31:20] yeah
[13:31:49] 8% ?
[13:32:23] there are currently about 70 connected clients
[13:32:41] let's go to 20%
[13:34:06] ok
[13:36:54] ottomata: done. We are at 20%
[13:37:15] but... with the long-lived connections, we either have to force them to reconnect or just wait for it to happen on its own
[13:37:16] <_joe_> ottomata: can you add the tunable tls limits to eventgate though? I'd need that to happen this week
[13:37:59] akosiaris: ya we could restart the service on an scb host or 2
[13:38:05] _joe_: sure
[13:38:08] can do that
[13:38:35] <_joe_> <3
[13:38:37] I should do that rather than resolving the routing_tag stuff into templates? probably will be less resistance eh?
[13:39:21] <_joe_> whatever you prefer, I'm mainly interested in timeboxing the migration at this point
[13:39:30] ottomata: probably faster indeed
[13:39:47] k
[13:40:01] akosiaris: to move more es traffic to k8s, is it just a matter of pooling more kube nodes?
[13:40:04] via confctl?
[13:40:44] all are pooled now
[13:41:21] oh akosiaris all I need to do is restart es on an scb then?
[13:41:50] it's a matter of calculating 6*x / (6*x + 46) = wanted %, solving that and then sudo confctl select 'dc=eqiad,service=eventstreams,name=kubernetes.*' set/weight=2 on cumin1001
[13:42:13] sorry
[13:42:19] s/weight=2/weight=x/
[13:42:39] and if you want to force the redistribution instead of waiting it out, yes, restart eventstreams
[13:43:02] with 70 clients in all, I am this close to saying "just depool scb" tbh
[13:43:23] I mean if we see nothing in logs today... let's just wrap it up?
[13:48:06] ah
[13:48:25] agree let's just get stuff over to k8s quickly if we can
[13:48:30] ok so to fix the port problem
[13:48:58] hmmm
[13:49:15] not as simple as the LVS renames with new ports we did when everything was in k8s
[13:49:22] there are different hosts and different ports
[13:49:30] e.g. check_https_lvs_on_port!eventstreams.discovery.wmnet!4892!/_info
[13:49:55] i think i need a new LVS IP, ya?
[13:53:53] akosiaris: ^?
[13:54:51] port, not IP
[13:55:00] but yes, we need a new LVS service it seems
[13:55:11] trying to think of another way out of this, but it doesn't seem easy
[13:55:25] OH right the port is given by the requestor always
[13:55:26] ok
[13:55:52] the alternative is to have eventstreams on k8s move back to HTTP for a bit, finish the migration and then do exactly the same dance for the move to TLS
[13:56:02] which I am not sure really buys us anything
[13:56:29] oh so we could do an incremental rollover?
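The 13:41 weight formula can be made concrete with a small sketch. Assumptions not spelled out in the log: 6 is the number of pooled kubernetes backends, each given weight x, and 46 is the combined weight of the existing scb backends, so the k8s share of traffic is 6*x / (6*x + 46); the helper name below is hypothetical, not actual WMF tooling.

```python
# Sketch of the 13:41 weight calculation. Assumptions (not spelled out in the
# log): 6 is the number of pooled kubernetes backends, each given weight x,
# and 46 is the combined weight of the existing scb backends, so the k8s
# share of traffic is 6*x / (6*x + 46).

def k8s_weight(wanted_fraction: float, k8s_backends: int = 6,
               scb_total_weight: int = 46) -> float:
    """Solve 6*x / (6*x + 46) = wanted_fraction for the per-backend weight x."""
    return wanted_fraction * scb_total_weight / (k8s_backends * (1 - wanted_fraction))

for pct in (20, 50):
    x = k8s_weight(pct / 100)
    print(f"{pct}% -> weight x ~= {x:.2f}  "
          f"(confctl select 'dc=eqiad,service=eventstreams,name=kubernetes.*' "
          f"set/weight={round(x)})")
```

For 20% this gives x ≈ 1.9, which lines up with the set/weight=2 command quoted in the log.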
[13:56:34] hm since the client connections are long-lived anyway
[13:57:03] can we just swap out the port for new connections
[13:57:07] yeah we would be able to do the incremental rollover but we would also have to update the chart to publish both the TLS and the non-TLS service
[13:57:08] sigh
[13:57:09] without having the old ones killed?
[13:57:40] if we change the port that ats routes to, will it break the existing connections?
[13:57:44] not sure what ATS would do
[13:57:50] ema ^ ?
[13:58:43] what we are essentially discussing is allowing previously established connections to survive on a port that is not in the ATS configuration
[13:58:56] I would expect the software to honor the new configuration tbh
[14:00:10] on the other hand... it's just 70 clients. Maybe just announce a maint window and just do it?
[14:00:42] ya, i feel fairly confident it will work
[14:00:50] (famous last words)
[14:00:53] lol
[14:05:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/578525
[14:05:58] ottomata: ok, so here's a plan. We switch over the LVS configuration in puppet and release it across the fleet but don't restart pybal. icinga shouldn't page because there's the critical: false flag already. But we will schedule downtime anyway.
[14:05:58] akosiaris: ^
[14:06:26] ah, you add a new service?
[14:06:33] oh ya, but i guess we could do it all at once
[14:06:41] thought ^ would make rollback easier, we just change the ats config
[14:06:41] actually maybe it's better
[14:06:48] but maybe LVS switch is better?
[14:06:48] dunno
[14:06:55] whichever you think is best
[14:07:00] just switching it is definitely riskier
[14:07:37] ya this way we have both lvs services up, and just choose which one to route frontends to
[14:08:03] akosiaris, ottomata: I don't think we've ever tried, give it a go and let me know what happens :)
[14:08:11] hahah ok
[14:08:26] lol
[14:08:33] yeah, let's avoid that
[14:08:35] I'd expect persistent connections to keep on working, and new ones to be established using the new port
[14:08:35] ProxyFetch:
[14:08:35]   url:
[14:08:35]     - http://localhost/_info
[14:08:41] ottomata: https:// ?
[14:09:09] ema: ah, so you expect the inverse from me. OK now you have me intrigued
[14:09:27] but I ain't gonna try it in production :P
[14:10:10] ah ya
[14:10:18] oh akosiaris doesn't pybal mangle that??
[14:10:28] ah no you are right
[14:10:29] https
[14:10:30] fixing
[14:10:46] it mangles Host: and port. It doesn't mangle scheme and url
[14:19:53] akosiaris: patched
[14:20:47] merging
[14:21:41] ottomata: ah wait, we can't switch over the traffic to TLS endpoints without the changes from https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/578478/ incorporated in the chart
[14:21:57] or at least, it would not be prudent
[14:22:11] it might work just fine with just 70 long-lived connections
[14:22:35] in fact, exactly because they are so few and long-lived it probably would
[14:22:50] haha ok
[14:22:52] but it would be prudent to first incorporate that
[14:22:53] well let me conquer that first
[14:22:55] gotta do it anyway
[14:23:09] I'll set up LVS in the meantime
[14:23:59] k danke
[14:30:43] _joe_: which are the 'tunable tls limits'? are they all in _tls_helpers.tpl?
[14:30:48] if so, I think eventstreams has them already
[14:30:53] (eventgate does not)
[14:31:08] is it just {{ .Values.tls.upstream_timeout }} ?
[14:31:13] <_joe_> not just that
[14:31:18] <_joe_> the resource limits
[14:31:25] <_joe_> see the latest version of the helpers
[14:31:27] <_joe_> alex added it
[14:31:36] OH maybe i am not up to date
[14:32:09] ah yeah ok
[14:32:12] i see em
[14:38:36] ottomata: didn't work btw, I had to roll back. pybal depooled all backends because of the conftool cluster change. Back to the drawing board
[14:38:57] uhhhh ooook
[14:39:24] https://grafana.wikimedia.org/d/000000336/eventstreams?orgId=1&refresh=1m&from=now-30m&to=now
[14:39:29] at least they reconnect fast
[14:40:25] ah good, ya, most SSE clients will auto-reconnect. they should also auto-resume from where they left off! :)
[14:40:27] in the stream
[14:43:02] akosiaris: should I make an e.g. common_templates/0.2-beta for a _tls_helpers.tpl used by eventstreams and eventgate and symlink them, or should I just copy/paste into each of those charts the stuff I need
[14:43:05] ?
[14:46:04] I'd say copy them over for now? And we probably want to revisit that, figure out which approach makes sense globally and bring it into version 0.2
[14:48:10] ok
[14:49:05] hm akosiaris the default tls limit is 500Mi, and i was setting eventstreams limits to 1900Mi
[14:49:14] are those incompatible because of a 2G global pod limit?
[14:49:39] eventstreams should work at 1500, it will just handle fewer clients, but probably be within our range.
[14:49:45] we could increase replicas a bit too if we have to (eventually)
[14:49:49] ah indeed. good point.
[14:50:19] i'll just do es at 1500 and we can modify things later if we need to
[14:50:24] +1
[14:52:01] is there a global CPU limit i'll bump into?
[14:52:22] global? like a quota ?
[14:52:27] per pod
[14:52:28] like mem
[14:52:31] ah yes
[14:52:39] 3 CPUs
[14:53:06] ok am under that then, es has 2000m as limit
[15:03:59] we are really not consistent with our indentation of yaml arrays are we? :p
[15:04:13] do you guys have a pref akosiaris ?
[15:04:30] do you prefer indented level with the name or with 2 extra spaces?
[15:04:30] e.g.
[15:04:32] array:
[15:04:34] oops
[15:04:46] array:
[15:04:47] - item1
[15:04:47] - item2
[15:04:47] OR
[15:04:59] array:
[15:04:59]   - item1
[15:04:59]   - item2
[15:05:00] 2 spaces please :P
[15:05:03] k
[15:05:16] but in the charts I've tried to be as consistent as possible
[15:05:25] I think vim will even autocorrect these things for me
[15:06:09] i will fix things where i see them different in my charts then
[15:06:12] to use 2 spaces
[15:07:23] fwiw, not my personal preference, but I've "inherited" it and just decided to stick with it for consistency's sake
[15:12:05] ok akosiaris https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/578537
[15:12:57] ottomata: take #2. I'll try not to create an outage now https://gerrit.wikimedia.org/r/#/c/578538/
[15:13:56] the outage was caused by the rename of the old one?
[15:14:02] yes
[15:14:12] k, can we eventually rename? :)
[15:14:20] the new "service" in conftool ended up being empty
[15:14:21] to just keep 'eventstreams' ?
[15:14:34] yes, like we did with the other stuff
[15:14:36] cool
[15:15:01] it looks like we are ok this time around
[15:21:28] coo
[17:21:43] hmmmm
[17:22:10] nm
[18:17:02] 10serviceops, 10Operations, 10Performance-Team, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10Krinkle) I've gone through dozens of very old diffs from unpopular pages (hoping for a cache mi...
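The pod-limit arithmetic from the 14:49–14:53 exchange above can be sketched as follows. Assumptions not stated in the log: "2G" is read as 2 GiB (2048 MiB), the caps apply to the sum of all containers in the pod, and the TLS sidecar's CPU limit (200m) plus the `fits` helper are illustrative placeholders, not actual cluster configuration.

```python
# Sketch of the per-pod limit check discussed above. Assumptions (not from the
# log): "2G" is taken as 2 GiB = 2048 MiB, the caps apply to the sum of all
# containers in the pod, and the TLS sidecar's CPU limit (200m) is a placeholder.

POD_MEM_CAP_MI = 2048   # "2G global pod limit" from the log
POD_CPU_CAP_M = 3000    # "3 CPUs"

def fits(main_mem_mi: int, main_cpu_m: int,
         sidecar_mem_mi: int = 500, sidecar_cpu_m: int = 200) -> bool:
    """True if the main container plus the TLS sidecar stays under the pod caps."""
    return (main_mem_mi + sidecar_mem_mi <= POD_MEM_CAP_MI
            and main_cpu_m + sidecar_cpu_m <= POD_CPU_CAP_M)

print(fits(1900, 2000))  # False: 1900Mi + the 500Mi sidecar exceeds the 2G cap
print(fits(1500, 2000))  # True: the 1500Mi value settled on in the log fits
```

This is why 1900Mi was dropped to 1500Mi for eventstreams: 1900Mi plus the sidecar's 500Mi overshoots the per-pod cap, while 1500Mi + 500Mi (and 2000m + sidecar CPU under 3 CPUs) stays inside it.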
[18:17:50] 10serviceops, 10Operations, 10Performance-Team, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Krinkle)
[18:18:21] 10serviceops, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar), 10Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Krinkle) a:05Krinkle→03None
[18:18:25] 10serviceops, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar), 10Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Krinkle)
[19:48:12] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Ottomata) @Joe @akosiaris all deployments of eventgate and eventstreams hav...
[20:19:15] do we need more jobrunners or are we ok and use all new servers for appserver/API
[20:29:40] 10serviceops, 10Operations, 10Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 27 host(s) and their services with reason: new_install ` mw[235...
[21:29:31] 10serviceops, 10Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 27 host(s) and their services with reason: new_install ` mw[2350-2376].codfw.wmnet `
[23:11:38] 10serviceops, 10Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 6 host(s) and their services with reason: new_install ` mw[2366,2368,2370,2372,2374,2...