[06:59:17] <_joe_> akosiaris: any reason why mobileapps has 8 pods in staging?
[07:02:17] <_joe_> ah lol, my bad
[07:03:14] serviceops, Operations, Performance-Team, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (jcrespo) Save times keep being back to previous levels, approximately. For historical purposes, was something mo...
[07:26:16] serviceops, Operations, Performance-Team, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (elukey) Everything started by this graph: https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?viewPanel=41&orgI...
[08:03:30] FYI, I'm going to reboot mwmaint1002 (since it's currently inactive) in a bit
[08:58:30] serviceops, Operations, Performance-Team, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (Joe) We have moved restbase-async to eqiad again, as the load was still too high. We might have to consider expa...
[09:27:06] <_joe_> hnowlan: FYI, I'm preparing the migration of echostore, so you have a blueprint for sessionstore
[09:28:11] serviceops, MediaWiki-General, Operations, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[09:29:44] _joe_: cool
[09:33:12] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventstreams to use TLS only - https://phabricator.wikimedia.org/T255874 (JMeybohm)
[09:33:33] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventstreams to use TLS only - https://phabricator.wikimedia.org/T255874 (JMeybohm) Open→Resolved a:JMeybohm All done here.
[09:33:35] serviceops, Operations, Prod-Kubernetes, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[09:34:47] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm)
[09:35:23] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (JMeybohm)
[09:36:47] thanks joe for the comment, the other ticket wasn't clear for outsiders like me if one or the 2 mentioned actionables were done. It is clear to me now
[09:56:47] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) @Ottomata We would like the HTTP services to be decommissioned from Kubernetes after a service is switched to TLS only in LVS but I think t...
[09:58:51] serviceops, MediaWiki-General, Operations, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[10:13:58] serviceops, Product-Infrastructure-Team-Backlog, Recommendation-API, Release-Engineering-Team, and 2 others: Migrate recommendation-api to kubernetes - https://phabricator.wikimedia.org/T241230 (hashar) hi, is that still worked on? Asking cause CI still has to maintain a Jessie based image / Nod...
[11:57:48] _joe_: do you have a second for an apache mystery?
[11:59:00] <_joe_> hnowlan: in a couple of hours. What's the mystery task?
[12:04:16] <_joe_> if there is no task, it's ok to describe it here :)
[12:04:55] Currently api.wikimedia.org redirects to the foundation wiki, as with other unconfigured wikis.
I'm trying to get the apache config in place (the wiki has already been created), but no matter what I try I just get directed to foundationwiki. The Apache config looks good but I could easily be missing something.
[12:06:20] it's complicated by having the api-gateway Envoy in front, but the same behaviours can be seen when using curl and httpbb
[12:07:13] I've been debugging for a while so I'm gonna revert for now and get puppet reenabled
[12:08:49] the actual task is T246945 but there's not much info in it
[12:09:57] short-term, if anyone has a sec to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/623894 I'll back out
[12:10:36] <_joe_> uh wait
[12:10:42] <_joe_> don't revert
[12:10:50] <_joe_> is the patch applied everywhere?
[12:11:24] no
[12:11:27] only mwdebug2001
[12:11:34] <_joe_> ok so if I had to guess
[12:11:52] <_joe_> the problem is the positioning of the include
[12:12:06] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/623833/2/modules/mediawiki/files/apache/sites/wikimedia.conf
[12:12:17] <_joe_> try to move the include by hand at the start of that file
[12:12:23] ohhh heh
[12:12:39] <_joe_> probably just one place higher is enough
[12:12:47] <_joe_> but this is just a hunch
[12:13:32] <_joe_> I kinda remember www.wikimedia.org.conf to have some catchall
[12:16:11] omg
[12:16:13] that's it
[12:16:14] nice one <3
[12:16:52] <_joe_> hnowlan: lol
[12:16:54] <_joe_> # FIXME: Should this still be here?
[12:16:56] <_joe_> ServerAlias *.wikimedia.org
[12:16:58] <_joe_> ahahahah
[12:17:20] <_joe_> odules/mediawiki/templates/apache/sites/included/www.wikimedia.org.conf.erb
[12:17:25] <_joe_> modules
[12:17:32] hah
[12:17:46] <_joe_> welcome to apache land
[12:17:58] <_joe_> where sanity has left the building, screaming
[12:18:26] <_joe_> and well, lucky you - this is the MUCH improved version of the insanity
[12:19:06] :D
[12:19:17] if you have a sec https://gerrit.wikimedia.org/r/624037
[12:20:25] <_joe_> done
[12:20:40] <_joe_> jeez, there are so many traps in the apache configs
[12:22:02] that's the second one I hit on this particular journey - I assumed adding a site to the other_sites list would be enough. I might template that file after this tbh
[12:29:27] alright, tested on another appserver and looks okay. Gonna enable puppet
[12:33:06] <_joe_> what file?
[12:33:26] wikimedia.conf - it contains all the entries from other_sites
[12:33:55] <_joe_> oh ok. Sure be my guest. Actually do it with all of them
[12:34:22] <_joe_> basically when I got to that point of refactoring I was so disgusted by apache that I gave up
[12:35:01] hah, understandable
[12:35:21] <_joe_> this is the horror we had in 2014
[12:35:23] <_joe_> https://gerrit.wikimedia.org/r/plugins/gitiles/operations/apache-config/+/91f78f4e5db66af4f7ef1b487b7d789a1fab81f1
[12:36:05] hehhh
[12:37:25] <_joe_> but my point is - there is plenty of room for improvements
[12:38:27] I think once you get to a webserver config of sufficient size it's more or less an infinite task
[13:01:53] I'll be about two minutes, sorry, on my way
[13:34:31] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) Sure! Although I have to admit I don't know what this means. It already runs envoyproxy as a sidecar for TLS. Is this so that its outgoin...
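Note on the Apache mystery above (12:04-12:17): the fix hinged on include order, because a vhost carrying `ServerAlias *.wikimedia.org` was read before the new api.wikimedia.org vhost. The Python below is only a rough model of first-match, name-based vhost selection, not Apache's actual implementation; the `select_vhost` helper and the dictionary layout are made up for illustration, and what each vhost serves afterwards is not modelled.

```python
from fnmatch import fnmatchcase


def select_vhost(vhosts, host):
    """Return the ServerName of the first vhost (in config order) that matches host.

    Simplified model: a vhost matches if the host equals its ServerName or
    matches one of its ServerAlias patterns (wildcards allowed).
    """
    host = host.lower()
    for vhost in vhosts:
        patterns = [vhost["server_name"], *vhost.get("aliases", [])]
        if any(fnmatchcase(host, pattern.lower()) for pattern in patterns):
            return vhost["server_name"]
    # Nothing matched: the first vhost acts as the default.
    return vhosts[0]["server_name"] if vhosts else None


# The catch-all from www.wikimedia.org.conf, as quoted in the log.
catch_all = {"server_name": "www.wikimedia.org", "aliases": ["*.wikimedia.org"]}
# The new site being added.
api = {"server_name": "api.wikimedia.org"}

# Include placed after the catch-all: the wildcard alias wins.
print(select_vhost([catch_all, api], "api.wikimedia.org"))
# Include moved one place higher: the specific vhost wins.
print(select_vhost([api, catch_all], "api.wikimedia.org"))
```

With the catch-all first, the request for api.wikimedia.org never reaches the new vhost; moving the include above it is enough, which matches the "probably just one place higher" hunch.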
[13:41:55] serviceops, Analytics, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) a:Ottomata
[14:25:19] hello! i have a question I think I asked months ago and got an answer...but i forget and did not write it down
[14:25:55] i would like to use puppet to render a config with lvs service urls
[14:26:06] i'd like to keep it DRY rather than manually maintaining this list
[14:26:56] basically this
[14:26:56] https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia-event-utilities/+/refs/heads/master/eventutilities/src/main/java/org/wikimedia/eventutilities/core/event/WikimediaDefaults.java#26
[14:27:04] I'd like to use puppet to render a simple key val yaml file
[14:27:31] is that info somehow discoverable or referencable in puppet?
[14:28:59] hm just found
[14:28:59] wmflib::service::get_url
[14:29:00] is that it?
[14:29:36] oh that only returns the discovery one
[14:32:54] ottomata: there's a bunch of wmflib functions for getting service catalog data
[14:33:20] you should be able to get what you need
[14:34:37] hmmm
[14:34:45] I don't think there's an existing function that does what you want
[14:35:00] i guess i can use fetch
[14:35:04] and just build what I need?
[14:35:09] maybe fetch + get_url?
[14:35:30] or i could add a param to get_url to allow for returning a svc lvs name instead of always discovery?
[14:35:41] you want a particular site name?
[14:35:45] can I ask why?
[14:36:08] ya, we want to be able to differentiate between when a kafka topic has no data, and when the pipeline is broken for some reason
[14:36:27] eventgate produces into DC prefixed topics
[14:36:32] we want to emit canary events into the pipeline
[14:36:57] so we need to send events to eventgate in all DCs explicitly
[14:36:58] mmmm okay
[14:37:46] in hadoop this is especially useful, since we use the presence of hourly data to trigger launches of jobs
[14:38:15] if some topic (like eqiad topics during DC switchover) don't have data all of the sudden for normal reasons, jobs might not launch, or we might just get alerts about failing to import data
[14:38:15] makes sense
[14:38:33] so we're just going to emit one canary event per hour per stream per dc
[14:38:48] anyway I think you could modify get_url
[14:38:51] and we build the canary events with stream config + the schema examples..
[14:38:54] ya?
[14:39:06] i guess that would only work for the non proxy listeners case?
[14:39:22] yeah, you could also write another function
[14:39:24] that was similar
[14:39:26] hm
[14:39:44] the existing ones are simple shims written at the time to adapt the existing references in puppet that weren't hardcoded
[14:39:58] Q,
[14:40:02] is https://phabricator.wikimedia.org/T235411 relevant?
[14:40:15] I could imagine splitting out the $listeners == undef case into a separate function
[14:40:17] are all services eventually only going to be addressable by some different envoyproxy url?
[14:40:31] Eventually(tm) I think that is the plan, yes
[14:40:43] and i guess that url will only work with discovery addies?
[14:40:50] j.oe has been talking about writing an xDS control plane plugin, even
[14:41:16] we could probably add something to support talking directly to one DC
[14:41:22] it seems reasonable for monitoring-y purposes
[14:43:57] hmm it seems the .svc names are only defined in service::catalog for monitoring
[14:44:38] can I assume that the service name always results in $service_name.svc.$site.wmnet lvs urls?
[14:50:07] I think so
[14:50:45] there's nothing that actually enforces that AFAIK but it is the documented convention
[14:51:05] we do have discovery records where the dns differs from the etcd name, but probably unrelated for the scope you're looking for
[14:51:28] (things like appservers-ro for example)
[14:51:54] hm interesting, ya i guess if i can solve it for the main case that'll do for me
[14:52:56] volans: yeah and those are things we should fix eventually, IMO
[14:53:12] but you know, typical problems in going from many/no sources of truth to one :)
[14:54:03] eheheh don't get me started
[14:54:57] ottomata: btw joe or alex are the best people to review that
[14:55:05] ya ok, i'll add them
[15:17:47] serviceops, Analytics, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) >>! In T255871#6433346, @Ottomata wrote: > Sure! Although I have to admit I don't know what this means. It already runs env...
[15:29:40] serviceops, Analytics, Operations, Prod-Kubernetes, Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) OH! yes...there was a reason we left HTTP on...I think it was before MW was using a local envoyproxy to do TLS, because PHP...
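Note on the 14:25-14:55 discussion: the goal was a simple key/value YAML file of per-DC LVS URLs, built from the documented (but, per the log, unenforced) `$service_name.svc.$site.wmnet` convention instead of a hand-maintained list. The Python below is only a sketch of that idea and is not the Puppet function that was being discussed; `svc_url`, `render_yaml`, the `https` scheme, the port number and the `codfw` site name are all assumptions made for illustration.

```python
def svc_url(service, site, port, scheme="https"):
    """Build a per-DC LVS URL from the $service_name.svc.$site.wmnet convention."""
    return f"{scheme}://{service}.svc.{site}.wmnet:{port}"


def render_yaml(services, sites):
    """Render {service: port} as a small YAML document with one URL per site."""
    lines = []
    for service, port in sorted(services.items()):
        lines.append(f"{service}:")
        for site in sites:
            # Values are quoted so the embedded colons stay unambiguous.
            lines.append(f'  {site}: "{svc_url(service, site, port)}"')
    return "\n".join(lines) + "\n"


# Placeholder inputs: the port is invented, and "codfw" is assumed as a second site.
services = {"eventgate-analytics-external": 4692}
sites = ["eqiad", "codfw"]

print(render_yaml(services, sites), end="")
```

Services whose DNS name differs from their catalog name (the appservers-ro case mentioned at 14:51) would need an explicit override, which this sketch does not handle.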
[15:59:17] serviceops, Operations, Performance-Team, Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (Krinkle) Open→Resolved Looks good to me now: [Grafana: Save Timing](https://grafana.wikimedia.org/d/000...
[16:32:07] serviceops, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (Cmjohnson)
[16:32:41] serviceops, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (Cmjohnson) Open→Resolved Completed
[16:48:53] serviceops, Performance-Team, Platform Engineering, Wikimedia-Rdbms, Sustainability (MediaWiki-MultiDC): Determine multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (Krinkle) From the Multi-DC sync meeting on 31 Aug 2020. Attending: Gilles D, Timo T, Even E...
[16:56:25] another one of those elusive helm CI failures https://integration.wikimedia.org/ci/job/helm-lint/2369/console
[17:01:01] thanks hnowlan, noted at https://phabricator.wikimedia.org/T261313 :-(
[17:13:06] oh wow, I hadn't seen the functionality in /etc/helmfile-defaults. that's *great*, much more sensible than overloading the secrets stuff
[17:36:00] serviceops, OTRS, Operations, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) There don't seem to have been any actionable comments regarding the test installation, either in phabricator or OTRS Cafe. In th...
[18:20:15] serviceops, OTRS, Operations, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (NoFWDaddress) Looks good. Thank you Akosiaris for all your work.
It might be worth it to notify the various OTRS mailing lists of the downti...
[21:14:12] serviceops, OTRS, Operations, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (Johan) Included in https://meta.wikimedia.org/wiki/Tech/News/2020/37 going out on Monday.