[08:47:48] _joe_: are we doing a team meeting today? I guess it's a WMF holiday so we'll be short at least one person, but there's no one going to Wikimania from our group so the rest of us should be available, right?
[08:48:20] yup
[08:50:27] <_joe_> apergos: I think it's a bit pointless in the absence of the in-person SRE meeting later in the day
[08:51:06] <_joe_> fsero: so, I was taking a look at the helmfile data
[08:51:57] <_joe_> and in helmfile.d/services/codfw/mathoid/values.yaml for instance I saw "1.1.1.1" used as externalIP
[08:52:00] it's for us though, not just for reporting to the larger team. but if you'd rather not, I can certainly put that 30 minutes to good use
[08:53:56] _joe_: the values there are the values that were stored in scap helm, so either that was a typo from staging values
[08:54:10] Or something is odd
[09:02:37] <_joe_> fsero: externalIP is probably useless there too
[09:02:42] <_joe_> given we're using nodeport
[09:05:27] <_joe_> fsero: I'll have more questions shortly :P
[09:07:25] yeah i think its a remainder of some tests
[09:22:36] <_joe_> fsero: where is .hfenv coming from?
[09:22:56] from a template in puppet
[09:23:14] modules/profile/templates/kubernetes/.hfenv.erb
[09:23:40] <_joe_> ok
[09:24:51] <_joe_> fsero: we have a newer mathoid in codfw than in eqiad
[09:24:56] <_joe_> any idea why?
[09:25:15] eqiad is not managed via helmfile yet
[09:25:52] maybe version in code was updated for staging? and moved towards codfw?
[09:25:59] <_joe_> - chart: mathoid-0.0.20
[09:26:01] <_joe_> + chart: mathoid-0.0.23
[09:26:05] <_joe_> just the chart
[09:26:19] <_joe_> so the image is the same
[09:26:38] <_joe_> just the label of the configmap
[09:26:47] thats for trying to fix the quota i guess
[09:27:00] <_joe_> yeah probably
[09:27:20] <_joe_> fsero: is the traffic restored to codfw or not?
I didn't check last week
[09:28:01] services are pooled
[09:28:17] but caching layer is still not sending traffic to codfw
[09:28:35] however this only affects blubber and cxserver
[09:28:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/528409
[09:28:38] <_joe_> ok
[09:28:44] <_joe_> we should merge that today
[09:28:54] <_joe_> giving a shoutout to traffic
[09:29:22] <_joe_> anyways, I'm looking at initialize_cluster.sh
[09:29:32] <_joe_> this is for initializing the namespace for a new service?
[09:29:45] no
[09:30:02] this is for initializing a cluster to be managed via helmfile
[09:30:17] <_joe_> NAMESPACE=$1
[09:30:18] for a new namespace look under admin/cluster/namespace
[09:30:36] <_joe_> I was wondering what the namespace name should be there
[09:30:48] its kube-system right now
[09:31:09] where the admin tiller is installed and all the helm releases
[09:31:17] <_joe_> ok
[09:31:44] <_joe_> so if you're installing a new cluster you need to do
[09:32:05] <_joe_> initialize_cluster.sh kube-system $host $port
[09:32:28] yep, if you look at that script what essentially does is creating some roles and setting up tiller
[09:33:05] <_joe_> cluster-helmfile.sh is evil :P
[09:33:13] <_joe_> I like evil
[09:36:10] <_joe_> ok so when I want to add a new service, I need to create a values file in admin/$dc/values/, run cluster-helmfile.sh
[09:36:49] yep, that should create the namespace and the "environment"
[09:37:13] then for the actual service you need to create a services/cluster helmfile.yaml where you define the app
[09:37:23] and made some puppet changes like now
[09:38:11] the puppet changes are required to maintain that hfenv madness and for secrets et al
[09:38:23] <_joe_> ok
[09:38:32] <_joe_> have you written all this somewhere?
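The steps for adding a new service, as described in the exchange above, can be sketched as a dry-run checklist. This is only an illustration, not the actual tooling: the function name `new_service_steps` is hypothetical, it never invokes `cluster-helmfile.sh` (whose arguments are not given in the chat), and the per-service helmfile path is an assumption modeled on the `helmfile.d/services/codfw/mathoid/values.yaml` path mentioned earlier.

```shell
#!/bin/sh
# Dry-run checklist for onboarding a service, following the chat above.
# Nothing here runs the real scripts; it only prints the steps in order.
# ASSUMPTION: the helmfile.d/services/<dc>/<service>/ location is inferred
# from the mathoid values.yaml path quoted in the conversation.
new_service_steps() {
    dc="$1"        # datacenter, e.g. codfw
    service="$2"   # service name, e.g. mathoid
    if [ -z "$dc" ] || [ -z "$service" ]; then
        echo "usage: new_service_steps <dc> <service>" >&2
        return 1
    fi
    echo "1. add a values file for ${service} under admin/${dc}/values/"
    echo "2. run cluster-helmfile.sh to create the namespace and environment"
    echo "3. define the app in a helmfile.yaml under helmfile.d/services/${dc}/${service}/"
    echo "4. update puppet for .hfenv and secrets"
}

# Example: print the checklist for mathoid in codfw.
new_service_steps codfw mathoid
```

Printing the steps instead of executing them keeps the sketch safe to run anywhere; the real invocation details would come from the documentation fsero says is in the works.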
[09:40:48] <_joe_> even if not polished, please dump it on wikitech somewhere, I can sort through it
[09:41:59] its on the works
[09:42:04] :)
[09:44:04] <_joe_> fsero: for reconstructing eqiad, we need to ensure that switching eventgate from the mediawiki active dc to the other is seamless
[09:45:12] well otto told me that some minutes of downtime are acceptable
[09:45:28] and giving how we reconstruct the cluster which is deleting a namespace one by one and applying one by one
[09:45:41] eventgate itself should be down only for a few minutes
[09:45:45] <_joe_> yeah but it should be switchable :)
[09:45:46] but lets look into it
[09:46:01] <_joe_> I'm testing a few commands
[09:46:29] <_joe_> we're still not collecting stdout from the containers to logstash, right?
[09:46:41] nop
[09:46:41] <_joe_> I don't remember where are we on that
[09:46:49] <_joe_> ok
[09:47:01] there is a rsyslog plugin that should send things to kafka
[09:47:19] i think all the pieces are ready we are just not put the toggle on
[09:47:30] <_joe_> so we're sending stdout => rsyslog already?
[09:47:44] <_joe_> then it's pretty easy to ingest that in logstash
[09:49:03] https://phabricator.wikimedia.org/T207200
[09:50:29] <_joe_> uhm I see that the idea was to use fluentbit afterall?
[09:50:39] <_joe_> I'm confused, I was convinced we went the other way
[09:50:44] <_joe_> (using rsyslog)
[09:51:03] <_joe_> anyways, I'll try to help alex with these things (including tls termination)
[09:52:13] <_joe_> fsero: so the private/ dir containing secrets is populated by puppet I gather
[09:53:51] <_joe_> (It's so sad to have certs written inside yaml files, so that they're loaded in memory and then written to "disk")
[09:55:08] <_joe_> ok I think I have a grasp of how helmfile is organized and how to do basic operations on it
[09:56:32] _joe_: no the idea is rsyslog for sure
[09:56:49] <_joe_> please work on writing down what you can in the next two days though :P
[09:57:03] it is populated by puppet, and i agree is sad
[09:57:25] <_joe_> puppet automates a small part of that misery at least
[09:57:25] specially because that yaml is going to become unmanageable soon
[09:57:37] <_joe_> which yaml?
[09:57:57] the one that keeps the certs under the private repo
[09:58:10] in role/common/deployment
[09:58:27] <_joe_> uhm I think that can be managed better, in puppet terms
[09:58:32] <_joe_> I'll look into it :)
[09:58:55] <_joe_> well not now, it's good to have something working as expected, now we can make it not-miserable to manage on our side
[09:59:10] <_joe_> but I was more interested in giving the devs a tool they can use with confidence
[09:59:16] <_joe_> and helmfile is that I think
[09:59:16] the one that keeps the certs under the private repo
[09:59:21] <_joe_> it won't scare them too much
[09:59:29] yeah is working fine for them
[09:59:37] im talking more about our side
[09:59:41] <_joe_> yeah
[09:59:45] that needs rework
[09:59:54] <_joe_> indeed
[13:41:07] serviceops, Operations, PHP 7.2 support, Patch-For-Review: Socket Errors on PHP7 - https://phabricator.wikimedia.org/T224538 (jijiki) Open→Resolved {F30008369} Fixed!
[13:45:28] <_joe_> jijiki: \o/
[13:45:47] haha
[14:53:49] apergos: should we skip today's meeting ?
[14:53:55] it is just me you and joe
[14:54:04] and me
[14:54:11] hehe
[14:54:21] _joe_ had already said he didn't think there was much point
[14:54:25] given there's no larger team meetin
[14:54:26] g
[14:54:29] I agree with joe
[14:55:47] I am declining
[14:55:59] fsero: you were mostly ill last week, yes?
[14:56:18] yeah and in my 2 days mark
[14:56:19] so
[14:56:25] im writing docs mostly
[14:56:40] jijiki: nice catch btw on T224538
[14:57:05] sigh it was silly really
[14:57:41] I had made the stupid deduction that those errors don't come from lo
[15:00:26] <_joe_> jijiki: it's the kind of "silliness" that can defeat detection for months :P
[15:02:26] :/
[15:18:44] serviceops, Operations, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (jijiki)
[15:18:59] serviceops, Operations, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (jijiki)
[15:26:03] serviceops, MediaWiki-extensions-Mailgun, Operations, cloud-services-team, and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (jijiki) @Dzahn Thank you! We'll see how things are since we now have merged https://gerrit.wikimedia.org/r/425027, and...
[15:26:24] serviceops, MediaWiki-extensions-Mailgun, Operations, cloud-services-team, and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (jijiki)